When is a measurement considered reliable?
When researchers measure a construct that they assume to be consistent across time, the scores they obtain should also be consistent across time. Test-retest reliability is the extent to which this is actually the case.
For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.
Assessing test-retest reliability requires using the measure on a group of people at one time, using it again on the same group of people at a later time, and then looking at the test-retest correlation between the two sets of scores. Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions.
But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.
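In practice, the computation is a simple correlation between the two measurement occasions. Here is a minimal sketch in Python, with invented scores purely for illustration; any statistics package would do the same job:

```python
# Illustrative only: computing a test-retest correlation with NumPy.
import numpy as np

# Hypothetical self-esteem scores for the same ten people, measured two weeks apart.
time1 = np.array([22, 18, 25, 30, 15, 27, 20, 24, 19, 28])
time2 = np.array([23, 17, 26, 29, 16, 25, 21, 24, 18, 27])

# The test-retest coefficient is the Pearson correlation between the two occasions.
r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest correlation: r = {r:.2f}")
```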
A second kind of reliability is internal consistency, which is the consistency of people's responses across the items on a multi-item measure. On the Rosenberg Self-Esteem Scale, for example, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. This is as true for behavioural and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking; the measure would be internally consistent if the sizes of their bets were consistently high or consistently low across trials. Like test-retest reliability, internal consistency can only be assessed by collecting and analyzing data.
One approach is to look at a split-half correlation. This involves splitting the items into two sets, such as the first and second halves of the items or the even- and odd-numbered items. Then a score is computed for each set of items, and the relationship between the two sets of scores is examined.
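As a minimal illustration of this procedure, the sketch below computes a split-half correlation in Python, assuming the item responses are stored as a respondents-by-items array; the data are simulated, not real scale scores:

```python
# Illustrative only: a split-half correlation for a ten-item measure,
# using an odd/even split of simulated item responses.
import numpy as np

rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(50, 10))  # 50 respondents, 10 items scored 1-5

odd_total = responses[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7, 9
even_total = responses[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8, 10

# For an internally consistent measure, the two half-scores should correlate
# strongly; purely random responses like these will correlate near zero.
r = np.corrcoef(odd_total, even_total)[0, 1]
print(f"Split-half correlation: r = {r:.2f}")
```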
There are, however, many ways to split a set of 10 items into two sets of five (an odd/even split is only one of them), and different splits can yield somewhat different correlations.

Many behavioural measures involve significant judgment on the part of an observer or a rater. Inter-rater reliability is the extent to which different observers are consistent in their judgments.
Validity is the extent to which the scores from a measure represent the variable they are intended to measure. But how do researchers make this judgment? We have already considered one factor that they take into account: reliability.
When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever.
Imagine, for example, trying to measure people's self-esteem by measuring the length of their index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. Here we consider three basic kinds of validity: face validity, content validity, and criterion validity. Face validity is the extent to which a measurement method appears, on its face, to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities.
So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively (for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to), it is usually assessed informally.

In order to determine whether your measurements are reliable and valid, you must look for sources of error.
There are two types of error that may affect your measurement: random and non-random. Random error consists of chance factors that affect the measurement; the more random error, the less reliable the instrument. Suppose, for example, that a patient's blood pressure is measured twice with the same cuff. The type of reliability assessed in this example is retest reliability: the correlation between the two sets of readings, called the coefficient of stability. It is expressed as a correlation coefficient r that ranges from 0 to 1; the closer to 1, the more reliable the measurement. Non-random error, by contrast, is systematic.
If the blood pressure cuff always reads high, then this non-random error affects all of the measurements. Non-random error affects the validity of the instrument. The type of validity assessed in this example is construct validity: whether the blood pressure cuff measures the construct as it is defined in the literature.

For multi-item measures, a common empirical technique for examining construct validity is exploratory factor analysis, which reduces a large set of measurement items to a smaller number of underlying factors. These factors should ideally correspond to the underlying theoretical constructs that we are trying to measure.
The general norm for factor extraction is that each extracted factor should have an eigenvalue greater than 1. The extracted factors can then be rotated using orthogonal or oblique rotation techniques, depending on whether the underlying constructs are expected to be relatively uncorrelated or correlated, to generate factor weights that can be used to aggregate the individual items of each construct into a composite measure.
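The sketch below illustrates these two steps, extraction using the eigenvalue-greater-than-1 rule and an orthogonal (varimax) rotation, using NumPy and scikit-learn on simulated responses for two hypothetical constructs of three items each; the data and library choices are assumptions for illustration, not a prescribed procedure:

```python
# Illustrative only: factor extraction with the eigenvalue-greater-than-1 rule
# and a varimax rotation, on simulated data for two constructs of three items each.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n = 200
f1, f2 = rng.normal(size=n), rng.normal(size=n)   # two simulated latent constructs
items = np.column_stack([f1 + 0.4 * rng.normal(size=n) for _ in range(3)] +
                        [f2 + 0.4 * rng.normal(size=n) for _ in range(3)])

# Kaiser criterion: retain factors whose eigenvalues (of the item correlation
# matrix) exceed 1.
eigenvalues = np.sort(np.linalg.eigvalsh(np.corrcoef(items, rowvar=False)))[::-1]
n_factors = int((eigenvalues > 1).sum())
print("Eigenvalues:", np.round(eigenvalues, 2), "-> retain", n_factors, "factor(s)")

# Extract the retained factors with an orthogonal (varimax) rotation and
# inspect how strongly each item loads on each factor.
fa = FactorAnalysis(n_components=n_factors, rotation="varimax").fit(items)
print("Rotated loadings (factors x items):")
print(np.round(fa.components_, 2))
```

With data like these, items 1-3 should load strongly on one rotated factor and items 4-6 on the other, which is the pattern described next.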
For adequate convergent validity, items belonging to a common construct are expected to exhibit high factor loadings on a single shared factor, while adequate discriminant validity requires that those items load only weakly on the other factors. A more sophisticated technique for evaluating convergent and discriminant validity is the multi-trait multi-method (MTMM) approach. This technique requires measuring each construct (trait) using two or more different methods. It is an onerous and relatively less popular approach, and is therefore not discussed here.
Criterion-related validity can also be assessed based on whether a given measure relates well with a current or future criterion, which are respectively called concurrent and predictive validity. Predictive validity is the degree to which a measure successfully predicts a future outcome that it is theoretically expected to predict.
For instance, can standardized test scores taken before college predict students' subsequent academic performance? Concurrent validity examines how well one measure relates to another concrete criterion that is presumed to occur at the same time. For instance, do students' scores on one mathematics test relate well to their scores on another mathematics test taken during the same period? These scores should be related concurrently because they are both tests of mathematics. Unlike convergent and discriminant validity, concurrent and predictive validity are frequently ignored in empirical social science research.

Now that we know the different kinds of reliability and validity, let us try to synthesize our understanding of reliability and validity in a mathematical manner using classical test theory, also called true score theory.
This is a psychometric theory that examines how measurement works, what it measures, and what it does not measure. The theory postulates that every observation has a true score T that could be observed accurately if there were no errors in measurement. However, the presence of measurement error E results in a deviation of the observed score X from the true score as follows:

X = T + E

Across a set of observed scores, the variances of the observed and true scores can be related using a similar equation (assuming the errors are uncorrelated with the true scores):

var(X) = var(T) + var(E)
The goal of psychometric analysis is to estimate and, if possible, minimize the error variance var(E), so that the observed score X is a good measure of the true score T. Measurement errors can be of two types: random error and systematic error.
Random error is the error that can be attributed to a set of unknown and uncontrollable external factors that randomly influence some observations but not others. As an example, during the time of measurement, some respondents may be in a nicer mood than others, which may influence how they respond to the measurement items.
For instance, respondents in a nicer mood may respond more positively to constructs like self-esteem, satisfaction, and happiness than those who are in a poor mood. However, it is not possible to anticipate which subject is in what type of mood or control for the effect of mood in research studies. Likewise, at an organizational level, if we are measuring firm performance, regulatory or environmental changes may affect the performance of some firms in an observed sample but not others.
Systematic error is error introduced by factors that affect all observations of a construct across an entire sample in a systematic manner. In our previous example of firm performance, the recent financial crisis impacted the performance of financial firms disproportionately more than other types of firms, such as manufacturing or service firms. If our sample consisted only of financial firms, we could therefore expect a systematic reduction in the performance of all firms in our sample due to the financial crisis.
Unlike random error, which may be positive, negative, or zero across observations in a sample, systematic error tends to be consistently positive or negative across the entire sample. Since an observed score may include both random error (Er) and systematic error (Es), the true score equation can be modified as:

X = T + Er + Es
The statistical impact of these errors is that random error adds variability (e.g., standard deviation) to the distribution of an observed measure without affecting its central tendency (e.g., mean), whereas systematic error shifts the central tendency without adding variability. What do random and systematic errors imply for measurement procedures?
By increasing variability in observations, random error reduces the reliability of measurement. In contrast, by shifting the central tendency measure, systematic error reduces the validity of measurement.
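A small simulation sketch in Python makes both effects concrete; all numbers are invented, and the var(T)/var(X) ratio it reports anticipates the reliability expression given below.

```python
# Illustrative only: simulating classical test theory, X = T + E.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
T = rng.normal(loc=50, scale=10, size=n)       # true scores
E_random = rng.normal(loc=0, scale=5, size=n)  # random error (mean zero)
E_systematic = 4.0                             # systematic error (constant bias)

X = T + E_random + E_systematic                # observed scores

# Random error inflates the variance of X, lowering the ratio var(T)/var(X).
print(f"var(T)/var(X) ~= {T.var() / X.var():.2f}")

# Systematic error shifts the central tendency (mean) of X but not its spread,
# which is a validity problem rather than a reliability problem.
print(f"Mean shift, mean(X) - mean(T) ~= {X.mean() - T.mean():.2f}")
```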
Validity concerns are far more serious problems in measurement than reliability concerns, because an invalid measure is probably measuring a different construct than what we intended, and hence validity problems cast serious doubts on findings derived from statistical analysis. Note that reliability is a ratio or a fraction that captures how close the true score is relative to the observed score.
Hence, reliability can be expressed as:

reliability = var(T) / var(X) = var(T) / [var(T) + var(E)]

A complete and adequate assessment of validity must include both theoretical and empirical approaches. As shown in Figure 7, the integrated approach starts in the theoretical realm. The first step is conceptualizing the constructs of interest.
Next, we select or create items or indicators for each construct based on our conceptualization of these constructs, as described in the scaling procedure in Chapter 5. A literature review may also be helpful in indicator selection. Each item is reworded in a uniform manner using simple and easy-to-understand text. The item pool is then refined through a Q-sort analysis conducted by a panel of judges. In this analysis, each judge is given a list of all constructs with their conceptual definitions and a stack of index cards listing each indicator of each of the construct measures (one indicator per index card).
Judges are then asked to independently read each index card, examine the clarity, readability, and semantic meaning of that item, and sort it with the construct where it seems to make the most sense, based on the construct definitions provided. Inter-rater reliability is assessed to examine the extent to which the judges agreed on their classifications.
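One simple way to quantify this agreement is sketched below with hypothetical sorts from two judges; Cohen's kappa is only one of several possible agreement statistics, and the construct labels and data are invented for illustration.

```python
# Illustrative only: agreement between two judges' Q-sort classifications.
# Each entry is the construct ("A", "B", or "C") to which a judge sorted an item.
from sklearn.metrics import cohen_kappa_score

judge1 = ["A", "A", "B", "B", "C", "C", "A", "B", "C", "A"]
judge2 = ["A", "A", "B", "C", "C", "C", "A", "B", "B", "A"]

# Raw percent agreement, and Cohen's kappa, which corrects for chance agreement.
agreement = sum(a == b for a, b in zip(judge1, judge2)) / len(judge1)
kappa = cohen_kappa_score(judge1, judge2)
print(f"Percent agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```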
Ambiguous items that were consistently misclassified by many judges may be reexamined, reworded, or dropped. The best items for each construct are selected for further analysis. Each of the selected items is then reexamined by the judges for face validity and content validity. If an adequate set of items is not achieved at this stage, new items may have to be created based on the conceptual definition of the intended construct.
Two or three rounds of Q-sort may be needed to arrive at reasonable agreement between judges on a set of items that best represents the constructs of interest. Next, the validation procedure moves to the empirical realm.
A research instrument is created comprising all of the refined construct items, and is administered to a pilot test group of representative respondents from the target population.