For psychologists, reliability and validity serve as quality control criteria.
The question of reliability is a critical one to ask of all measures. Formally, reliability is “the degree to which scores are free from unsystematic error.” If scores on a test are free from unsystematic error, they’ll be consistent across related measurements. So, an easy way to think about reliability is the word “consistency.”
Note the word “unsystematic” in the definition. Scores on a measure may contain error but still be reliable. For example, if a thermometer is always 5 degrees low, it will be “reliable” (meaning consistent), but it’s inaccurate. Or, consider the case in which one professor is a harsh grader and another is a lenient grader. Each of them may be reliable, but one assigns grades that are systematically too low, and the other gives grades that are systematically too high.
Types of Reliability
There are four primary types of reliability that psychologists use to evaluate an assessment’s scores. They are as follows:
1. Test-retest reliability. This is a measure of stability, and it asks the question: How stable are scores on the measure or test over time? Do you get the same results (or at least very close to the same) on two separate occasions? This is computed numerically using a correlation coefficient. The thing to keep in mind with test-retest reliability is that it only makes sense if the trait being measured is supposed to be stable. Most variables that career counselors want to assess (e.g., interests, values, personality, ability) are quite stable on average, once people reach early adulthood. (Otherwise it would make little sense to measure these things and use them to inform career decisions that may have long-term implications.)
2. Alternative forms reliability. This applies only to instruments for which there are two or more separate forms that are intended to be equivalent. For example, the SAT and ACT each have multiple forms. This is a measure of equivalence, and it asks this question: Do the two different versions of this measure give the same results? Are the two versions essentially the same? To test this, a test developer would administer the two versions of the assessment to the same people, and then calculate the correlation between these two sets of scores.
3. Internal consistency reliability. This is a measure of consistency within the test. How consistent are scores for the items on this test? Do all the items fit together? Do they all measure the same thing? There are two main types of internal consistency reliability: split-half and coefficient alpha.
- Split-half. To find the split-half internal consistency reliability, you’d start by administering your measure’s items to a group of people. After everyone has taken it, you’d randomly separate the pool of items into two halves. You’d treat these halves as if they were each a separate measure of the construct. Then you’d calculate the correlation between the scores you obtain from each half of the test.
- Coefficient alpha. One problem with split-half reliability is that you may get a different reliability estimate for each way you split the measure in half. You could simply separate the even-numbered items and the odd-numbered items, but you could also take the first half and the second half, and your correlation may end up being different. Coefficient alpha is a reliability coefficient that can be thought of as the average of all possible split-half reliability coefficients. Conceptually, it is essentially the same as splitting the measure in half in as many different ways as possible, calculating the split-half reliability coefficient for each pair of halves, and then averaging all of those coefficients. In practice there is a formula that makes calculating this easy. Coefficient alpha can also be thought of as an index of the degree to which each item contributes to the total score.
4. Inter-rater reliability. Sometimes you measure things not with a paper and pencil measure that the participant completes, but by having other observers rate the participant’s behavior, such as in a behavioral assessment. Reliability is still important in this case, but it looks a little different. Now the question is whether the two (or more) raters agree with each other.
To establish evidence for inter-rater reliability, you would have two raters rate the same participants. Then you would calculate the relationship (for example, the correlation) between the two raters’ ratings. Once you have established evidence for inter-rater reliability, you can say with more confidence that the ratings were consistent.
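The coefficients described above can be sketched in a few lines of Python. This is only an illustration, not a standard implementation: the item scores are made up, and the function names (pearson_r, split_half_r, coefficient_alpha) are hypothetical helpers written for this example. The alpha function uses the common formula k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items.

```python
import random
from statistics import mean, pvariance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def split_half_r(item_scores, seed=0):
    """Correlate total scores on two randomly chosen halves of the items."""
    n_items = len(item_scores[0])
    order = list(range(n_items))
    random.Random(seed).shuffle(order)  # random split, fixed by the seed
    half_a, half_b = order[: n_items // 2], order[n_items // 2:]
    totals_a = [sum(row[i] for i in half_a) for row in item_scores]
    totals_b = [sum(row[i] for i in half_b) for row in item_scores]
    return pearson_r(totals_a, totals_b)


def coefficient_alpha(item_scores):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(item_scores[0])
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Made-up data: 5 people (rows) answering 4 items (columns) on a 1-5 scale.
items = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]

time1 = [sum(row) for row in items]
# A retest that reads 5 points low for everyone: systematic error only.
time2 = [score - 5 for score in time1]

test_retest = pearson_r(time1, time2)
split_half = split_half_r(items)
alpha = coefficient_alpha(items)
print(round(test_retest, 3), round(split_half, 3), round(alpha, 3))
```

Because the retest scores differ from the first scores by a constant, the test-retest correlation stays at 1: a systematic error like the thermometer that is always 5 degrees low does not lower reliability. The same pearson_r helper could be applied to two raters’ ratings of the same participants as one simple index of inter-rater reliability.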
Sometimes validity is referred to as addressing the question “does the test measure what it is designed to measure?” A broader and more appropriate way to think about validity is referring to it as addressing the question “does the test accomplish its intended purpose?” or “does the test meet the claims made for it?”
Types of Validity
There are three primary types of validity that psychologists use to evaluate an assessment’s scores. (The list below has four types, but the first doesn’t really count.) They are as follows:
1. Face validity. This really isn’t validity in a scientific sense. If a measurement instrument has face validity, it just means it looks like it measures what it’s supposed to measure. If you have a measure of interests and the items look like they are tapping into a person’s interests, you have face validity. This can be good if it builds rapport with the test-taker, but it doesn’t have any real scientific value. All things being equal, it’s good if you have it, but it doesn’t mean that the measure is accomplishing its intended purpose.
2. Content validity. Content validity refers to how well an assessment covers all relevant aspects of the domain it is supposed to measure. Conceptually, you could ensure content validity if you could write items to cover absolutely every detail about your construct. For example, if a test developer wanted to measure a particular style of leadership, that developer could write every single item she could possibly think of that would be relevant to that leadership style. Then she could take a random sample of those items and include them in the scale. Unfortunately, this is not only impractical, it is impossible.
Often, content validity is assessed by expert judgment. You could assess the content validity of a measure by having an expert or experts examine the items and determine whether the items are a good representation of the entire universe of items. If the leadership style measure described above has low content validity, it would probably include a lot of items that aren’t relevant to that style, and it would probably leave out items that are very relevant to that style.
3. Criterion-related validity. This refers to how well an assessment correlates with performance or whatever other criterion you’re interested in. It answers the question of “do scores on the measure allow us to infer information about performance on some criterion?” There are two types of criterion validity: concurrent and predictive.
- Concurrent validity. This refers to how well the test scores correlate with some criterion, when both measures are taken at the same time. For an interest inventory, for example, a good question is whether people who are, say, engineers score high on a scale designed to measure interest in engineering.
- Predictive validity. This refers to how well the test scores correlate with future criteria. For example, what percentage of people will end up in a career field down the road that corresponds to high scores on scales designed to measure a person’s interest in that field?
4. Construct validity. Construct validity most directly addresses the question of “does the test measure what it’s designed to measure?” It refers to how well the test assesses the underlying construct that is theorized. To demonstrate evidence of construct validity, a test developer would show that scores on her measure have a strong relationship with certain variables (the ones that are very similar to what is being measured) and a weak relationship with other variables (those that are conceptualized as being dissimilar to the construct being measured). There are two types: convergent and discriminant.
- Convergent validity. Convergent validity is the extent to which scores on a measure are related to scores on measures of the same or similar constructs. For example, let’s say your personality test has an extraversion scale. You might expect that the more extraverted a person is, the more likely she or he is to have high levels of sociability. If there is a strong positive relationship between scores on your extraversion measure and the scores on measures of sociability, then your scale’s scores have evidence of convergent validity.
- Discriminant validity. Support for discriminant validity is demonstrated by showing that an assessment does not measure something it is not intended to measure. For example, if you have an extraversion scale in your measure, you might consider also administering a measure of emotional stability. We know that extraversion and emotional stability are two different things. If you asked people to take your scale and a measure of emotional stability and you found a small correlation between scores on these two scales, you would have shown that your scale measures something other than emotional stability. Note that you haven’t shown what it does measure; you have just shown what it does not measure.
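The convergent and discriminant pattern described above can be checked with simple correlations. The sketch below is illustrative only: the scores are invented for six hypothetical test-takers, and pearson_r is a helper written for this example rather than a function from any particular testing package.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


# Invented scores for six people on three scales.
extraversion = [10, 14, 18, 22, 26, 30]
sociability = [12, 15, 17, 24, 25, 31]          # similar construct
emotional_stability = [20, 25, 18, 24, 19, 23]  # different construct

# Convergent evidence: a strong positive correlation with a similar construct.
r_convergent = pearson_r(extraversion, sociability)

# Discriminant evidence: a small correlation with a dissimilar construct.
r_discriminant = pearson_r(extraversion, emotional_stability)

print(f"convergent r = {r_convergent:.2f}")
print(f"discriminant r = {r_discriminant:.2f}")
```

With these made-up numbers, the extraversion scores track the sociability scores closely and are nearly unrelated to the emotional stability scores, which is the pattern a test developer would hope to see. The same kind of correlation underlies criterion-related validity, where the second variable is a criterion measured at the same time (concurrent) or later (predictive).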
NOTE: An instrument’s scores can be reliable but not valid. However, if an instrument’s scores are not reliable, there is no way that they can be valid.
Why does this matter? If you want to know how “good” a career assessment is, ask this question: “What is the evidence of reliability and validity?” A counselor should be able to answer this question, and an online assessment portal should provide some basic information (and links to more detailed information) to show that its scores provide good information. If no information about a particular test can be found, we’d encourage you to avoid that assessment instrument, or at the very least to view scores generated by that assessment in only the most tentative way.