This section discusses recommended standards for the reliability of scale scores, the relationship between reliability and validity, and strategies for increasing the reliability of scale scores.
Standards for the level of reliability
A widely cited heuristic for minimum reliability is 0.70, and Nunnally's (1978) textbook is often cited as the primary source for this standard. However, Nunnally actually said something rather different. He distinguished between the early stages of basic research (where he thought a reliability of 0.70 or higher was sufficient) and situations where important decisions are made about candidates (where "a reliability of 0.90 is the minimum ... and a reliability of 0.95 should be considered the desirable standard" (p. 246)). Nunnally's recommendation of a reliability of 0.95 for important decision-making was based on the concern that the costs of false-negative decision errors fall disproportionately on the candidates: given sufficient applicants, the organization simply hires a different candidate or admits a different student, but the rejected candidate loses an opportunity. Nunnally's recommendations may therefore be read as a recommendation for ethical decision-making via assessment. He did not, however, address the costs of attaining such high reliability. For example, if a 50-item test has a reliability of 0.80, then by the Spearman-Brown prophecy formula the test length required for a reliability of 0.95 is about 238 items (and the additional items must be comparable to the existing items). Nunnally did not address how an organization would fund a 238-item exam, how examinees would feel about sitting one, whether fatigue effects might depress candidate scores, or whether extremely long administration windows might disadvantage some classes of candidates.
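The 238-item figure can be checked with the Spearman-Brown prophecy formula, which gives the lengthening factor needed to move from a current reliability to a target reliability. A minimal sketch in Python (the function name is illustrative):

```python
from math import ceil

def required_length(n_items: int, rho: float, rho_target: float) -> int:
    """Spearman-Brown prophecy: number of items needed to raise
    reliability from rho to rho_target, assuming the added items
    are parallel to the existing ones."""
    k = (rho_target * (1 - rho)) / (rho * (1 - rho_target))
    return ceil(k * n_items)

print(required_length(50, 0.80, 0.95))  # -> 238
```

The lengthening factor here is (0.95 x 0.20) / (0.80 x 0.05) = 4.75, so 50 x 4.75 = 237.5, rounded up to 238 items.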
Trade-off with validity
A high reliability can conflict with content validity if a psychometrician removes items to maximize an estimate such as coefficient alpha without regard to the content of the remaining items. Likewise, asking essentially the same question in several slightly different ways inflates reliability while narrowing the content the scale covers, thereby damaging content validity.
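The practice in question is often driven by "alpha if item deleted" statistics: an item whose removal raises alpha is dropped, regardless of its content. A pure-Python sketch of the computation (function names are illustrative; population variances are used throughout):

```python
def cronbach_alpha(items):
    """Coefficient alpha. `items` is a list of per-item score lists,
    all over the same respondents."""
    k = len(items)
    n = len(items[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(item[i] for item in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))

def alpha_if_deleted(items):
    """Alpha recomputed with each item removed in turn."""
    return [cronbach_alpha(items[:j] + items[j + 1:])
            for j in range(len(items))]
```

Selecting items solely to maximize these values tends to retain near-duplicate items, which is exactly the content-validity problem described above.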
Trade-off with efficiency
Other conditions being equal, reliability increases with the number of items. However, adding items makes measurement less efficient: each additional item costs administration time and respondent effort.
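The efficiency trade-off is visible in the other direction of the Spearman-Brown formula: lengthening a test raises reliability, but with diminishing returns, so each further gain costs more items. A brief sketch (function name is illustrative):

```python
def lengthened_reliability(rho: float, k: float) -> float:
    """Spearman-Brown: reliability after multiplying test length by k,
    assuming the added items are parallel to the existing ones."""
    return k * rho / (1 + (k - 1) * rho)

# Starting from rho = 0.5, doubling length repeatedly:
for k in (1, 2, 4, 8):
    print(k, round(lengthened_reliability(0.5, k), 3))
# 1 -> 0.5, 2 -> 0.667, 4 -> 0.8, 8 -> 0.889: each doubling buys less.
```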
Methods to increase reliability
The following methods can be considered for increasing reliability.

Before data collection:
• Eliminate ambiguity in the wording of each item.
• Do not ask respondents about things they do not know.
• Increase the number of items.
• Use a scale that is known to be highly reliable.

After data collection:
• Use item analysis to identify and remove problematic items (taking care not to damage content validity).

==See also==