38 Qualitative Research
Qualitative research generates evidence that is not born as numbers: interview transcripts, open-ended survey responses, field notes, social-media posts, product reviews, and the imagery and video that increasingly dominate consumer expression. To make such evidence cumulative and defensible, researchers code it—assigning categorical or ordinal labels to passages, respondents, or images according to a coding scheme. The credibility of every downstream claim then hinges on a single question: would a second, independent analyst, applying the same scheme, have produced the same labels? If coding is idiosyncratic to the coder, the “findings” are an artifact of who did the reading, not a property of the data. Inter-rater reliability (IRR), also called inter-coder agreement, is the family of statistics that answers this question, and it is the methodological hinge on which qualitative rigor turns.
This concern is not parochial to ethnography. The same machinery governs the quality of any human-labeled dataset: relevance judgments in information retrieval, sentiment labels that train a classifier, diagnostic ratings in medicine, and—of growing importance in marketing—the gold-standard annotations used to validate text- and image-mining pipelines (Chapter 35). Whenever a learned model is benchmarked against “ground truth,” that ground truth is itself a set of human codes whose reliability must be quantified before the benchmark means anything. Reliability is necessary but not sufficient for validity: two coders can agree perfectly on a scheme that measures the wrong construct. The agreement coefficients in this chapter certify consistency; the meaning of the codes is established by the construct-validity arguments of Chapter 35.
The chapter develops the IRR coefficients as a single coherent system rather than a catalog. We begin with the deceptively simple idea of percent agreement and show precisely why it fails. Correcting that failure—subtracting the agreement expected by chance—yields the chance-corrected family that includes Cohen’s and Fleiss’s kappa, Krippendorff’s \(\alpha\), and Kendall’s \(W\). We then turn to the intraclass correlation coefficient (ICC) for continuous ratings, which embeds reliability inside a variance-components model and exposes the design assumptions the kappa family leaves implicit. Throughout, intuition leads, the estimator and its assumptions follow in full, and every coefficient is computed on reproducible data.
38.1 The Chance-Correction Problem
Every chance-corrected agreement coefficient in this chapter is a special case of one idea. Let \(p_o\) be the proportion of items on which the raters actually agree (observed agreement) and let \(p_e\) be the proportion of agreement expected if raters labeled items independently at random. The corrected coefficient rescales observed agreement against the headroom that chance leaves:
\[ S = \frac{p_o - p_e}{1 - p_e} = 1 - \frac{1 - p_o}{1 - p_e}. \tag{38.1}\]
The numerator \(p_o - p_e\) is agreement beyond chance; the denominator \(1 - p_e\) is the maximum agreement beyond chance attainable. Thus \(S = 1\) for perfect agreement, \(S = 0\) when observed agreement equals chance, and \(S < 0\) when raters agree less than chance, which is a diagnostic signal of systematic disagreement rather than mere noise. The coefficients differ only in how they define \(p_o\) and, above all, \(p_e\): how chance is modeled, how many raters and categories are allowed, whether categories are nominal or ordinal, and how partial credit is assigned to near misses. Holding Equation 38.1 fixed and varying its two inputs organizes the entire family, summarized in Table 38.1.
| Coefficient | Raters | Scale | Chance corrected | Handles missing |
|---|---|---|---|---|
| Percent agreement | 2+ | Any | No | No |
| Cohen’s κ | 2 | Nominal | Yes | No |
| Weighted κ | 2 | Ordinal | Yes | No |
| Fleiss’s κ | 3+ | Nominal/Ordinal | Yes | Limited |
| Krippendorff’s α | 2+ | Nominal–Ratio | Yes | Yes |
| Kendall’s W | 3+ | Ordinal (ranks) | Yes (ties) | No |
| ICC | 2+ | Interval/Continuous | Yes | Limited |
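As a quick numerical illustration of Equation 38.1 (the figures are hypothetical, chosen only for arithmetic convenience), hold observed agreement at \(p_o = 0.80\) and vary the chance baseline:

\[ p_e = 0.50:\quad S = \frac{0.80 - 0.50}{1 - 0.50} = 0.60, \qquad\qquad p_e = 0.75:\quad S = \frac{0.80 - 0.75}{1 - 0.75} = 0.20. \]

The same raw agreement therefore supports very different conclusions once the chance baseline is taken into account.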
38.1.1 Why Raw Agreement Misleads
The naive measure of reliability is percent agreement: the share of items two raters label identically, \[ p_o = \frac{\text{number of agreements}}{\text{number of items}}, \] usually reported as the percentage \(100\,p_o\). Its appeal is transparency; its flaw is fatal. Percent agreement credits the agreement two raters would reach even if both labeled items by coin flip, and that baseline grows with category imbalance. Consider a rare event coded present in 5% of items. Two raters who blindly code everything “absent” agree 100% of the time and never once detect the event, yet percent agreement applauds them. The statistic conflates skill with the base rate, so it is uninterpretable across studies with different category distributions. Chance correction (Equation 38.1) exists precisely to strip out this base-rate-driven floor.
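The base-rate effect can be checked numerically. The sketch below is illustrative only: the helper names `chance_agreement` and `chance_corrected` are hypothetical, and it assumes two raters who each code an item "present" independently with the same probability, so that their expected observed agreement equals the chance baseline.

```r
# Expected agreement for two raters who code "present" independently
# with the same probability `prevalence` (an assumption of this sketch).
chance_agreement <- function(prevalence) {
  prevalence^2 + (1 - prevalence)^2
}

# Chance-corrected coefficient from Equation 38.1.
chance_corrected <- function(p_o, p_e) {
  (p_o - p_e) / (1 - p_e)
}

prevalence <- c(0.50, 0.20, 0.05)
p_e <- chance_agreement(prevalence)

# Raters with no skill: observed agreement equals the chance baseline.
data.frame(
  prevalence        = prevalence,
  percent_agreement = 100 * p_e,
  corrected         = chance_corrected(p_o = p_e, p_e = p_e)
)
```

Under these assumptions percent agreement climbs from 50% to 90.5% as the category becomes rarer, while the corrected coefficient stays at zero throughout.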
```r
library(irr)                             # provides agree(), kappa2(), kappam.fleiss(), ...
data("diagnoses", package = "irr")       # 30 patients, 6 raters, psychiatric dx
head(diagnoses, 5)
#>                    rater1                  rater2                  rater3
#> 1             4. Neurosis             4. Neurosis             4. Neurosis
#> 2 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
#> 3 2. Personality Disorder        3. Schizophrenia        3. Schizophrenia
#> 4                5. Other                5. Other                5. Other
#> 5 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
#>             rater4           rater5      rater6
#> 1      4. Neurosis      4. Neurosis 4. Neurosis
#> 2         5. Other         5. Other    5. Other
#> 3 3. Schizophrenia 3. Schizophrenia    5. Other
#> 4         5. Other         5. Other    5. Other
#> 5      4. Neurosis      4. Neurosis 4. Neurosis
agree(diagnoses[, 1:2])                  # raw percent agreement, raters 1 and 2
#> Percentage agreement (Tolerance=0)
#>
#> Subjects = 30
#> Raters = 2
#> %-agree = 73.3
```

The diagnoses data record six psychiatrists independently assigning each of 30 patients to one of five diagnostic categories, a canonical nominal-coding problem isomorphic to coding open-ended responses into themes. Raw agreement looks reassuring, but we cannot yet say whether it reflects genuine reliability or the prevalence of common diagnoses.
38.2 Cohen’s Kappa: Two Raters, Nominal Categories
Cohen (1960) supplied the first widely adopted correction for two raters and nominal categories. Cohen’s \(\kappa\) takes \(p_o\) as the observed proportion of exact agreement and models chance under the assumption that each rater applies their own marginal distribution of category usage independently of the other:
\[ \kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_e = \sum_{k=1}^{K} \hat{p}_{1k}\,\hat{p}_{2k}, \tag{38.2}\]
where \(\hat{p}_{1k}\) and \(\hat{p}_{2k}\) are the marginal proportions with which rater 1 and rater 2 use category \(k\), and \(K\) is the number of categories. The product \(\hat{p}_{1k}\,\hat{p}_{2k}\) is the chance probability that both land on category \(k\) when acting independently; summing over \(k\) gives total expected agreement. Substituting into Equation 38.1 yields the coefficient. Cohen’s \(\kappa\) presumes a fixed pair of raters (the marginals are specific to those two coders) and exhaustive, mutually exclusive categories, and it treats every disagreement as equally severe, which is appropriate when categories are nominal but wasteful when they are ordinal.
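Equation 38.2 can be traced by hand on a small cross-tabulation. The following is a minimal sketch in base R; the table and the function name `cohen_kappa_sketch` are hypothetical and serve only to make each term of the formula visible.

```r
# Hypothetical cross-tabulation of two coders
# (rows: coder 1, columns: coder 2), 50 items in total.
tab <- matrix(c(20,  5,
                10, 15),
              nrow = 2, byrow = TRUE,
              dimnames = list(coder1 = c("theme", "no theme"),
                              coder2 = c("theme", "no theme")))

cohen_kappa_sketch <- function(tab) {
  p   <- tab / sum(tab)                  # joint proportions
  p_o <- sum(diag(p))                    # observed exact agreement
  p_e <- sum(rowSums(p) * colSums(p))    # sum over k of p_1k * p_2k
  c(p_o = p_o, p_e = p_e, kappa = (p_o - p_e) / (1 - p_e))
}

cohen_kappa_sketch(tab)
```

Here the coders agree on 70% of items, but their marginals alone would produce 50% agreement, so \(\kappa = (0.70 - 0.50)/(1 - 0.50) = 0.40\).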
To read a \(\kappa\) value, the field relies on the verbal benchmarks of Landis and Koch (1977), reproduced in Table 38.2. These thresholds are conventions, not laws: they ignore \(K\) (a given \(\kappa\) is harder to achieve with many categories, because there are more ways to disagree) and the cost of disagreement, so they should anchor judgment rather than replace it.
| Kappa range | Interpretation |
|---|---|
| < 0.00 | Poor (worse than chance) |
| 0.00 – 0.20 | Slight |
| 0.21 – 0.40 | Fair |
| 0.41 – 0.60 | Moderate |
| 0.61 – 0.80 | Substantial |
| 0.81 – 1.00 | Almost perfect |
38.2.1 Weighting Ordinal Disagreements
When categories are ordered—Likert intensities, severity grades, sentiment from very negative to very positive—treating a one-step miss as harshly as a five-step miss discards information. Weighted kappa generalizes Equation 38.2 by attaching a disagreement weight \(w_{jk}\) to each cell of the \(K \times K\) rating table, with \(w_{kk} = 0\) on the diagonal and weights increasing in the distance \(|j-k|\) off it. Both observed and expected disagreement are then weighted:
\[ \kappa_w = 1 - \frac{\sum_{j,k} w_{jk}\, p_{o,jk}}{\sum_{j,k} w_{jk}\, p_{e,jk}}, \tag{38.3}\]
where \(p_{o,jk}\) is the observed and \(p_{e,jk} = \hat{p}_{1j}\hat{p}_{2k}\) the chance proportion of the \((j,k)\) cell. Two weighting schemes dominate. Linear weights, \(w_{jk} = |j-k|\), penalize disagreement in proportion to the number of steps; quadratic weights, \(w_{jk} = (j-k)^2\), penalize large discrepancies far more steeply. The choice is substantive: quadratic weights are appropriate when a two-category error is much worse than two one-category errors, and—usefully—the quadratically weighted \(\kappa\) coincides with a two-way intraclass correlation under mild conditions, linking the kappa and ICC families.1
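To see how the weights in Equation 38.3 change the coefficient, the sketch below computes \(\kappa_w\) directly from a rating table in base R. The 3 x 3 table and the function name `weighted_kappa_sketch` are hypothetical; the "nominal" option uses 0/1 weights, which reduces Equation 38.3 to the unweighted coefficient of Equation 38.2.

```r
# Hypothetical 3-level ordinal ratings by two coders (80 items);
# every disagreement in this table is exactly one step apart.
tab <- matrix(c(20,  5,  0,
                 5, 20,  5,
                 0,  5, 20),
              nrow = 3, byrow = TRUE)

weighted_kappa_sketch <- function(tab,
                                  weights = c("linear", "quadratic", "nominal")) {
  weights <- match.arg(weights)
  K <- nrow(tab)
  d <- abs(outer(seq_len(K), seq_len(K), "-"))      # distance |j - k|
  w <- switch(weights,
              linear    = d,
              quadratic = d^2,
              nominal   = (d > 0) * 1)              # all misses equal
  p_obs <- tab / sum(tab)                           # observed cell proportions
  p_exp <- outer(rowSums(p_obs), colSums(p_obs))    # chance cell proportions
  1 - sum(w * p_obs) / sum(w * p_exp)
}

sapply(c(nominal = "nominal", linear = "linear", quadratic = "quadratic"),
       function(w) weighted_kappa_sketch(tab, w))
```

Because all disagreements in this table are near misses, the coefficient rises from about 0.62 (nominal) to about 0.71 (linear) and 0.80 (quadratic): the weighted versions stop penalizing one-step errors as if they were maximal.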
```r
dx2 <- diagnoses[, c("rater1", "rater2")]
kappa2(dx2, weight = "unweighted")   # nominal: all disagreements equal
#> Cohen's Kappa for 2 Raters (Weights: unweighted)
#>
#> Subjects = 30
#> Raters = 2
#> Kappa = 0.651
#>
#> z = 7
#> p-value = 2.63e-12
kappa2(dx2, weight = "equal")        # linear weights for ordinal scales
#> Cohen's Kappa for 2 Raters (Weights: equal)
#>
#> Subjects = 30
#> Raters = 2
#> Kappa = 0.633
#>
#> z = 5.43
#> p-value = 5.52e-08
kappa2(dx2, weight = "squared")      # quadratic weights penalize large gaps
#> Cohen's Kappa for 2 Raters (Weights: squared)
#>
#> Subjects = 30
#> Raters = 2
#> Kappa = 0.655
#>
#> z = 3.91
#> p-value = 9.37e-05
```

The weighted calls presuppose an ordering of the categories; because the diagnostic categories here are nominal, the weighted runs illustrate the syntax rather than a substantively meaningful analysis. The accompanying \(z\)-test reports whether \(\kappa\) exceeds zero; a \(p\)-value below 0.05 indicates raters agree more than chance. That is a low bar (rejecting \(\kappa = 0\) is not evidence of good reliability), so the point estimate read against Table 38.2, with a confidence interval, remains the substantive summary. DescTools::CohenKappa() returns that interval directly.
```r
DescTools::CohenKappa(table(dx2$rater1, dx2$rater2), conf.level = 0.95)
#>     kappa    lwr.ci    upr.ci
#> 0.6511628 0.4557884 0.8465372
```

38.3 Fleiss’s Kappa: Many Raters
Cohen’s \(\kappa\) is structurally limited to two raters. Coding teams routinely use three or more, and—as in the diagnoses study—the particular raters assigned to an item may vary. Equation 38.2 cannot accommodate this because its \(p_e\) is built from two specific raters’ marginals. Fleiss’s kappa generalizes to any number of raters by redefining agreement per item rather than per rater pair, and by assuming the raters for each item are a random draw from a larger pool, so a single category distribution describes chance for all items.
For \(n\) items, \(m\) raters per item, and \(K\) categories, let \(n_{ik}\) be the number of raters who assign item \(i\) to category \(k\). The per-item observed agreement is the proportion of agreeing rater pairs, \[ P_i = \frac{1}{m(m-1)} \left( \sum_{k=1}^{K} n_{ik}^2 - m \right), \] and \(\bar{P} = \tfrac{1}{n}\sum_i P_i\) is mean observed agreement. Expected agreement uses the pooled category proportions \(\bar{p}_k = \tfrac{1}{nm}\sum_i n_{ik}\), giving \[ \bar{P}_e = \sum_{k=1}^{K} \bar{p}_k^2, \qquad \kappa_{\text{Fleiss}} = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}. \tag{38.4}\]
This is again Equation 38.1, now with multi-rater definitions of observed and expected agreement. The key identifying assumption, that raters are exchangeable draws from a common pool, is what permits a single set of pooled proportions \(\bar{p}_k\) to stand in for every item’s chance baseline; it is appropriate when coders are interchangeable but not when specific, non-substitutable experts are deliberately paired.
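The per-item logic of Equation 38.4 is easiest to follow on a table of counts \(n_{ik}\). The sketch below uses a hypothetical 5-item, 3-category table with \(m = 4\) raters per item; the function name `fleiss_kappa_sketch` is likewise an illustration, not a library routine.

```r
# Hypothetical counts n_ik: rows are items, columns are categories,
# each row sums to m = 4 raters.
counts <- rbind(
  c(4, 0, 0),   # unanimous
  c(0, 4, 0),   # unanimous
  c(2, 2, 0),   # split two ways
  c(0, 0, 4),   # unanimous
  c(1, 1, 2)    # scattered
)

fleiss_kappa_sketch <- function(counts) {
  n <- nrow(counts)
  m <- unique(rowSums(counts))
  stopifnot(length(m) == 1)                          # same number of raters per item
  P_i   <- (rowSums(counts^2) - m) / (m * (m - 1))   # agreeing pairs per item
  P_bar <- mean(P_i)                                 # mean observed agreement
  p_k   <- colSums(counts) / (n * m)                 # pooled category proportions
  P_e   <- sum(p_k^2)                                # expected agreement
  c(P_bar = P_bar, P_e = P_e, kappa = (P_bar - P_e) / (1 - P_e))
}

fleiss_kappa_sketch(counts)
```

For these counts \(\bar{P} = 0.70\) and \(\bar{P}_e = 0.335\), giving \(\kappa_{\text{Fleiss}} \approx 0.55\); the unanimous items contribute \(P_i = 1\) and the scattered item only \(P_i = 1/6\).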
```r
kappam.fleiss(diagnoses)   # all six raters; no fixed-pairing assumption
#> Fleiss' Kappa for m Raters
#>
#> Subjects = 30
#> Raters = 6
#> Kappa = 0.43
#>
#> z = 17.7
#> p-value = 0
```

A related multi-rater measure, Light’s kappa, instead averages Cohen’s \(\kappa\) over every pair of raters (Cohen 1960). It retains Cohen’s pairwise, fixed-rater chance model while summarizing across the team, and is preferable when raters are not exchangeable, because each pair’s marginals are honored before averaging.
```r
kappam.light(diagnoses)   # mean of pairwise Cohen's kappa
#> Light's Kappa for m Raters
#>
#> Subjects = 30
#> Raters = 6
#> Kappa = 0.459
#>
#> z = 2.31
#> p-value = 0.0211
```

38.4 Krippendorff’s Alpha: A General Reliability Coefficient
The kappa family carries assumptions that real coding violates: it presumes complete data (every rater codes every item), a fixed measurement level, and—for Fleiss—exchangeable raters. Krippendorff’s \(\alpha\) dispenses with these. It is the most general agreement coefficient: it accommodates any number of raters, missing values, unequal numbers of raters per item, and any measurement level—nominal, ordinal, interval, or ratio—through a single pluggable difference function. Rather than correcting observed agreement, \(\alpha\) is built on observed and expected disagreement:
\[ \alpha = 1 - \frac{D_o}{D_e}, \tag{38.5}\]
which is Equation 38.1 rearranged: writing \(S = 1 - (1-p_o)/(1-p_e)\) and identifying \(1 - p_o\) with observed disagreement \(D_o\) and \(1 - p_e\) with expected disagreement \(D_e\) makes the equivalence explicit. The observed disagreement \(D_o\) aggregates, over all pairs of ratings of the same item, a metric \(\delta^2(c,c')\) measuring how far apart categories \(c\) and \(c'\) are; the expected disagreement \(D_e\) aggregates the same metric over all pairs drawn from the data irrespective of item, i.e., under independence. The genius of the construction is that the measurement level enters only through \(\delta^2\): it is an indicator \(\mathbb{1}[c \neq c']\) for nominal data, a squared rank distance for ordinal data, and \((c-c')^2\) for interval data. One estimator therefore spans all scales, and because \(D_o\) and \(D_e\) are computed over coincidences (pairs of available ratings) rather than complete records, missing data are handled natively.
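The coincidence-based construction can be sketched in a few lines of base R. This is a simplified illustration, not the implementation used by any package: the function `kripp_alpha_sketch`, the two metric helpers, and the toy ratings are all hypothetical, and only the nominal and interval metrics are shown. Raters are in rows and units in columns, with `NA` marking a missing rating.

```r
kripp_alpha_sketch <- function(ratings, delta2) {
  values <- sort(unique(ratings[!is.na(ratings)]))
  K <- length(values)
  coinc <- matrix(0, K, K)                     # coincidence matrix o_ck
  for (u in seq_len(ncol(ratings))) {
    r   <- ratings[!is.na(ratings[, u]), u]    # available ratings of unit u
    m_u <- length(r)
    if (m_u < 2) next                          # a lone rating forms no pair
    idx <- match(r, values)
    for (a in seq_len(m_u)) {
      for (b in seq_len(m_u)) {
        if (a != b) {
          coinc[idx[a], idx[b]] <- coinc[idx[a], idx[b]] + 1 / (m_u - 1)
        }
      }
    }
  }
  n_c <- rowSums(coinc)                        # pairable values per category
  n   <- sum(n_c)
  d   <- outer(values, values, delta2)         # metric delta^2(c, c')
  D_o <- sum(coinc * d) / n                    # observed disagreement
  D_e <- sum(outer(n_c, n_c) * d) / (n * (n - 1))  # expected disagreement
  1 - D_o / D_e
}

delta_nominal  <- function(c1, c2) as.numeric(c1 != c2)
delta_interval <- function(c1, c2) (c1 - c2)^2

# Toy data: two coders, eleven units, one missing rating.
ratings <- rbind(
  coderA = c(0, 1, 0, 0, 0, 0, 0, 0, 1, 0, NA),
  coderB = c(1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1)
)
alpha_nominal <- kripp_alpha_sketch(ratings, delta_nominal)
alpha_nominal
```

The eleventh unit has only one rating, so it contributes no coincidences and is simply skipped; the remaining ten units give \(\alpha = 1 - 0.40/0.442 \approx 0.095\). Swapping in `delta_interval` changes nothing else in the function, which is the sense in which the measurement level enters only through \(\delta^2\).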
For interpretation, Shelley and Krippendorff (1984) recommend treating \(\alpha \geq 0.80\) as acceptable reliability and \(\alpha \geq 0.667\) as the lowest threshold at which tentative conclusions may be drawn—a deliberately conservative bar reflecting the coefficient’s use as a gatekeeper for publishable coding.
Reliability data must be reproducible: a coefficient certifies that the data would arise again under independent coding by other competent coders applying the same instructions. By this standard, variables with \(\alpha < 0.667\) should not be analyzed, and only \(\alpha \geq 0.80\) licenses confident substantive claims (Shelley and Krippendorff 1984).
The following example—four observers rating twelve units, with missing entries coded NA—computes \(\alpha\) under each measurement level so the role of \(\delta^2\) is visible: the same coincidences yield different coefficients purely because the distance between categories is defined differently.
```r
ratings_mat <- matrix(
  c(1, 1, NA, 1, 2, 2, 3, 2, 3, 3, 3, 3,
    3, 3, 3, 3, 2, 2, 2, 2, 1, 2, 3, 4,
    4, 4, 4, 4, 1, 1, 2, 1, 2, 2, 2, 2,
    NA, 5, 5, 5, NA, NA, 1, 1, NA, NA, 3, NA),
  nrow = 4, byrow = TRUE   # 4 observers (rows) x 12 units (columns)
)
kripp.alpha(ratings_mat, method = "nominal")    # delta = 1[c != c']
#> Krippendorff's alpha
#>
#> Subjects = 12
#> Raters = 4
#> alpha = -0.0658
kripp.alpha(ratings_mat, method = "ordinal")    # rank-distance metric
#> Krippendorff's alpha
#>
#> Subjects = 12
#> Raters = 4
#> alpha = 0.166
kripp.alpha(ratings_mat, method = "interval")   # squared difference
#> Krippendorff's alpha
#>
#> Subjects = 12
#> Raters = 4
#> alpha = 0.179
kripp.alpha(ratings_mat, method = "ratio")      # ratio-scale metric
#> Krippendorff's alpha
#>
#> Subjects = 12
#> Raters = 4
#> alpha = 0.083
```

Moving from nominal to interval typically raises \(\alpha\) when most disagreements are between adjacent categories, because the metric awards partial credit for near misses, which is the ordinal logic of weighted kappa, generalized. Reporting the measurement level alongside the coefficient is therefore essential: an unqualified “\(\alpha = 0.74\)” is incomplete.
38.5 Kendall’s W: Agreement Among Rankings
Some qualitative tasks ask raters not to categorize items but to rank them: ordering brand concepts by appeal, prioritizing themes by salience, or sorting stimuli. Agreement is then concordance among orderings, and the natural coefficient is Kendall’s coefficient of concordance \(W\). For \(m\) raters each ranking \(n\) objects, let \(R_i\) be the sum of ranks object \(i\) receives. If raters agree, some objects accumulate consistently high rank sums and others consistently low, so the \(R_i\) are dispersed; if raters are random, every object draws a middling sum and the \(R_i\) cluster. \(W\) normalizes the realized dispersion by its maximum:
\[ W = \frac{12 \sum_{i=1}^{n} \left(R_i - \bar{R}\right)^2}{m^2\,(n^3 - n)}, \qquad \bar{R} = \frac{m(n+1)}{2}, \tag{38.6}\]
where the denominator \(m^2(n^3-n)/12\) is the sum of squared deviations of the rank sums under perfect agreement (the largest value the numerator sum can take), so \(W \in [0,1]\) with \(1\) denoting unanimous ordering and \(0\) denoting no association.2 Under the null of no concordance, \(m(n-1)W\) is approximately \(\chi^2_{n-1}\), giving a significance test; the correction for tied ranks reduces the denominator accordingly.
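Equation 38.6 and the \(\chi^2\) approximation can be verified by hand. The base-R sketch below uses the same three rankings that this section analyzes with DescTools::KendallW(), assumes no tied ranks, and also checks the linear relation between \(W\) and the mean pairwise Spearman correlation stated in the footnote.

```r
# Three raters (columns) ranking six objects (rows); no ties.
rankings <- cbind(
  rater1 = c(1, 6, 3, 2, 5, 4),
  rater2 = c(1, 5, 6, 2, 4, 3),
  rater3 = c(2, 3, 6, 5, 4, 1)
)
n <- nrow(rankings)                  # objects
m <- ncol(rankings)                  # raters

R     <- rowSums(rankings)           # rank sum per object
R_bar <- m * (n + 1) / 2             # mean rank sum
W     <- 12 * sum((R - R_bar)^2) / (m^2 * (n^3 - n))

chi_sq  <- m * (n - 1) * W           # approximately chi-squared, df = n - 1
p_value <- pchisq(chi_sq, df = n - 1, lower.tail = FALSE)

# Mean Spearman correlation over the three rater pairs.
rho        <- cor(rankings, method = "spearman")
mean_rho   <- mean(rho[upper.tri(rho)])
W_from_rho <- ((m - 1) * mean_rho + 1) / m

round(c(W = W, chi_sq = chi_sq, p_value = p_value, W_from_rho = W_from_rho), 4)
```

The rank sums are 4, 14, 15, 9, 13, and 8 around a mean of 10.5, which gives \(W \approx 0.568\) and a statistic of about 8.52 on 5 degrees of freedom; the value recovered from the mean Spearman correlation is identical.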
```r
rankings <- cbind(
  rater1 = c(1, 6, 3, 2, 5, 4),
  rater2 = c(1, 5, 6, 2, 4, 3),
  rater3 = c(2, 3, 6, 5, 4, 1)
)
DescTools::KendallW(rankings, test = TRUE)
#>
#> Kendall's coefficient of concordance W
#>
#> data: rankings
#> Kendall chi-squared = 8.5238, df = 5, subjects = 6, raters = 3, p-value
#> = 0.1296
#> alternative hypothesis: W is greater 0
#> sample estimates:
#>        W
#> 0.568254
```

38.6 Intraclass Correlation: Reliability as Variance Decomposition
When ratings are continuous—perceived quality on a 0–100 scale, sentiment scores, expert intensity judgments—agreement becomes a question about variance. The intraclass correlation coefficient (ICC) of Shrout and Fleiss (1979) asks: of all the variation in the ratings, how much is true between-subject signal versus rater noise? Intuitively, a measurement is reliable when subjects differ from one another far more than raters differ on the same subject. The ICC formalizes this as the share of total variance attributable to subjects.
The ICC is defined inside a variance-components model. Let \(y_{ij}\) be the rating of subject \(i\) by rater \(j\). The two-way model decomposes it as \[ y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}, \qquad \alpha_i \sim \mathcal{N}(0, \sigma_\alpha^2),\; \beta_j \sim \mathcal{N}(0, \sigma_\beta^2),\; \varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2), \tag{38.7}\]
where \(\mu\) is the grand mean, \(\alpha_i\) the subject effect (the signal we want to measure reliably), \(\beta_j\) the rater effect (systematic leniency or severity of rater \(j\)), and \(\varepsilon_{ij}\) residual error. The variance components are estimated from the rating table’s mean squares by analysis of variance. The ICC is then a ratio of variance components, but which ratio depends on three design decisions that must be stated before estimation, because they change the estimand—not merely its estimate:
- Model. A one-way model treats only subjects as random and folds rater and error into a single residual; it applies when each subject is rated by a different, randomly chosen set of raters. A two-way model adds a rater effect \(\beta_j\) and applies when the same raters score every subject.
- Definition: agreement vs. consistency. Absolute agreement counts rater bias \(\sigma_\beta^2\) as error—two raters who are perfectly correlated but offset by ten points do not agree. Consistency excludes \(\sigma_\beta^2\), asking only whether raters rank-order subjects alike. Use agreement when the rating’s absolute level is interpreted; consistency when only relative standing matters.
- Unit: single vs. average. Report reliability of a single rater’s score when one coder will rate future items, or of the average of \(k\) raters when the mean of the panel is the operative measurement. Averaging suppresses error, so the average-measure ICC is mechanically higher (the Spearman–Brown relationship).
The two-way, absolute-agreement, single-measure ICC makes these choices concrete: \[ \text{ICC}(\text{agreement}, 1) = \frac{\sigma_\alpha^2} {\sigma_\alpha^2 + \sigma_\beta^2 + \sigma_\varepsilon^2}. \tag{38.8}\] The consistency variant simply drops \(\sigma_\beta^2\) from the denominator, mechanically raising the coefficient; the gap between the two quantifies how much a consistency coefficient overstates agreement by ignoring rater bias. Figure 38.1 maps the decisions to the estimand.
```mermaid
flowchart TD
  A["Continuous ratings"] --> B{"Same raters score<br/>every subject?"}
  B -->|"No: raters vary<br/>by subject"| C["One-way model<br/>ICC(1,·)"]
  B -->|"Yes: fixed panel"| D{"Is the absolute<br/>level interpreted?"}
  D -->|"Yes"| E["Two-way<br/>Absolute agreement"]
  D -->|"No: only ranking<br/>matters"| F["Two-way<br/>Consistency"]
  C --> G{"Score future items<br/>with one rater<br/>or a panel mean?"}
  E --> G
  F --> G
  G -->|"One rater"| H["Single-measure ICC"]
  G -->|"Panel mean of k"| I["Average-measure ICC<br/>(Spearman–Brown higher)"]
```
The anxiety data—three raters scoring the same twenty subjects—illustrate the two-way, absolute-agreement, single-measure case.
```r
data("anxiety", package = "irr")
icc(anxiety,
    model = "twoway",     # same raters score all subjects
    type  = "agreement",  # absolute level matters; counts rater bias as error
    unit  = "single")     # reliability of one rater's score
#> Single Score Intraclass Correlation
#>
#> Model: twoway
#> Type : agreement
#>
#> Subjects = 20
#> Raters = 3
#> ICC(A,1) = 0.198
#>
#> F-Test, H0: r0 = 0 ; H1: r0 > 0
#> F(19,38) = 1.83 , p = 0.0562
#>
#> 95%-Confidence Interval for ICC Population Values:
#> -0.039 < ICC < 0.494
```

DescTools::ICC() reports all six ICC forms of Shrout and Fleiss (1979) at once, which is the disciplined way to expose how sensitive the conclusion is to the design assumptions: a reliability that looks “substantial” as a consistency, average-measure coefficient can collapse as a single-rater, absolute-agreement coefficient.
```r
panel <- cbind(
  rater1 = c(9, 6, 8, 7, 10, 6),
  rater2 = c(2, 1, 4, 1, 5, 2),
  rater3 = c(5, 3, 6, 2, 6, 4),
  rater4 = c(8, 2, 8, 6, 9, 7)
)
DescTools::ICC(panel)
#>
#> Intraclass correlation coefficients
#>                          type   est F-val df1 df2    p-val lwr.ci upr.ci
#> Single_raters_absolute   ICC1 0.166  1.79   5  18 0.164769     NA     NA
#> Single_random_raters     ICC2 0.290 11.03   5  15 0.000135     NA     NA
#> Single_fixed_raters      ICC3 0.715 11.03   5  15 0.000135     NA     NA
#> Average_raters_absolute ICC1k 0.443  1.79   5  18 0.164769     NA     NA
#> Average_random_raters   ICC2k 0.620 11.03   5  15 0.000135     NA     NA
#> Average_fixed_raters    ICC3k 0.909 11.03   5  15 0.000135     NA     NA
#>
#> Number of subjects = 6 Number of raters = 4
```

Here the raters track one another’s ordering of subjects well but sit at very different absolute levels (rater 1 is systematically generous, rater 2 severe). Consistency ICCs are accordingly high while agreement ICCs are low, a textbook case of why Equation 38.8, not the consistency form, is the honest choice when the score’s level will be interpreted.
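The link between Equation 38.7, Equation 38.8, and the table above can be made explicit by estimating the variance components from the two-way ANOVA mean squares in base R. This is a sketch under the usual balanced-design moment estimators; the variable names are illustrative, and the final line applies the Spearman–Brown step-up to the single-measure consistency coefficient.

```r
# Same 6 subjects x 4 raters as in the DescTools example.
panel <- cbind(
  rater1 = c(9, 6, 8, 7, 10, 6),
  rater2 = c(2, 1, 4, 1, 5, 2),
  rater3 = c(5, 3, 6, 2, 6, 4),
  rater4 = c(8, 2, 8, 6, 9, 7)
)
n <- nrow(panel)                     # subjects
k <- ncol(panel)                     # raters
grand <- mean(panel)

# Two-way ANOVA mean squares (one rating per cell).
MSR <- k * sum((rowMeans(panel) - grand)^2) / (n - 1)   # subjects
MSC <- n * sum((colMeans(panel) - grand)^2) / (k - 1)   # raters
SSE <- sum((panel - grand)^2) - (n - 1) * MSR - (k - 1) * MSC
MSE <- SSE / ((n - 1) * (k - 1))                        # residual

# Moment estimates of the variance components in Equation 38.7.
var_subject <- (MSR - MSE) / k
var_rater   <- (MSC - MSE) / n
var_error   <- MSE

icc_agreement   <- var_subject / (var_subject + var_rater + var_error)  # Eq. 38.8
icc_consistency <- var_subject / (var_subject + var_error)
icc_consistency_avg <- k * icc_consistency / (1 + (k - 1) * icc_consistency)

round(c(agreement_single   = icc_agreement,
        consistency_single = icc_consistency,
        consistency_avg    = icc_consistency_avg), 3)
```

The three values, 0.290, 0.715, and 0.909, match the ICC2, ICC3, and ICC3k rows. The rater component (about 5.24) is roughly twice the subject component (about 2.56), which is why counting it as error cuts the coefficient so sharply.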
38.7 Choosing and Reporting a Coefficient
The coefficients are not interchangeable, and selecting one post hoc to maximize the reported number is a form of specification search. The choice is dictated by the data’s structure—measurement level, number of raters, presence of missing values, and whether the rating’s absolute level carries meaning—as mapped in Table 38.1 and Figure 38.1. Krippendorff’s \(\alpha\) is the safest default for categorical coding because it degrades gracefully to the others’ use cases while tolerating missing data and any measurement level; the kappa coefficients remain standard and expected in many literatures; the ICC is mandatory for continuous ratings.
Three reporting disciplines separate credible IRR from box-checking. First, report the coefficient with a confidence interval, not a bare point estimate: a \(\kappa\) of 0.70 from 20 items is far weaker evidence than the same value from 500. Second, state every assumption that defines the estimand—the measurement level and difference metric for \(\alpha\), the model/definition/unit triple for the ICC—because, as the examples show, the same data yield materially different coefficients under different assumptions. Third, remember that reliability bounds but does not establish validity: a perfectly reliable code can still measure the wrong construct, and the validity argument belongs to Chapter 35. High agreement earns the right to interpret the codes; it does not, by itself, make the interpretation correct.
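The first discipline can be illustrated with a percentile bootstrap, one of several ways to obtain an interval for \(\kappa\). The sketch below is a simulation under stated assumptions, not a recommended procedure: the data-generating helper `simulate_pair`, its 20% error rate, the three categories, and the number of resamples are all hypothetical choices.

```r
set.seed(38)

# Cohen's kappa for two label vectors over a fixed set of categories.
kappa_hat <- function(a, b, lv) {
  tab <- table(factor(a, levels = lv), factor(b, levels = lv)) / length(a)
  p_o <- sum(diag(tab))
  p_e <- sum(rowSums(tab) * colSums(tab))
  (p_o - p_e) / (1 - p_e)
}

# Two hypothetical coders who each replace the true label
# with a random one 20% of the time.
simulate_pair <- function(n, lv = 1:3, error = 0.2) {
  truth <- sample(lv, n, replace = TRUE)
  noisy <- function() {
    ifelse(runif(n) < error, sample(lv, n, replace = TRUE), truth)
  }
  list(a = noisy(), b = noisy())
}

# Percentile bootstrap: resample items, recompute kappa.
boot_ci <- function(a, b, lv = 1:3, reps = 2000) {
  n <- length(a)
  stats <- replicate(reps, {
    i <- sample.int(n, n, replace = TRUE)
    kappa_hat(a[i], b[i], lv)
  })
  quantile(stats, c(0.025, 0.975), na.rm = TRUE)
}

small <- simulate_pair(20)
large <- simulate_pair(500)
ci_small <- boot_ci(small$a, small$b)
ci_large <- boot_ci(large$a, large$b)

rbind(n_20 = ci_small, n_500 = ci_large)
```

Both samples come from the same coding process, yet the interval from 20 items is several times wider than the one from 500, which is the information a bare point estimate hides.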
38.8 Key Takeaways
- Inter-rater reliability is the methodological hinge of qualitative coding and of every human-labeled “ground truth” used to validate text- and image-mining models (Chapter 35).
- Raw percent agreement is uninterpretable because it credits agreement expected from category base rates; all defensible coefficients chance-correct via Equation 38.1.
- The kappa family (Equation 38.2, Equation 38.3, Equation 38.4) handles nominal and ordinal categories; weighting awards partial credit for near misses on ordered scales (Cohen 1960; Landis and Koch 1977).
- Krippendorff’s \(\alpha\) (Equation 38.5) is the general coefficient—any number of raters, missing data, any measurement level—and its thresholds gatekeep publishable coding (Shelley and Krippendorff 1984); Kendall’s \(W\) (Equation 38.6) handles rankings.
- For continuous ratings, the ICC (Equation 38.7, Equation 38.8) embeds reliability in a variance-components model whose model/definition/unit choices change the estimand and must be justified by the design, not the result (Shrout and Fleiss 1979).
- Report coefficients with confidence intervals and explicit assumptions, and never mistake reliability for validity.
1. The numerical equivalence of quadratically weighted \(\kappa\) and the two-way ICC holds when the marginal rating distributions are equal across raters; it fails as the marginals diverge, which is one reason ICC is preferred when raters use the scale with systematically different generosity (a prevalence or bias problem the unweighted \(\kappa\) also suffers, the so-called kappa paradox).
2. \(W\) relates linearly to the mean Spearman rank correlation \(\bar{r}_s\) across all rater pairs via \(W = [(m-1)\bar{r}_s + 1]/m\). Because \(W\) is bounded in \([0,1]\) it cannot represent systematic disagreement (negative association), which \(\bar{r}_s\) can. This limitation matters when raters may be genuinely opposed rather than merely noisy.