35 Measurement Scales

Most of the constructs marketing cares about—brand equity, satisfaction, involvement, trust, perceived authenticity—cannot be read off a receipt. They are latent: properties of a consumer’s mind that leave only indirect traces in answers to survey questions, in choices, and in behavior. A measurement scale is the instrument that turns those traces into numbers we can analyze, and the measurement model is the formal theory that says how the numbers relate to the construct they are supposed to capture. Getting the measurement model wrong is not a cosmetic error. It silently biases every downstream coefficient, invalidates the reliability and validity tests a reviewer will demand, and—most consequentially—can reverse a managerial recommendation, because the model dictates whether a manager should strengthen a coherent latent disposition or intervene on specific, possibly uncorrelated, drivers.

This chapter develops measurement as a modeling problem rather than a checklist. We begin with the classical theory that underwrites almost all psychometrics—true-score theory—and use it to define reliability precisely. We then draw the central distinction of the modern literature, between reflective and formative measurement, and make it formal with structural equations, including the identification conditions that decide whether a model can be estimated at all. With that machinery in place we walk through scale development as a disciplined sequence, define the standard battery of reliability and validity statistics and give their estimators and assumptions, and confront common-method bias—the contamination that arises when the same respondent supplies both predictor and outcome through the same instrument. Throughout, the goal is the one set in Chapter 11 when it asked whether brand equity is reflective or formative: to let the reader decide a construct’s measurement model on principled grounds and defend that decision to a skeptical referee.

A note on scope. We treat psychometric measurement—scales built from multiple survey items—because that is where the reflective/formative question is sharpest and where the field’s two leading handbooks of marketing scales live (Bearden, Netemeyer, and Haws 2011; Stewart et al. 1993). The logic transfers directly to behavioral and text-derived measures, which are developed in Chapter 43.

35.1 True-Score Theory and the Definition of Reliability

The foundation of classical test theory is disarmingly simple. An observed score \(X\) on an item is the sum of a true score \(T\)—the construct value we wish we could see—and a measurement error \(E\):

\[ X = T + E . \tag{35.1}\]

The theory’s content is in three assumptions about \(E\): it has mean zero, \(\mathbb{E}[E] = 0\), so \(T = \mathbb{E}[X]\) is defined as the expected observed score; it is uncorrelated with the true score, \(\mathrm{Cov}(T, E) = 0\); and errors on distinct items are mutually uncorrelated. Under these assumptions the observed variance decomposes additively,

\[ \sigma_X^2 = \sigma_T^2 + \sigma_E^2 , \tag{35.2}\]

and reliability is defined as the share of observed variance that is true variance,

\[ \rho_{XX'} \;=\; \frac{\sigma_T^2}{\sigma_X^2} \;=\; \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2} \in [0,1]. \tag{35.3}\]

Reliability is therefore not “consistency” in a loose sense; it is a variance ratio, equal to the squared correlation between observed and true scores and to the correlation between two parallel measurements of the same construct (hence the notation \(\rho_{XX'}\)). It says nothing about whether \(T\) measures the right construct—that is validity, and the two must never be conflated. A bathroom scale that always reads three kilograms heavy is perfectly reliable and invalid.
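The three equivalent readings of Equation 35.3 can be checked numerically. The sketch below is a small simulation under assumed values (a true-score standard deviation of 2 and an error standard deviation of 1, chosen only for illustration); the variable names are hypothetical and not part of the chapter's notation.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only

n       <- 100000                # large n so sample values sit near population values
sigma_T <- 2                     # assumed true-score SD (variance 4)
sigma_E <- 1                     # assumed error SD      (variance 1)

T_score <- rnorm(n, sd = sigma_T)             # true scores
X1      <- T_score + rnorm(n, sd = sigma_E)   # first measurement
X2      <- T_score + rnorm(n, sd = sigma_E)   # parallel measurement, independent error

rel_theory   <- sigma_T^2 / (sigma_T^2 + sigma_E^2)   # Equation 35.3: 4 / 5 = 0.8
rel_ratio    <- var(T_score) / var(X1)                # sample variance ratio
rel_sq_cor   <- cor(X1, T_score)^2                    # squared observed-true correlation
rel_parallel <- cor(X1, X2)                           # parallel-forms correlation

round(c(theory = rel_theory, ratio = rel_ratio,
        squared_cor = rel_sq_cor, parallel = rel_parallel), 3)
```

All four quantities should land close to 0.8. Only the last is available in practice, because true scores are never observed; that is why reliability has to be estimated from replicate measurements.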

Reliability matters beyond bookkeeping because unreliable measurement attenuates relationships. If a predictor \(X\) measures a true score \(T\) with reliability \(\rho_{XX'}\) and an outcome \(Y\) measures \(U\) with reliability \(\rho_{YY'}\), the observable correlation is shrunk toward zero relative to the construct-level correlation:

\[ \mathrm{Corr}(X, Y) \;=\; \mathrm{Corr}(T, U)\,\sqrt{\rho_{XX'}\,\rho_{YY'}} . \tag{35.4}\]

Equation 35.4 is the attenuation formula; solving it for \(\mathrm{Corr}(T,U)\) gives the classical correction for attenuation. It is why a true effect can be buried by a noisy scale, and why latent-variable models, which estimate \(\mathrm{Corr}(T,U)\) directly, are preferred to regressions on summed scores when reliability is modest.
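A short numerical illustration of Equation 35.4 may help. The function names and the input values below (a construct-level correlation of 0.60 and reliabilities of 0.70 and 0.80) are assumptions made for the sketch, not values taken from any study.

```r
# Equation 35.4: observed correlation implied by a construct-level correlation
attenuate <- function(r_true, rel_x, rel_y) r_true * sqrt(rel_x * rel_y)

# Inverting Equation 35.4 recovers the construct-level correlation
disattenuate <- function(r_obs, rel_x, rel_y) r_obs / sqrt(rel_x * rel_y)

r_true <- 0.60   # assumed construct-level correlation (illustrative)
rel_x  <- 0.70   # assumed reliability of the predictor scale
rel_y  <- 0.80   # assumed reliability of the outcome scale

r_obs <- attenuate(r_true, rel_x, rel_y)
r_obs                               # about 0.449
disattenuate(r_obs, rel_x, rel_y)   # recovers 0.60
```

With these inputs a quarter of the construct-level correlation disappears before any sampling error enters. Note that the inverse is only as good as the reliability estimates plugged into it, so an understated reliability will over-correct.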

Multiple-item measurement exists almost entirely to beat down \(\sigma_E^2\). By the Spearman–Brown logic, averaging \(k\) interchangeable items whose errors are independent divides the error variance of the composite by \(k\) while leaving the true variance intact, so a composite of many noisy items can be far more reliable than any single item.¹ This is the engine behind multi-item scales—and, as we will see, it is exactly the property a formative index does not have.
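The gain from adding items can be tabulated with the Spearman–Brown formula given in the footnote. This is a minimal sketch that assumes parallel items and an illustrative single-item reliability of 0.40; the function name is hypothetical.

```r
# Spearman-Brown: reliability of a composite of k parallel items,
# each with single-item reliability rho1
spearman_brown <- function(rho1, k) k * rho1 / (1 + (k - 1) * rho1)

rho1 <- 0.40                      # assumed single-item reliability (illustrative)
k    <- c(1, 2, 4, 8, 16)         # composite lengths to compare
round(spearman_brown(rho1, k), 3)
```

Under these assumptions reliability climbs from 0.40 for a single item to roughly 0.73 with four items and 0.91 with sixteen, with visibly diminishing returns. The calculation stands or falls on the parallel-items assumption.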

35.2 Reflective versus Formative Measurement

The single most consequential decision in scale construction is the direction of causality between the latent construct and its indicators. The choice is not a matter of taste; it is a substantive claim about how the world works, and it determines which model is even identified (Coltman et al. 2008; Diamantopoulos, Fritz, and Hildebrandt 2013).

35.2.1 The Reflective Model

In a reflective (effect-indicator) model, the latent construct causes its indicators: the items are manifestations, symptoms, or reflections of an underlying common cause. Let \(\eta\) be a scalar latent construct and \(x_1,\dots,x_K\) its observed indicators. The model is the common-factor model,

\[ x_k \;=\; \lambda_k\, \eta \;+\; \varepsilon_k, \qquad k = 1,\dots,K, \tag{35.5}\]

with \(\mathbb{E}[\varepsilon_k]=0\), \(\mathrm{Cov}(\eta,\varepsilon_k)=0\), and \(\mathrm{Cov}(\varepsilon_j,\varepsilon_k)=0\) for \(j\neq k\) (uncorrelated errors). The arrows run from \(\eta\) to each \(x_k\). Three properties follow and are diagnostic. First, because the indicators share a single common cause, they must be correlated: the off-diagonal covariances are \(\mathrm{Cov}(x_j,x_k)=\lambda_j\lambda_k\,\sigma_\eta^2\), which are all positive once the items are keyed in the same direction so that the loadings share a sign. Second, the indicators are interchangeable: dropping one changes the construct’s domain coverage but not its meaning, since each is a noisy replica of the same \(\eta\). Third, internal consistency (Cronbach’s alpha, composite reliability) is the appropriate quality criterion, because Equation 35.5 is the factor-analytic form of the true-score model under which Equation 35.3 was defined.

A reflective construct is the right model when the construct is a real, unitary attribute that radiates into measurable symptoms—the way a fever radiates into a high thermometer reading, a flushed face, and chills. Brand love (Batra, Ahuvia, and Bagozzi 2012; Bagozzi, Batra, and Ahuvia 2016), satisfaction, and attitude are conventionally modeled this way: a more satisfied customer answers every satisfaction item more favorably, so the items co-move.
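The covariance restriction implied by Equation 35.5 can be made concrete by building the model-implied covariance matrix directly. The loadings below are assumed values for illustration, and the latent variance is fixed to one.

```r
lambda  <- c(0.9, 0.8, 0.7)        # assumed loadings (illustrative)
var_eta <- 1                       # latent variance fixed to 1
theta   <- 1 - lambda^2            # error variances chosen so items have unit variance

# Model-implied covariance matrix: Sigma = lambda lambda' var_eta + diag(theta)
Sigma <- var_eta * tcrossprod(lambda) + diag(theta)
round(Sigma, 2)

# Every off-diagonal entry is a product of two loadings,
# e.g. Cov(x1, x2) = 0.9 * 0.8 = 0.72
```

Because every off-diagonal entry is a product of two loadings, a one-factor reflective model cannot reproduce a pattern in which some same-keyed items are uncorrelated or negatively related. That is the sense in which co-movement is diagnostic.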

35.2.2 The Formative Model

In a formative (causal-indicator) model, the indicators cause the construct: the latent variable is a composite formed by its indicators, which are defining ingredients rather than reflections. The structural equation reverses the arrows,

\[ \eta \;=\; \sum_{k=1}^{K} \gamma_k\, x_k \;+\; \zeta, \tag{35.6}\]

where \(\gamma_k\) are formative weights and \(\zeta\) is a disturbance capturing the part of \(\eta\) not explained by the chosen indicators, with \(\mathrm{Cov}(x_k,\zeta)=0\). Everything that was diagnostic for the reflective model is now reversed. The indicators need not be correlated—socioeconomic status formed from income, education, and occupation does not require that high earners be highly educated. The indicators are not interchangeable: removing one (say, income) deletes part of the construct’s definition, not merely part of its sampling. And internal-consistency reliability is inapplicable: there is no common cause to make the items co-vary, so a low alpha is no defect. Quality is judged instead by indicator weights, their significance, and multicollinearity among indicators (high collinearity makes the weights unstable and uninterpretable), typically diagnosed with the variance inflation factor.

A construct should be formative when it is constituted by a combination of distinct facets that need not move together. Classic examples are an index of life stress (a job loss and a marriage are both stressors but are uncorrelated causes), a marketing capability composite, or a deprivation index. Chapter 11 raised the live debate over whether brand equity and brand authenticity are reflective or formative; the two leading authenticity scales take opposite stances—Morhart et al. (2015) reflective, Nunes, Ordanini, and Giambastiani (2021) formative—and the disagreement is exactly a disagreement about the direction of Equation 35.5 versus Equation 35.6.
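For a formative index the relevant diagnostic is collinearity among the indicators, not their internal consistency. The sketch below computes variance inflation factors with base R on simulated socioeconomic-status indicators. The data-generating values are assumptions made for illustration, and no particular VIF cutoff is implied.

```r
set.seed(3)                               # arbitrary seed

n <- 400
income     <- rnorm(n)
education  <- 0.3 * income + rnorm(n)     # mildly related to income
occupation <- rnorm(n)                    # unrelated to the other two
Xf <- data.frame(income, education, occupation)

# VIF_k = 1 / (1 - R^2_k), where R^2_k comes from regressing
# indicator k on the remaining indicators
vif <- sapply(names(Xf), function(k) {
  fit <- lm(reformulate(setdiff(names(Xf), k), response = k), data = Xf)
  1 / (1 - summary(fit)$r.squared)
})
round(vif, 2)
```

Values near one indicate that each indicator contributes information the others do not, which is what a formative index needs. Values far above one signal that the corresponding weights in Equation 35.6 will be unstable.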

35.2.3 A Formal Contrast and the MIMIC Bridge

Figure 35.1 contrasts the two path structures, and the table that follows states the operational consequences side by side.

flowchart LR
  subgraph R["Reflective"]
    direction TB
    ETA1((η)) --> X1[x1]
    ETA1 --> X2[x2]
    ETA1 --> X3[x3]
    X1 --- E1((ε1))
    X2 --- E2((ε2))
    X3 --- E3((ε3))
  end
  subgraph F["Formative"]
    direction TB
    Z1[x1] --> ETA2((η))
    Z2[x2] --> ETA2
    Z3[x3] --> ETA2
    ZETA((ζ)) --- ETA2
  end

Figure 35.1: Reflective versus formative measurement. In the reflective model the latent construct η causes the indicators (arrows out, each indicator carries its own error ε). In the formative model the indicators cause the construct (arrows in, a single disturbance ζ attaches to the construct). A MIMIC model combines both: formative causes and reflective indicators of the same latent variable.

Table 35.1: Operational consequences of the reflective–formative distinction.

Property	Reflective (Equation 35.5)	Formative (Equation 35.6)
Causal direction	Construct → indicators	Indicators → construct
Indicator correlation	High, same sign (required)	Not required; any sign
Interchangeability	Indicators interchangeable	Indicators define the construct
Dropping an indicator	Narrows sampling of the domain	Changes the construct’s meaning
Error term	One per indicator (\(\varepsilon_k\))	One per construct (\(\zeta\))
Quality criterion	Internal consistency (alpha, CR); convergent validity (AVE)	Indicator weights; collinearity (VIF)
Identification	Standard (factor model)	Needs external structure (e.g., MIMIC)

The deepest difference in Table 35.1 is the last row. A pure formative model is not identified in isolation: Equation 35.6 places no testable restrictions on the indicator covariances, the scale and location of \(\eta\) are arbitrary, and the disturbance variance \(\sigma_\zeta^2\) cannot be separated from the weights without further information. Identification requires that the formative construct emit at least two reflective indicators (or two paths to downstream constructs), which fixes its metric and disturbance. The resulting hybrid is the multiple-indicators, multiple-causes (MIMIC) model: formative causes \(x_1,\dots,x_K\) feed a latent \(\eta\), which in turn emits reflective indicators \(y_1,\dots,y_M\),

\[ \eta \;=\; \sum_{k=1}^{K}\gamma_k x_k + \zeta, \qquad y_m \;=\; \lambda_m \eta + \varepsilon_m, \quad m = 1,\dots,M . \tag{35.7}\]

The MIMIC structure is the standard way marketing accommodates a genuinely formative construct inside an estimable structural-equation model (Diamantopoulos, Fritz, and Hildebrandt 2013). It also clarifies a common error: estimating Equation 35.6 by ordinary least squares treats \(\eta\) as observed and conflates the formative weights with a regression of an index on its parts, discarding the disturbance and the very latency that motivated the construct.
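One way to see why the reflective indicators buy identification is to substitute the first part of Equation 35.7 into the second. The symbols \(\pi_{mk}\) (a reduced-form coefficient) and \(u_m\) (a composite residual) are notation introduced here for this sketch only.

```latex
\[
\begin{aligned}
y_m &= \lambda_m\Big(\sum_{k=1}^{K}\gamma_k x_k + \zeta\Big) + \varepsilon_m
     = \sum_{k=1}^{K}\underbrace{\lambda_m\gamma_k}_{\pi_{mk}}\,x_k + u_m,
     \qquad u_m = \lambda_m\zeta + \varepsilon_m, \\[4pt]
\frac{\pi_{mk}}{\pi_{m'k}} &= \frac{\lambda_m}{\lambda_{m'}} \quad \text{for every } k,
     \qquad
\mathrm{Cov}(u_m, u_{m'}) = \lambda_m\lambda_{m'}\,\sigma_\zeta^2 \quad (m \neq m').
\end{aligned}
\]
```

With \(M \ge 2\) the reduced-form coefficients must be proportional across outcomes, which is a testable restriction. Once one loading is fixed (say \(\lambda_1 = 1\)) to set the metric, the residual covariance pins down \(\sigma_\zeta^2\). With a single reflective indicator there is no cross-equation covariance, and \(\sigma_\zeta^2\) cannot be separated from the error variance of that indicator.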

35.2.4 Why Misspecification Is Costly

Treating a formative construct as reflective (or the reverse) is not a venial sin. Forcing reflective machinery onto formative indicators discards items that fail to correlate—dropping low-alpha items narrows the construct’s content, biasing it toward whatever facet happens to be internally consistent. Conversely, modeling a reflective construct formatively throws away the error structure that Equation 35.4 depends on and inflates apparent validity. Simulation and analytic work show the resulting structural-coefficient bias can be large and of either sign, which is why the direction-of-causality decision belongs at the start of scale development, justified on conceptual grounds, not discovered after the data are in.

35.3 Scale Development

Scale development is the disciplined translation of a construct definition into a validated instrument. The field’s canonical paradigm is a staged procedure that moves from conceptual specification to a purified, validated multi-item scale, and modern treatments differ mainly in how much weight they place on the formative possibility (Bearden, Netemeyer, and Haws 2011). The stages below are reflective by default; we flag where a formative construct diverges.

1. Construct definition and domain specification. State precisely what the construct is and is not, its conceptual boundaries, and, critically, the direction of measurement (Table 35.1). This step decides everything downstream and cannot be repaired later by statistics.
2. Item generation. Draft a large pool of candidate items spanning the domain, from theory, qualitative interviews, and existing scales. For reflective constructs the pool should oversample each facet (items are replaceable); for formative constructs the pool must exhaust the facets (each is constitutive).
3. Content validity / expert judging. Have domain experts rate each item’s relevance and representativeness, retaining items with high agreement. This is a qualitative check that no statistic substitutes for.
4. Purification on a calibration sample. Collect data and trim items. For reflective scales, drop items with low item–total correlations or weak loadings and inspect exploratory factor structure for dimensionality. For formative indices, do not trim on inter-item correlation; trim only on collinearity and weight significance.
5. Confirmatory validation on a fresh sample. Fit the hypothesized measurement model (confirmatory factor analysis for reflective constructs; a MIMIC model for formative) and assess fit, reliability, and the validity battery of Section 35.4 below. Cross-validation on data not used for purification guards against capitalizing on chance.
6. Norming and replication. Establish means, variances, and stability across populations and over time, and demonstrate the nomological network: correlations with antecedents and consequences in the predicted pattern.
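The purification step for a reflective scale can be sketched with corrected item–total correlations, in which each item is correlated with the sum of the remaining items. The simulated item pool below is hypothetical: three items load on a common construct and a fourth is deliberately pure noise.

```r
set.seed(11)                                   # arbitrary seed

n   <- 300
eta <- rnorm(n)                                # simulated latent construct
items <- data.frame(
  i1 = 0.8 * eta + rnorm(n, sd = 0.6),
  i2 = 0.7 * eta + rnorm(n, sd = 0.7),
  i3 = 0.8 * eta + rnorm(n, sd = 0.6),
  i4 = rnorm(n)                                # deliberately bad item: pure noise
)

# Corrected item-total correlation: each item against the sum of the OTHER items
item_total <- sapply(names(items), function(k) {
  others <- items[, setdiff(names(items), k), drop = FALSE]
  cor(items[[k]], rowSums(others))
})
round(item_total, 2)
```

The noise item stands out with a correlation near zero and would be a candidate for removal. The same statistic must not be used to trim a formative index, where a low correlation with the other indicators is expected rather than a defect.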

Figure 35.2 renders the pipeline. The feedback arrows, from confirmatory validation back to construct definition and from purification back to item generation, are essential: scale development is iterative, and a construct that fails discriminant validity often needs its definition sharpened, not just its items re-fitted.

flowchart TB
  A[Define construct\n& measurement direction] --> B[Generate item pool]
  B --> C[Expert content judging]
  C --> D[Purify on\ncalibration sample]
  D --> E[Confirm on\nfresh sample]
  E --> F[Norm & replicate\nnomological network]
  E -. fails discriminant/convergent .-> A
  D -. weak items .-> B

Figure 35.2: The scale-development pipeline. Conceptual specification fixes the measurement direction; item generation and content judging are qualitative; purification and confirmatory validation are quantitative and run on separate samples. Validation failures feed back to redefinition, not merely re-estimation.

35.4 Reliability and Validity

A scale must be both reliable (low error variance) and valid (measures the intended construct). We give the standard estimators, their assumptions, and what breaks them.

35.4.1 Internal-Consistency Reliability

Cronbach’s alpha estimates reliability from a single administration of \(K\) items as

\[ \alpha \;=\; \frac{K}{K-1}\left(1 - \frac{\sum_{k=1}^{K}\sigma_{x_k}^2}{\sigma_{X_{\text{tot}}}^2}\right), \tag{35.8}\]

where \(\sigma_{x_k}^2\) is item variance and \(\sigma_{X_{\text{tot}}}^2\) the variance of the summed score. Alpha is a lower bound on reliability and equals it only under tau-equivalence—all items load equally on the common factor. When loadings differ (the congeneric case), alpha understates reliability, and composite reliability (CR), computed from the CFA loadings \(\hat\lambda_k\) and error variances \(\hat\theta_k\),

\[ \mathrm{CR} \;=\; \frac{\left(\sum_{k}\hat\lambda_k\right)^2}{\left(\sum_{k}\hat\lambda_k\right)^2 + \sum_{k}\hat\theta_k}, \tag{35.9}\]

is preferred because it does not assume equal loadings. Both assume a reflective unidimensional model; neither is meaningful for a formative index, where the items are not expected to covary. Conventional thresholds put acceptable internal consistency at roughly \(0.70\) and above, though the number is a heuristic, not a law.

35.4.2 Convergent and Discriminant Validity

Construct validity asks whether the scale measures its target construct. Fornell and Larcker (1981) give the workhorse criteria for reflective constructs estimated by CFA. Convergent validity—indicators of one construct converge—is evidenced when the average variance extracted (AVE),

\[ \mathrm{AVE} \;=\; \frac{\sum_{k}\hat\lambda_k^2}{\sum_{k}\hat\lambda_k^2 + \sum_{k}\hat\theta_k}, \tag{35.10}\]

exceeds \(0.50\), i.e., the construct explains more than half the variance in its indicators. Discriminant validity (distinct constructs are empirically distinct) was traditionally assessed by the Fornell–Larcker criterion: each construct’s \(\sqrt{\mathrm{AVE}}\) must exceed its correlation with every other construct (Fornell and Larcker 1981). The criterion is intuitive, but Monte Carlo simulations show that it detects discriminant-validity failures poorly in variance-based SEM, and the companion cross-loadings check, which asks that each indicator load more strongly on its own construct than on others, has near-zero sensitivity (Henseler, Ringle, and Sarstedt 2015). The currently recommended criterion for variance-based SEM is the heterotrait–monotrait ratio of correlations (HTMT) and its improved successor HTMT2 (Henseler, Ringle, and Sarstedt 2015; Roemer, Schuberth, and Henseler 2021).

The HTMT is defined as the ratio of (a) the average of all cross-construct inter-indicator correlations (heterotrait–heteromethod correlations) to (b) the geometric mean of the average within-construct inter-indicator correlations (monotrait–heteromethod correlations),

\[ \mathrm{HTMT} = \frac{\bar{r}_{\text{HT}}}{\sqrt{\bar{r}_{A} \cdot \bar{r}_{B}}}, \tag{35.11}\]

where \(\bar{r}_{\text{HT}}\) is the mean of all correlations between indicators of different constructs and \(\bar{r}_A\), \(\bar{r}_B\) are the mean within-construct correlations for each construct. A value of HTMT below \(0.85\) (or a more liberal \(0.90\), often applied when the constructs are conceptually similar) is evidence of discriminant validity; values approaching 1 indicate that the two constructs are empirically indistinguishable. HTMT2 (Roemer, Schuberth, and Henseler 2021) replaces the arithmetic means of the indicator correlations, in the numerator and the denominator alike, with geometric means, producing a consistent estimator of the inter-construct correlation under congeneric (heterogeneous loading) models; an 816,000-condition Monte Carlo study found HTMT2 less biased than HTMT when loadings are heterogeneous. For covariance-based SEM (CFA), discriminant validity is better assessed by a \(\chi^2\) difference test or confidence-interval test comparing a free-correlation model to one constraining the inter-construct correlation to 1.
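Equation 35.11 is simple enough to compute by hand from an indicator correlation matrix. The function below is a minimal sketch for two constructs. Its name and arguments are hypothetical, the correlation matrix is constructed for illustration, and dedicated SEM packages may differ in details such as the handling of negative correlations.

```r
# HTMT for two constructs from an indicator correlation matrix R.
# idx_a and idx_b are the column positions of each construct's indicators.
htmt <- function(R, idx_a, idx_b) {
  within_mean <- function(idx) {
    block <- R[idx, idx]
    mean(block[upper.tri(block)])          # mean monotrait-heteromethod correlation
  }
  hetero <- mean(R[idx_a, idx_b])          # mean heterotrait-heteromethod correlation
  hetero / sqrt(within_mean(idx_a) * within_mean(idx_b))
}

# Illustrative correlation matrix: within-construct r = 0.6, cross-construct r = 0.3
R <- matrix(0.3, 6, 6)
R[1:3, 1:3] <- 0.6
R[4:6, 4:6] <- 0.6
diag(R) <- 1

htmt(R, 1:3, 4:6)   # 0.3 / sqrt(0.6 * 0.6) = 0.5
```

A ratio of 0.5 sits well below the 0.85 benchmark, so these two simulated constructs would count as empirically distinct. Raising the cross-construct correlations toward 0.6 pushes the ratio toward one.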

A construct that fails discriminant validity against a neighbor is not separately measurable from it, and the usual remedy is conceptual: merge the constructs or sharpen their definitions. These criteria are reflective; for formative constructs, validity is assessed through the nomological net and the significance of formative weights instead.

35.4.3 Criterion and Nomological Validity

Criterion validity is the scale’s correlation with an external criterion it should predict—concurrent when measured contemporaneously, predictive when the criterion is future behavior. Nomological validity, the most demanding, asks whether the construct relates to other constructs as theory dictates: a valid brand-equity scale should predict price premium and choice share in the direction Chapter 11 specifies. Validity, unlike reliability, is never established by a single coefficient; it is the accumulated weight of a construct behaving as its theory says it should.

35.4.4 Inter-Rater Reliability for Coded Data

When measurement comes from human coders rather than respondents—coding ad content, classifying reviews, labeling images—reliability is agreement between raters corrected for chance. Cohen’s kappa (Cohen 1960) for two raters is

\[ \kappa \;=\; \frac{p_o - p_e}{1 - p_e}, \tag{35.12}\]

where \(p_o\) is observed agreement and \(p_e\) is the agreement expected if raters labeled independently at their marginal rates. \(\kappa = 1\) is perfect agreement and \(\kappa = 0\) is chance-level agreement; the conventional benchmarks read \(0.41\)–\(0.60\) as moderate, \(0.61\)–\(0.80\) as substantial, and above \(0.80\) as almost perfect (Landis and Koch 1977). For continuous ratings, the intraclass correlation coefficient (ICC) plays the analogous role, partitioning variance into rater, target, and error components (Shrout and Fleiss 1979). The worked example below estimates kappa for a two-coder labeling task.

set.seed(42)

# Two coders label 200 reviews as positive (1) or negative (0).
# Simulate a shared latent truth plus idiosyncratic coder noise.
n      <- 200
truth  <- rbinom(n, 1, 0.5)
flip_a <- rbinom(n, 1, 0.10)          # coder A errs 10% of the time
flip_b <- rbinom(n, 1, 0.12)          # coder B errs 12% of the time
coder_a <- ifelse(flip_a == 1, 1 - truth, truth)
coder_b <- ifelse(flip_b == 1, 1 - truth, truth)

tab <- table(coder_a, coder_b)
p_o <- sum(diag(tab)) / sum(tab)                       # observed agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2   # chance agreement
kappa <- (p_o - p_e) / (1 - p_e)

cat("Observed agreement p_o:", round(p_o, 3), "\n")
#> Observed agreement p_o: 0.805
cat("Chance agreement   p_e:", round(p_e, 3), "\n")
#> Chance agreement   p_e: 0.504
cat("Cohen's kappa        :", round(kappa, 3), "\n")
#> Cohen's kappa        : 0.607

35.4.5 A Worked Reflective Measurement Model

To make Equation 35.5 through Equation 35.10 concrete, we simulate a single reflective construct with four congeneric indicators, recover the loadings by factor analysis, and compute composite reliability and AVE from the estimates.

set.seed(2024)

n       <- 500
lambda  <- c(0.85, 0.80, 0.75, 0.70)   # true (heterogeneous) loadings
theta   <- 1 - lambda^2                 # error variances (unit-variance items)
eta     <- rnorm(n)                     # latent construct, Var = 1

# Generate indicators under the reflective model x_k = lambda_k * eta + e_k
X <- sapply(seq_along(lambda), function(k) lambda[k] * eta + rnorm(n, sd = sqrt(theta[k])))
colnames(X) <- paste0("x", seq_along(lambda))

# Recover a one-factor solution; loadings are the std. correlations with the factor
fa <- factanal(X, factors = 1, scores = "none")
lhat <- as.numeric(fa$loadings[, 1])     # estimated loadings
ehat <- 1 - lhat^2                       # implied error variances

CR  <- sum(lhat)^2 / (sum(lhat)^2 + sum(ehat))
AVE <- sum(lhat^2) / (sum(lhat^2) + sum(ehat))

# Cronbach's alpha from the item covariance matrix
S       <- cov(X)
K       <- ncol(X)
alpha   <- (K / (K - 1)) * (1 - sum(diag(S)) / sum(S))

cat("Estimated loadings:", round(lhat, 3), "\n")
#> Estimated loadings: 0.85 0.831 0.71 0.72
cat("Cronbach's alpha  :", round(alpha, 3), "\n")
#> Cronbach's alpha  : 0.859
cat("Composite reliab. :", round(CR, 3), "\n")
#> Composite reliab. : 0.861
cat("AVE               :", round(AVE, 3), "(> 0.50 => convergent)\n")
#> AVE               : 0.609 (> 0.50 => convergent)

Because the loadings are heterogeneous, alpha (0.859) sits just below composite reliability (0.861) in this example: the tau-equivalence gap discussed under Equation 35.8, small here because the loadings differ only modestly. An AVE above \(0.50\) is evidence of convergent validity for the simulated construct.

35.5 Common-Method Bias

A pervasive threat to survey-based measurement is common-method variance (CMV): variance attributable to the measurement method rather than the constructs the measures represent. When the same respondent rates both an independent and a dependent variable, on the same scale, at the same sitting, shared method factors—consistency motifs, social desirability, acquiescence, common scale anchors, transient mood—inject a spurious component into every item. The bias, common-method bias (CMB), is the distortion this induces in observed relationships, and it can inflate or deflate correlations (MacKenzie, Lutz, and Belch 1986).

35.5.1 A Formal Model of the Bias

Augment the reflective model Equation 35.5 with a method factor \(M\) common to all items measured by the shared method:

\[ x_k \;=\; \lambda_k\, \eta \;+\; \omega_k\, M \;+\; \varepsilon_k, \tag{35.13}\]

with \(M \perp \eta\), \(M \perp \varepsilon_k\), and method loadings \(\omega_k\). Let \(x_j\) be an indicator of \(\eta\) and \(y_l\) an indicator of a different substantive construct \(\eta'\) (also independent of \(M\)), both collected by the same method. The two items now share the method factor, so their observed covariance picks up a method term that has nothing to do with their true relationship:

\[ \mathrm{Cov}(x_j, y_l) \;=\; \underbrace{\lambda_j \lambda_l\,\mathrm{Cov}(\eta,\eta')}_{\text{substantive}} \;+\; \underbrace{\omega_j \omega_l\,\sigma_M^2}_{\text{method bias}} . \tag{35.14}\]

When predictor and outcome share the method, the second term is added to every cross-construct covariance, biasing structural estimates—usually upward, because method loadings of the same sign inflate correlations, though differential signs can deflate. The contaminant is most dangerous in single-source, single-instrument designs: exactly the cross-sectional self-report survey that dominates applied marketing.
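The size of the method term in Equation 35.14 is easy to evaluate for given loadings. The sketch below assumes unit-variance \(\eta\), \(\eta'\), and \(M\); the function name is hypothetical, and the input values are illustrative (they match the loadings used in the simulation later in this section).

```r
# Implied correlation between x_j and y_l under Equations 35.13 and 35.14,
# assuming unit-variance eta, eta', and M
implied_cor <- function(lam_x, lam_y, om_x, om_y, phi, err_var_x, err_var_y) {
  cov_xy <- lam_x * lam_y * phi + om_x * om_y          # substantive + method term
  var_x  <- lam_x^2 + om_x^2 + err_var_x
  var_y  <- lam_y^2 + om_y^2 + err_var_y
  cov_xy / sqrt(var_x * var_y)
}

# Substantive correlation phi = 0, so everything observed is method bias
implied_cor(0.8, 0.8, 0.5, 0.5, phi = 0, err_var_x = 0.09, err_var_y = 0.09)

# With no shared method (omega = 0) the implied correlation is zero
implied_cor(0.8, 0.8, 0, 0, phi = 0, err_var_x = 0.09, err_var_y = 0.09)
```

With method loadings of 0.5 the first call gives 0.25 / 0.98, about 0.26: two constructs that are unrelated by construction appear moderately correlated purely because their items share an instrument.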

35.5.2 Procedural and Statistical Remedies

Defenses fall into two families. Procedural remedies design the bias out before data collection: obtain predictor and outcome measures from different sources (e.g., attitudes from the consumer, behavior from the firm’s records) or at different times; separate the measurement of constructs psychologically or temporally; counterbalance item order; protect respondent anonymity to blunt social desirability; and write items to reduce ambiguity and common scale anchoring. Separating the source of the predictor from the source of the outcome severs the \(M\) that links them in Equation 35.14 and is the single most effective remedy.

Statistical remedies detect or partial out the method factor after the fact, and none is a substitute for good design. Harman’s single-factor test—loading all items on one unrotated factor and checking that it explains less than half the variance—is widely reported but weak, detecting only gross contamination. Stronger is to estimate Equation 35.13 directly: include an unmeasured latent method factor in the CFA, let all indicators load on it alongside their substantive constructs, and compare structural estimates with and without it. A marker-variable approach uses a theoretically unrelated marker construct as a proxy for \(M\) and partials its correlation out of the substantive correlations. The diagnostic below implements the marker-variable logic on simulated data containing a known method component.

set.seed(7)

n   <- 800
eta1 <- rnorm(n); eta2 <- rnorm(n)            # two substantive constructs
M    <- rnorm(n)                               # shared method factor

# Substantive correlation between constructs is set to ZERO here:
# any observed X-Y correlation is pure method bias.
w <- 0.5                                        # common method loading
x <- 0.8 * eta1 + w * M + rnorm(n, sd = 0.3)    # predictor (one item)
y <- 0.8 * eta2 + w * M + rnorm(n, sd = 0.3)    # outcome   (one item)
marker <- 0.0 * eta1 + w * M + rnorm(n, sd = 0.3)  # theoretically unrelated marker

raw_xy <- cor(x, y)

# Marker-variable correction: subtract the method-induced correlation,
# proxied by the smallest observed correlation with the marker.
r_xm <- cor(x, marker); r_ym <- cor(y, marker)
r_m  <- min(r_xm, r_ym)                         # conservative marker correlation
adj_xy <- (raw_xy - r_m) / (1 - r_m)            # partial out shared method

cat("Raw X-Y correlation (contaminated):", round(raw_xy, 3), "\n")
#> Raw X-Y correlation (contaminated): 0.312
cat("Marker correlation used           :", round(r_m, 3), "\n")
#> Marker correlation used           : 0.417
cat("Method-adjusted X-Y correlation   :", round(adj_xy, 3), "\n")
#> Method-adjusted X-Y correlation   : -0.181
cat("True substantive correlation      : 0.000\n")
#> True substantive correlation      : 0.000

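The raw correlation is spuriously positive: it is pure method bias, since the substantive correlation was set to zero. The marker adjustment removes that bias but overshoots, returning \(-0.181\) rather than zero. The reason is instructive. The simulated marker is nothing but method variance plus noise, so it correlates with \(x\) and \(y\) more strongly (0.417) than \(x\) and \(y\) correlate with each other through \(M\) (0.312), and subtracting it over-corrects. The adjustment is well calibrated only when the marker carries about the same share of method variance as the substantive items, an assumption that cannot be checked in real data. The lesson is the one Equation 35.14 formalizes: when a relationship is estimated from a single self-report instrument, part of it may be the instrument talking to itself, and the credible defense is procedural separation of sources, with statistical partialling as a fallback.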
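Harman’s single-factor check from Section 35.5.2 can be sketched in the same spirit. The code below uses the first eigenvalue of the item correlation matrix (an unrotated principal component) as a stand-in for the unrotated factor; that substitution, the helper names, and the method loadings of 0.3 and 0.6 are all assumptions made for illustration.

```r
set.seed(21)                                   # arbitrary seed

n  <- 600
f1 <- rnorm(n); f2 <- rnorm(n)                 # two independent substantive constructs
M  <- rnorm(n)                                 # shared method factor

# Three items per construct, each with method loading w
make_items <- function(f, w) sapply(1:3, function(i) 0.7 * f + w * M + rnorm(n, sd = 0.5))

first_factor_share <- function(X) {
  ev <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
  ev[1] / sum(ev)                              # variance share of first unrotated component
}

clean    <- cbind(make_items(f1, 0),   make_items(f2, 0))     # no method variance
moderate <- cbind(make_items(f1, 0.3), make_items(f2, 0.3))   # moderate contamination
gross    <- cbind(make_items(f1, 0.6), make_items(f2, 0.6))   # heavy contamination

shares <- c(clean = first_factor_share(clean),
            moderate = first_factor_share(moderate),
            gross = first_factor_share(gross))
round(shares, 3)
```

In this setup the moderately contaminated data stay under the 50 percent rule of thumb even though every item carries method variance, and only the heavily contaminated data trip it. That pattern is what "detecting only gross contamination" means in practice.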

35.6 Key Takeaways

Measurement is a modeling problem. Under true-score theory (Equation 35.1), reliability is a variance ratio (Equation 35.3), and unreliability attenuates every estimated relationship (Equation 35.4).
The reflective–formative distinction is a claim about the direction of causality between construct and indicators (Equation 35.5 vs. Equation 35.6) and dictates the entire validity battery; it must be decided conceptually, before data collection (Table 35.1).
A pure formative model is not identified; it needs external structure, most commonly the MIMIC specification (Equation 35.7), to fix its metric and disturbance.
Internal-consistency reliability (alpha, CR) and AVE apply to reflective constructs only; formative indices are judged by weights and collinearity.
The Fornell–Larcker criterion and the cross-loadings assessment have poor sensitivity for detecting discriminant-validity failures in variance-based SEM; HTMT and HTMT2 are the currently recommended criteria, with values below 0.85 indicating acceptable discriminant validity (Henseler, Ringle, and Sarstedt 2015; Roemer, Schuberth, and Henseler 2021).
Common-method bias arises when one instrument supplies both predictor and outcome (Equation 35.14); the strongest remedy is procedural—separate the sources—not a post-hoc statistical patch.

35.7 Further Reading

The two standard compendia of validated marketing instruments are the Handbook of Marketing Scales (Bearden, Netemeyer, and Haws 2011) and the Marketing Scales Handbook (Stewart et al. 1993); both organize scales by construct with reliability and validity evidence. For the reflective–formative debate, Coltman et al. (2008) and Diamantopoulos, Fritz, and Hildebrandt (2013) give the theoretical and operational decision rules, and Edwards (2010) examines the construct-validity consequences of the choice. Fornell and Larcker (1981) remains the reference for convergent and discriminant validity with unobservable variables; Henseler, Ringle, and Sarstedt (2015) introduced HTMT as the preferred discriminant validity criterion for variance-based SEM and Roemer, Schuberth, and Henseler (2021) refined it to HTMT2. MacKenzie, Lutz, and Belch (1986) is the standard reference for method effects in marketing measurement. The branding constructs whose measurement models are actively contested—brand love (Batra, Ahuvia, and Bagozzi 2012; Bagozzi, Batra, and Ahuvia 2016) and authenticity (Morhart et al. 2015; Nunes, Ordanini, and Giambastiani 2021)—are treated substantively in Chapter 11.

Bagozzi, Richard P., Rajeev Batra, and Aaron Ahuvia. 2016. “Brand Love: Development and Validation of a Practical Scale.” Marketing Letters 28 (1): 1–14. https://doi.org/10.1007/s11002-016-9406-1.

Batra, Rajeev, Aaron Ahuvia, and Richard P. Bagozzi. 2012. “Brand Love.” Journal of Marketing 76 (2): 1–16. https://doi.org/10.1509/jm.09.0339.

Bearden, William, Richard Netemeyer, and Kelly Haws. 2011. “Handbook of Marketing Scales: Multi-Item Measures for Marketing and Consumer Behavior Research.” https://doi.org/10.4135/9781412996761.

Cohen, Jacob. 1960. “A Coefficient of Agreement for Nominal Scales.” Educational and Psychological Measurement 20 (1): 37–46. https://doi.org/10.1177/001316446002000104.

Coltman, Tim, Timothy M. Devinney, David F. Midgley, and Sunil Venaik. 2008. “Formative Versus Reflective Measurement Models: Two Applications of Formative Measurement.” Journal of Business Research 61 (12): 1250–62. https://doi.org/10.1016/j.jbusres.2008.01.013.

Diamantopoulos, Adamantios, Wolfgang Fritz, and Lutz Hildebrandt. 2013. Quantitative Marketing and Marketing Management: Marketing Models and Methods in Theory and Practice. Springer.

Edwards, Jeffrey R. 2010. “The Fallacy of Formative Measurement.” Organizational Research Methods 14 (2): 370–88. https://doi.org/10.1177/1094428110378369.

Fornell, Claes, and David F. Larcker. 1981. “Structural Equation Models with Unobservable Variables and Measurement Error: Algebra and Statistics.” Journal of Marketing Research 18 (3): 382. https://doi.org/10.2307/3150980.

Henseler, Jörg, Christian M. Ringle, and Marko Sarstedt. 2015. “A New Criterion for Assessing Discriminant Validity in Variance-Based Structural Equation Modeling.” Journal of the Academy of Marketing Science 43 (1): 115–35. https://doi.org/10.1007/s11747-014-0403-8.

Landis, J. Richard, and Gary G. Koch. 1977. “The Measurement of Observer Agreement for Categorical Data.” Biometrics 33 (1): 159. https://doi.org/10.2307/2529310.

MacKenzie, Scott B., Richard J. Lutz, and George E. Belch. 1986. “The Role of Attitude Toward the Ad as a Mediator of Advertising Effectiveness: A Test of Competing Explanations.” Journal of Marketing Research 23 (2): 130. https://doi.org/10.2307/3151660.

Morhart, Felicitas, Lucia Malär, Amélie Guèvremont, Florent Girardin, and Bianca Grohmann. 2015. “Brand Authenticity: An Integrative Framework and Measurement Scale.” Journal of Consumer Psychology 25 (2): 200–218. https://doi.org/10.1016/j.jcps.2014.11.006.

Nunes, Joseph C., Andrea Ordanini, and Gaia Giambastiani. 2021. “The Concept of Authenticity: What It Means to Consumers.” Journal of Marketing 85 (4): 1–20. https://doi.org/10.1177/0022242921997081.

Roemer, Eva, Florian Schuberth, and Jörg Henseler. 2021. “HTMT2 – an Improved Criterion for Assessing Discriminant Validity in Structural Equation Modeling.” Industrial Management & Data Systems 121 (12): 2637–50. https://doi.org/10.1108/IMDS-02-2021-0082.

Shrout, Patrick E., and Joseph L. Fleiss. 1979. “Intraclass Correlations: Uses in Assessing Rater Reliability.” Psychological Bulletin 86 (2): 420–28. https://doi.org/10.1037/0033-2909.86.2.420.

Stewart, David W., William O. Bearden, Richard G. Netemeyer, Mary F. Mobley, Gordon C. Bruner, and Paul J. Hensel. 1993. “Handbook of Marketing Scales, Multi-Item Measures for Marketing and Consumer Behavior Research.” Journal of Marketing Research 30 (4): 525. https://doi.org/10.2307/3172696.

¹ The Spearman–Brown prophecy formula gives the reliability of a test lengthened by a factor \(k\) as \(\rho_k = k\rho_1 / [1 + (k-1)\rho_1]\), where \(\rho_1\) is the reliability of the original test. It assumes the added items are parallel (equal true-score loadings and equal error variances), an assumption formative indicators violate by construction.