36  Surveys and Experiments

Most of what marketing researchers want to know—what consumers intend to buy, how much they value an attribute, whether a campaign caused a sale—cannot be read directly off transactional data. It must be elicited or manufactured: elicited through surveys that ask people to report latent states, or manufactured through experiments that intervene on the world and observe the response. This chapter treats the two as a matched pair. Both are instruments for recovering a quantity that the analyst cannot observe, and both fail in characteristic, formalizable ways. The survey’s central problem is error—the gap between what respondents report and the population truth—decomposed into how respondents are selected and how they answer. The experiment’s central problem is validity—whether the measured effect is the causal effect, generalizes beyond the lab, and operationalizes the construct the researcher intended.

The two instruments also interact, and not benignly. The very act of surveying a consumer can change that consumer’s later behavior, so a survey designed to measure intentions can create the intentions it claims to observe. Recognizing such feedback is the first step toward using either instrument credibly. By the end of the chapter the reader should be able to state the total-survey-error decomposition, build an unbiased estimator from a known sampling design, correct cross-respondent incomparability with anchoring vignettes, design an experiment whose causal claim survives scrutiny, calculate a defensible sample size, and apply the open-science toolkit that guards against false discovery.

36.1 Surveys as Measurement

A survey is a measurement device whose output is a set of self-reports. Two properties make it powerful and dangerous in equal measure. It is cheap, so it scales to populations and to questions—intentions, attitudes, satisfaction—that leave no behavioral trace. And it is reactive: the instrument can perturb the state it measures.

36.1.1 Self-Generated Validity: When Measurement Changes Behavior

The canonical reactivity result concerns purchase intentions. Researchers routinely ask “How likely are you to buy product \(X\) in the next month?” and treat the answer as a noisy reading of a pre-existing propensity. Chandon, Morwitz, and Reinartz (2005) show that the reading is not passive. Merely asking the intentions question increases the correspondence between stated intentions and subsequent purchase—a phenomenon they call self-generated validity. The respondent, prompted to articulate an intention, forms or crystallizes an attitude that did not fully exist before the question and then acts on it. The intention–behavior link is therefore endogenous to the measurement: it is, on average, 58% stronger among surveyed consumers than among otherwise-comparable nonsurveyed consumers (Chandon, Morwitz, and Reinartz 2005). The implication is sharp. An intentions survey does not merely forecast demand; it is a light-touch intervention that nudges demand. A forecast calibrated on surveyed respondents will overstate the predictability of an unsurveyed population.

Self-generated validity. The increase in the predictive validity of a stated intention that is caused by the act of stating it. Asking the question reactively strengthens the very intention–behavior correlation the researcher intends to measure passively (Chandon, Morwitz, and Reinartz 2005).

36.1.2 The Intention–Behavior Correlation

How tightly do intentions track behavior once reactivity is acknowledged? The meta-analytic benchmark from the theory of reasoned action places the intention–behavior correlation near \(\rho \approx 0.53\) across domains (Sheppard, Hartwick, and Warshaw 1988). This is a substantial but far-from-deterministic association: stated intentions explain on the order of \(\rho^2 \approx 28\%\) of the variance in subsequent behavior. The residual is the space in which forecasting models earn their keep. Morwitz and Schmittlein (1992) show that segmenting before forecasting materially improves predictive power: a model that uses historical purchase behavior together with stated intention, and that allows the mapping from intention to purchase to differ across consumer segments, forecasts sales better than a pooled model that imposes a single intention–purchase slope. The intuition is that the intention–behavior gap is heterogeneous—deliberate planners realize their stated intentions far more reliably than impulsive or constrained consumers—so pooling averages over groups whose mapping functions genuinely differ.
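The gain from segmenting can be illustrated with a small simulation. This is a minimal sketch, not the Morwitz and Schmittlein model: the two segment labels, the slopes, the 1-5 intention scale, and the sample size are all hypothetical, and historical purchase behavior is left out for brevity. It compares a pooled logistic mapping with a segment-specific one on held-out data, using the Brier score (mean squared error of predicted purchase probabilities, lower is better).

```r
set.seed(1)
n <- 20000
segment   <- sample(c("planner", "impulsive"), n, replace = TRUE)
intention <- sample(1:5, n, replace = TRUE)        # stated intention, 1-5 scale

# Assumed data-generating process: planners convert stated intentions into
# purchases far more reliably than impulsive consumers (hypothetical slopes).
slope    <- ifelse(segment == "planner", 1.2, 0.2)
purchase <- rbinom(n, 1, plogis(-3 + slope * intention))
d <- data.frame(segment, intention, purchase)

train <- d[1:(n / 2), ]
test  <- d[(n / 2 + 1):n, ]

# Pooled model: one intention-purchase slope for everyone.
pooled    <- glm(purchase ~ intention,           family = binomial, data = train)
# Segmented model: the slope and intercept may differ by segment.
segmented <- glm(purchase ~ intention * segment, family = binomial, data = train)

brier <- function(model) {
  p_hat <- predict(model, newdata = test, type = "response")
  mean((test$purchase - p_hat)^2)
}
brier_pooled    <- brier(pooled)
brier_segmented <- brier(segmented)
round(c(pooled = brier_pooled, segmented = brier_segmented), 4)
```

Under these assumptions the segmented model should score lower, because the pooled slope averages over two mappings that genuinely differ.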

The practical sequence is summarized in Figure 36.1.

flowchart LR
    A["Elicit stated<br/>intention"] --> B["Segment respondents<br/>(planners vs. impulsive,<br/>constrained, etc.)"]
    B --> C["Estimate segment-specific<br/>intention to purchase mapping"]
    C --> D["Aggregate to<br/>sales forecast"]
    A -. "self-generated<br/>validity" .-> E["Actual purchase<br/>behavior"]
    D --> E
Figure 36.1: From elicited intentions to a sales forecast. The dashed feedback edge is self-generated validity: the act of measurement perturbs the behavior being forecast.

36.2 The Total Survey Error Framework

A survey statistic can miss the truth for two broad reasons: the people who answer are not the people the analyst wanted, and the answers they give are not the values the analyst wanted. The total survey error (TSE) framework organizes every way a survey can go wrong under these two headings and, orthogonally, separates systematic error (bias) from sampling-driven dispersion (variance) (Groves and Lyberg 2010). Formally, for an estimator \(\hat\theta\) of a population target \(\theta\), the relevant loss is mean squared error, \[ \mathrm{MSE}(\hat\theta) = \underbrace{\bigl(\mathbb{E}[\hat\theta] - \theta\bigr)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}(\hat\theta)}_{\text{variance}}, \tag{36.1}\] and TSE decomposes both terms by source. The first decomposition is by stage of the survey process: \[ \text{Total Survey Error} = \underbrace{\text{Measurement Error}}_{\text{the answers}} + \underbrace{\text{Representation Error}}_{\text{the people}}. \tag{36.2}\]

The measurement branch tracks a single response from the target construct the analyst cares about, to the measurement the question actually taps, to the response the person gives, to the edited datum stored after processing—accruing validity error, response error, and processing error along the way. The representation branch tracks the units from the target population, to the sampling frame (the list from which units are drawn), to the sample, to the respondents who actually answer—accruing coverage error, sampling error, and nonresponse error. Figure 36.2 renders both branches.

flowchart TB
    subgraph M["Measurement (the answers)"]
        C1["Construct"] -->|validity| C2["Measurement"]
        C2 -->|response error| C3["Response"]
        C3 -->|processing error| C4["Edited response"]
    end
    subgraph R["Representation (the people)"]
        P1["Target population"] -->|coverage error| P2["Sampling frame"]
        P2 -->|sampling error| P3["Sample"]
        P3 -->|nonresponse error| P4["Respondents"]
    end
    C4 --> S["Survey statistic"]
    P4 --> S
Figure 36.2: The total survey error framework: a measurement (left) and a representation (right) branch, each accumulating distinct error sources from the construct or target population down to the final respondent-level datum. Adapted from the decomposition in Groves and Lyberg (2010).

The framework is more than a taxonomy: it tells the analyst where to spend the next dollar. A statistic can be unbiased yet imprecise (a clean design with a small sample), or precise yet badly biased (a large but unrepresentative convenience panel). Because the two failure modes call for opposite remedies—more units versus a better frame—naming the dominant source is the prerequisite to fixing it.
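The contrast between the two failure modes can be made concrete with a repeated-sampling sketch. Everything here is an assumption chosen for illustration: the population, the sample sizes, and the mechanism that makes high-outcome units easier to reach in the convenience panel. The code estimates bias, variance, and MSE for each design and checks the decomposition in Equation 36.1 numerically.

```r
set.seed(2)
N <- 100000
y <- rnorm(N, mean = 50, sd = 10)      # hypothetical population outcome
theta <- mean(y)                       # population target

# Assumed convenience-panel mechanism: high-y units are easier to reach.
p_panel <- 0.1 * plogis((y - 50) / 10)

one_draw <- function() {
  c(small_srs = mean(sample(y, 50)),           # clean design, tiny sample
    big_panel = mean(y[runif(N) < p_panel]))   # about 5,000 units, skewed frame
}
draws <- replicate(500, one_draw())    # 2 x 500 matrix of estimates

bias     <- rowMeans(draws) - theta
# Variance across replications (divisor 500, so the identity below is exact).
variance <- apply(draws, 1, function(e) mean((e - mean(e))^2))
mse      <- rowMeans((draws - theta)^2)
round(rbind(bias, variance, mse, bias2_plus_var = bias^2 + variance), 3)
```

In this setup the small random sample is noisy but centered on the truth, while the large panel is tightly concentrated around the wrong value, so adding more panel members would not help.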

36.2.1 Three Eras of Survey Practice

The TSE trade-offs have not been static; the technology of contacting respondents has reshaped which errors bind. Table 36.1 sketches the three eras now standard in survey methodology. The trajectory is one of falling cost and falling response rates: each era widened reach and cut expense while making the representation branch harder to control, culminating in non-probability online panels that must lean on modeling rather than design to justify inference.

Table 36.1: Three eras of survey research, after the design taxonomy popularized in survey methodology.
| Era | Sampling | Interview mode | Data environment |
|---|---|---|---|
| First | Area probability | Face-to-face | Stand-alone |
| Second | Random-digit-dial probability | Telephone | Stand-alone |
| Third | Non-probability | Computer-administered | Linked |

36.3 Probability and Non-Probability Sampling

The representation branch lives or dies on the sampling design. A sample is a probability sample if every unit in the frame has a known, non-zero probability of inclusion. This single property is what licenses design-based inference: because the inclusion probabilities are known, the analyst can reweight the sample to recover an unbiased estimate of the population total even when the design oversamples some units.

36.3.1 The Horvitz–Thompson Estimator

Let \(\pi_i = \Pr(i \in s)\) be unit \(i\)’s inclusion probability and let \(y_i\) be the quantity of interest. The Horvitz–Thompson estimator of the population mean weights each sampled observation by the inverse of its inclusion probability, \[ \hat{\bar y}_{\text{HT}} = \frac{1}{N}\sum_{i \in s} \frac{y_i}{\pi_i}, \tag{36.3}\] where \(N\) is the population size and \(s\) the realized sample. Inverse-probability weighting is the formal content of the slogan “weighting recovers representativeness”: a unit sampled at half the average rate stands in for twice as many population members and so receives twice the weight. The estimator is design-unbiased, that is, \(\mathbb{E}[\hat{\bar y}_{\text{HT}}] = \bar y\) under the randomization that generated \(s\), for any design with strictly positive \(\pi_i\), which is precisely why the positivity requirement is non-negotiable.1
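The unbiasedness claim can be verified in one line. As a sketch, write \(I_i = \mathbf{1}\{i \in s\}\) for the sample-membership indicator (notation introduced here for illustration only), so that \(\mathbb{E}[I_i] = \pi_i\) under the design. Then \[ \mathbb{E}\bigl[\hat{\bar y}_{\text{HT}}\bigr] = \frac{1}{N}\,\mathbb{E}\Bigl[\sum_{i=1}^{N} I_i\,\frac{y_i}{\pi_i}\Bigr] = \frac{1}{N}\sum_{i=1}^{N} \frac{y_i}{\pi_i}\,\mathbb{E}[I_i] = \frac{1}{N}\sum_{i=1}^{N} y_i = \bar y. \] The \(y_i\) are treated as fixed; only the indicators are random. The argument divides by \(\pi_i\), so a unit with \(\pi_i = 0\) never enters the sum and no weight can recover its contribution.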

A short simulation makes the bias correction concrete. We construct a population in which the outcome is correlated with the inclusion probability—the worst case for a naive sample mean—and verify that the Horvitz–Thompson estimator recovers the truth while the unweighted mean does not.

set.seed(31)

N <- 100000
# A covariate that drives BOTH the outcome and the sampling probability,
# so the naive sample mean is biased and weighting is required.
x      <- runif(N)
y      <- 2 + 3 * x + rnorm(N, sd = 0.5)        # population outcome
pop_mean <- mean(y)

# Unequal-probability design: high-x units are sampled more often.
pi_i   <- 0.02 + 0.08 * x                         # inclusion probability in (0.02, 0.10)
in_sample <- runif(N) < pi_i

y_s    <- y[in_sample]
pi_s   <- pi_i[in_sample]

naive_mean <- mean(y_s)                           # ignores the design -> biased
ht_mean    <- sum(y_s / pi_s) / N                 # Horvitz-Thompson -> unbiased

cat("Population mean:        ", round(pop_mean, 4), "\n")
#> Population mean:         3.4955
cat("Naive sample mean:      ", round(naive_mean, 4), "  (biased upward)\n")
#> Naive sample mean:       3.8335   (biased upward)
cat("Horvitz-Thompson mean:  ", round(ht_mean, 4),  "  (design-unbiased)\n")
#> Horvitz-Thompson mean:   3.4613   (design-unbiased)

36.3.2 The Non-Response Problem

The third-era pivot to non-probability and online sampling breaks the clean logic above, because the operative inclusion probability is no longer the design probability the analyst set but the response probability the respondent chose. Let \(R_i \in \{0,1\}\) indicate that unit \(i\) both is selected and chooses to respond, with response propensity \(\rho_i = \Pr(R_i = 1)\). The respondent-based estimator is unbiased only if the analyst knows \(\rho_i\), and in practice \(\rho_i\) must be estimated—typically by modeling response as a function of observed covariates. This is where non-response error enters: if response depends on the outcome \(y_i\) even after conditioning on observed covariates (non-response that is not missing at random), no reweighting on observables removes the bias, and the survey is identified only under an untestable assumption about the missingness mechanism. The TSE framework thus locates the modern survey’s hardest problem precisely: representation error migrates from a design quantity the analyst controls to a behavioral quantity the analyst must model (Groves and Lyberg 2010).
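The difference between correctable and uncorrectable non-response can be sketched in a simulation. The population, the covariate, and both response mechanisms below are hypothetical. The propensity \(\rho_i\) is estimated by logistic regression on the observed covariate only, and the respondents are then weighted by the inverse of the fitted propensity.

```r
set.seed(3)
N <- 200000
x <- rnorm(N)              # observed covariate
e <- rnorm(N)              # part of the outcome the analyst never observes
y <- 1 + x + e

ipw_mean <- function(respond) {
  # The propensity model uses only what the analyst observes: x.
  fit     <- glm(as.integer(respond) ~ x, family = binomial)
  rho_hat <- fitted(fit)[respond]
  sum(y[respond] / rho_hat) / sum(1 / rho_hat)   # weights normalized to sum to one
}

r_mar  <- runif(N) < plogis(-1 + x)        # response depends on x only
r_mnar <- runif(N) < plogis(-1 + x + e)    # response also depends on y through e

results <- c(truth     = mean(y),
             naive_mar = mean(y[r_mar]),   # unweighted respondent mean
             ipw_mar   = ipw_mean(r_mar),
             ipw_mnar  = ipw_mean(r_mnar))
round(results, 3)
```

When response depends on the covariate alone, weighting should pull the respondent mean back to the truth. When it also depends on the unobserved part of the outcome, the same weights leave a bias that no function of `x` can remove.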

36.4 Online Panels and Data Quality

The practical home of modern marketing surveys is the online panel. Understanding platform quality is therefore a prerequisite, not an afterthought: the choice of panel determines which errors from Equation 36.2 dominate and can render even technically sound analyses uninterpretable.

36.4.1 The Platform Hierarchy

A now-extensive empirical literature benchmarks the major platforms. The headline finding is unambiguous: Prolific and CloudResearch substantially outperform Amazon Mechanical Turk (MTurk) on virtually every quality metric (Douglas, Ewell, and Brauer 2023; Peer et al. 2022). In the largest direct comparison, Douglas, Ewell, and Brauer (2023) recruited \(n = 2{,}729\) respondents across five platforms and embedded objective quality tests—video-recall accuracy, attention checks, unique device fingerprints, completion time, and meaningful open-ended responses. The results are striking. On a composite criterion requiring all quality tests to pass, Prolific achieved 67.9% high-quality participants and CloudResearch 61.9%, versus only 26.4% for MTurk. Video recall accuracy told the same story: 83.5% on Prolific and 81.8% on CloudResearch, compared to 52.2% on MTurk. Peer et al. (2022) corroborate this using a different design, finding that CloudResearch’s Approved participant pool produced implausible open-ended responses at 3.0%, versus 18.8% for a standard MTurk sample.

The source of MTurk’s decline is largely documented by Kennedy et al. (2021), who analyzed 38 studies (\(n = 24{,}930\) total, 2013–2018) and traced the quality crisis to a sharp rise in fraudulent respondents—non-US workers using virtual private servers (VPS) or VPN connections to circumvent US-only restrictions—from roughly 15% in 2015 to over 20% by late 2018, with individual studies hitting 26.9%. The quality damage is concentrated: legitimate respondents fail quality checks at 2.8%, whereas VPS users fail at 23.9% and verified non-US respondents at 32.0%. Critically, the traditional MTurk quality filter—requiring a 95% HIT acceptance rate and 100+ prior HITs completed—no longer provides meaningful protection, because fraudulent workers are experienced enough to satisfy these criteria.

The practical implication for researchers is direct: MTurk should not be the default for behavioral and attitudinal studies unless the researcher can verify participant authenticity through additional screening. Prolific, which pays at or above minimum wage and requires verified participant identities, and CloudResearch’s Approved pool, which screens out blocked participants, are the currently recommended alternatives. For studies requiring nationally representative samples, Lucid and similar router-based panels offer quota matching but introduce their own quality heterogeneity and require careful vendor-specific screening.

36.4.2 Attention Checks and Quality Filters

Even on high-quality platforms, a fraction of respondents engages carelessly. Attention checks—items designed to detect inattentive responding—are the most common data-quality tool. Three types are in wide use:

  1. Instructed response items. The question instructs the respondent to select a specific answer (e.g., “For quality control, please select ‘Strongly agree’ below”). Failure rates on legitimate platforms are typically low (2–10%) but higher on MTurk.
  2. Reading comprehension checks. A paragraph of text is followed by questions whose correct answers appear verbatim in the passage. Inattentive respondents cannot answer correctly without reading.
  3. Screener traps / infrequency items. Items with no plausible true-positive response (e.g., “I have visited the moon”) flag extreme agreement as careless or dishonest.

A key design question is whether to warn respondents of attention checks in advance. Forewarning increases apparent pass rates but may induce strategic compliance rather than genuine engagement, and in some paradigms treatment effects have been reported to shrink under forewarning relative to un-forewarned conditions. Current practice favors embedding checks unobtrusively without forewarning, placing them early and late in the survey to capture fatigue, and using multiple check types to reduce the chance that any single type creates a systematic artifact.

Two additional quality filters have become standard:

  • Completion time. Respondents who complete a survey far below the median time (e.g., less than one-third of median completion) are unlikely to have read all items. A conservative cutoff is removing the fastest 5–10% of completions after verifying the distribution is not bimodal (which would indicate a speeder mode distinct from the main distribution).
  • Duplicate detection. IP-address deduplication catches the same person completing the survey multiple times; browser fingerprinting and geolocation checks catch coordinated fraud (many responses from the same physical location).
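Both filters reduce to a few lines of base R. The sketch below runs on simulated data: the column names (`seconds`, `ip`), the duration distribution, and the planted speeders and duplicates are all hypothetical. It applies the one-third-of-median rule and flags every response that shares an IP address with another response.

```r
set.seed(4)
n   <- 400
idx <- seq_len(n)
resp <- data.frame(
  id      = idx,
  seconds = round(rlnorm(n, meanlog = log(600), sdlog = 0.25)),  # hypothetical durations
  ip      = sprintf("10.0.%d.%d", (idx - 1L) %/% 250L, (idx - 1L) %% 250L + 1L),
  stringsAsFactors = FALSE
)
resp$seconds[1:20] <- round(runif(20, 30, 120))   # planted speeders
resp$ip[21:25]     <- "10.9.9.9"                  # planted repeat submissions

# Speeder rule from the text: below one-third of the median completion time.
cutoff       <- median(resp$seconds) / 3
resp$speeder <- resp$seconds < cutoff

# Flag every row whose IP address appears more than once (not just the repeats).
resp$dup_ip <- duplicated(resp$ip) | duplicated(resp$ip, fromLast = TRUE)

table(speeder = resp$speeder, dup_ip = resp$dup_ip)
```

A shared IP address is only a prompt for inspection: households, offices, and campus networks can legitimately produce several respondents from one address.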

36.4.3 Survey Satisficing and Careless Responding

Beneath outright fraud lies a subtler quality threat: satisficing—the tendency to provide “good enough” rather than accurate answers (Krosnick 1991). Satisficing arises when the cognitive cost of careful responding exceeds the respondent’s motivation. It manifests in several behavioral signatures:

  • Straight-lining. Giving the same response to every item in a matrix question, ignoring item content. Detectable by computing the within-respondent standard deviation across a battery: a zero or near-zero value is diagnostic.
  • Non-differentiation. Using only a narrow range of scale points (e.g., responding 4 or 5 exclusively on a 1–7 scale) in a way that does not reflect genuine construct-level agreement.
  • Acquiescence bias. The tendency to agree with statements regardless of their content—sometimes called yea-saying. Acquiescence is partly a trait (stable across items) and partly a function of item wording and survey length. It can be diagnosed by including matched pairs of items worded in opposite directions: if the respondent agrees with both the item and its negation, they are acquiescing.
  • Midpoint endorsement / don’t-know overuse. Disproportionate use of the scale midpoint or neutral category, often as a low-effort response rather than a genuine expression of ambivalence.

Recommended detection pipeline: flag respondents with (a) a within-row SD of zero on any matrix battery longer than four items; (b) a total completion time below a reasonable minimum; and (c) an implausible open-ended response (e.g., a single character, gibberish, or a copied prompt). Each flag alone may be innocuous; two or more together strongly suggest careless responding. The analyst should report the number of flagged observations and sensitivity-check key results with and without them.

library(tidyverse)
set.seed(42)

n_resp <- 500
n_items <- 8

# Simulate a matrix battery with some careless respondents
# Careful respondents: independent uniform draws on 1-7 for each item
# Straight-liners: same value repeated (within-row SD = 0)
careful  <- matrix(sample(1:7, (n_resp - 50) * n_items, replace = TRUE),
                   nrow = n_resp - 50)
careless <- matrix(rep(sample(1:7, 50, replace = TRUE), each = n_items),
                   nrow = 50, byrow = TRUE)
battery  <- rbind(careful, careless)
colnames(battery) <- paste0("q", seq_len(n_items))

df <- as_tibble(battery) %>%
  mutate(respondent = row_number(),
         row_sd     = apply(battery, 1, sd),
         straightline = row_sd == 0)

cat("Straight-line flags detected:", sum(df$straightline),
    "of", n_resp, "respondents\n")
#> Straight-line flags detected: 50 of 500 respondents
cat("Expected (planted):", 50, "\n")
#> Expected (planted): 50
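The straight-lining check is one of three flags in the pipeline described above. A sketch of the "two or more flags" rule follows, on a six-row toy data set; the column names, the 200-second minimum, and the three-character rule for open-ended text are hypothetical choices, and the text rule is deliberately crude.

```r
resp <- data.frame(
  id       = 1:6,
  row_sd   = c(0, 1.2, 0, 0.9, 1.5, 0),          # within-row SD on a matrix battery
  seconds  = c(95, 640, 580, 110, 700, 80),      # total completion time
  open_end = c("x", "I liked the checkout flow", "asdfgh",
               "Good value overall", "k", "Fast delivery"),
  stringsAsFactors = FALSE
)
min_seconds <- 200   # hypothetical minimum plausible completion time

flag_straight <- resp$row_sd == 0
flag_fast     <- resp$seconds < min_seconds
flag_open     <- nchar(trimws(resp$open_end)) < 3   # catches only very short text

resp$n_flags  <- flag_straight + flag_fast + flag_open
resp$careless <- resp$n_flags >= 2                  # two or more flags together
resp[, c("id", "n_flags", "careless")]
```

Respondent 3 shows the limit of the length rule: the gibberish answer "asdfgh" passes it, so in practice open-ended screening usually needs manual review or a stronger text check.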

36.5 Response Styles and Systematic Measurement Biases

Beyond careless responding, response styles are systematic tendencies that distort scores even when respondents are fully engaged. They contribute to measurement error (Equation 36.2) because they introduce variance that is not attributable to the construct of interest.

36.5.1 Social Desirability

Social desirability bias is the tendency to give answers that present oneself favorably to others, regardless of one’s true attitudes or behaviors. It is especially consequential for sensitive constructs (health behavior, financial decisions, ethical attitudes, political views), where the socially approved answer diverges from the true answer. Two components are distinguished in the psychometric literature: impression management (conscious self-presentation) and self-deceptive enhancement (a genuine but inflated self-image). The Marlowe–Crowne Social Desirability Scale (Crowne and Marlowe 1960) and its abbreviated versions are the most widely used instruments for measuring and controlling social desirability as a covariate.

Design-based defenses are generally more effective than post-hoc statistical partialling. The bogus pipeline technique, which convinces respondents that a physiological device can detect dishonest answers, substantially reduced social desirability in early studies but is now rarely used because of ethical concerns and because respondents readily disbelieve the device. More practical alternatives include:

  • Anonymity assurances communicated prominently before sensitive questions.
  • Third-person phrasing (“Some people find that they… Do you?”), which reduces the normative force of the question.
  • Randomized response technique (RRT), where a random device (e.g., a coin flip unobserved by the researcher) determines whether the respondent answers the sensitive question or an innocuous one. The researcher can recover population prevalence estimates without knowing any individual’s true answer.
  • Indirect questioning through projective measures or implicit attitude tests, though the psychometric properties of many implicit measures remain debated.
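The randomized response logic in the list above can be sketched numerically. The version below is an unrelated-question design under assumed values: a fair coin routes each respondent to the sensitive or the innocuous question, the innocuous question has a known "yes" rate, and everyone answers truthfully. If the observed "yes" rate is \(\lambda = p\,\pi + (1 - p)\,q\), then prevalence is recovered as \(\hat\pi = (\hat\lambda - (1 - p)\,q)/p\).

```r
set.seed(5)
n       <- 50000
pi_true <- 0.15   # hypothetical prevalence of the sensitive behavior
p_sens  <- 0.5    # coin flip: probability of getting the sensitive question
q_innoc <- 0.5    # assumed known "yes" rate of the innocuous question

truth     <- rbinom(n, 1, pi_true)
gets_sens <- rbinom(n, 1, p_sens) == 1
# The researcher sees only `answer`, never `gets_sens` or `truth`.
answer    <- ifelse(gets_sens, truth, rbinom(n, 1, q_innoc))

lambda_hat <- mean(answer)
pi_hat <- (lambda_hat - (1 - p_sens) * q_innoc) / p_sens
se_hat <- sqrt(lambda_hat * (1 - lambda_hat) / n) / p_sens
round(c(pi_hat = pi_hat, se = se_hat), 4)
```

The standard error is inflated by the factor \(1/p\) relative to direct questioning, which is the price paid for individual-level privacy.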

36.5.2 Acquiescence Bias

As noted above, acquiescence inflates agreement with every item regardless of its direction. On a scale made up only of positively worded items this inflates the construct mean and, because the shared agreement tendency adds common variance unrelated to the construct, can distort structural relationships. The standard remedy is balanced item wording: including an equal number of items worded in each direction and reverse-scoring the negatively worded items before averaging. Balanced scales do not eliminate acquiescence entirely (a persistent acquiescer will agree with both poles), but the upward push on positively worded items and the downward push on reverse-scored items cancel, so the scale mean is acquiescence-neutral. A clean diagnostic is the polychoric correlation between matched positive–negative item pairs before reverse-scoring: strongly negative correlations confirm that the items are tapping the same underlying construct; near-zero or positive correlations suggest either that the items are measuring unrelated things or that acquiescence is strong enough to offset the expected negative association.
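The cancellation behind balanced wording can be illustrated with a simulation. This is a simplified sketch: responses are treated as continuous rather than as 1-7 categories, and the trait, the person-level agreement tendency `acq`, and the noise level are hypothetical. It compares an eight-item balanced scale with an eight-item scale worded entirely in the positive direction.

```r
set.seed(6)
n <- 5000
k <- 4
trait <- rnorm(n)              # latent construct
acq   <- rexp(n, rate = 2)     # hypothetical agreement tendency (>= 0)
noise <- function() matrix(rnorm(n * k, sd = 0.7), n, k)

pos_a <- 4 + trait + acq + noise()   # positively worded items
pos_b <- 4 + trait + acq + noise()   # a second set of positively worded items
neg   <- 4 - trait + acq + noise()   # negatively worded items
neg_rev <- 8 - neg                   # reverse-score on a 1-7 style scale

balanced   <- rowMeans(cbind(pos_a, neg_rev))   # 4 positive + 4 reversed
unbalanced <- rowMeans(cbind(pos_a, pos_b))     # 8 positive

round(c(cor_acq_balanced   = cor(acq, balanced),
        cor_acq_unbalanced = cor(acq, unbalanced),
        mean_balanced      = mean(balanced),
        mean_unbalanced    = mean(unbalanced)), 3)
```

Under these assumptions the balanced score should be essentially uncorrelated with the agreement tendency, while the all-positive score carries it both in its mean and in its variance.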

36.5.3 Extreme Response Style

Extreme response style (ERS) is the tendency to endorse the scale endpoints (1 or 7 on a 7-point scale) regardless of item content. It is most pronounced in cross-cultural research, where ERS varies substantially across countries and can create spurious cross-national differences. Mitigation strategies include:

  • Using forced-choice or best–worst scaling formats that eliminate the option to cluster at extreme points.
  • Response style corrections that model ERS as a latent nuisance factor alongside the substantive construct, analogous to the common-method factor in Chapter 35.
  • Ranking formats that force differentiation across items.

36.6 Anchoring Vignettes

The measurement branch has a subtler pathology than noisy answers: respondents may use the same response scale to mean different things. Ask consumers to rate their satisfaction “on a 1–5 scale” and a lenient respondent’s 4 may encode the same underlying experience as a demanding respondent’s 2. This interpersonal incomparability—formally, differential item functioning (DIF)—means that differences across respondents partly reflect differences in how they map a latent state onto categories, not differences in the latent state itself.

Anchoring vignettes solve this by adding, alongside the self-assessment, a small set of fixed hypothetical scenarios that every respondent rates on the same scale (King et al. 2004; King and Wand 2007; Hopkins and King 2010). The key move is that the analyst authors the vignettes, so their true level is fixed across respondents by construction. Any variation in how respondents rate the same vignette is therefore pure DIF, a direct reading of each respondent’s idiosyncratic scale use, which can then be netted out of the self-assessment.

Two assumptions identify the correction (King et al. 2004):

  • Response consistency. Each respondent applies the same mapping from latent state to response category when rating the vignettes as when rating themselves.
  • Vignette equivalence. The latent level a given vignette describes is perceived identically by all respondents; only its translation into a category varies.

36.6.1 Nonparametric Correction

The nonparametric estimator requires no distributional assumptions. For each respondent, the analyst recodes the self-assessment \(y_i\) relative to where it falls among that respondent’s ordered vignette ratings \(z_{i1} \le z_{i2} \le \dots \le z_{iJ}\). The rescaled outcome \(C_i\) records the self-rating’s rank position within the respondent’s own vignette thresholds—how good the respondent is relative to anchors whose true level is known—so it is comparable across respondents by construction. Order inconsistencies, where a respondent’s vignette ratings are non-monotone in the vignettes’ known true order, are absorbed as ties.
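For the simplest case, in which a respondent's vignette ratings are strictly increasing in the vignettes' known order, the recoding can be written as a one-line function. The sketch below assumes that case only; the function name `vignette_c` is hypothetical. It places \(C_i\) on the scale \(1, \dots, 2J + 1\): odd values mean the self-rating falls strictly between two vignette ratings (or beyond the extremes), even values mean it ties with a vignette rating.

```r
# C on the 1..(2J + 1) scale for strictly ordered vignette ratings z_1 < ... < z_J.
# `z` holds one respondent's ratings, listed in the vignettes' known true order.
vignette_c <- function(self, z) {
  # Tied or out-of-order vignette ratings need separate handling; this sketch
  # simply refuses them.
  stopifnot(!is.unsorted(z, strictly = TRUE))
  1L + 2L * sum(self > z) + sum(self == z)
}

vignette_c(5, c(3, 4, 5))   # lenient rater
vignette_c(3, c(1, 2, 3))   # harsh rater with the same relative position
vignette_c(1, c(2, 3, 4))   # self-rating below every vignette
```

The first two calls return the same value even though the raw self-ratings differ by two scale points, which is the sense in which \(C_i\) is comparable across respondents.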

36.6.2 Parametric Correction

The parametric estimator embeds the same idea in a likelihood. Self-assessment and vignettes are jointly modeled with an ordered probit in which the cut points vary across respondents as functions of covariates, with a respondent random effect capturing residual scale heterogeneity. Identification improves as the number of vignettes \(J\) grows, but there is a bias–variance trade-off: each additional vignette is measured with error, so adding vignettes trades sharper threshold identification against more measurement noise (Hopkins and King 2010).
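One way to write the model down, as a sketch that follows the general structure in King et al. (2004) with notation chosen here for illustration, is the following. For respondent \(i\) with covariates \(x_i\) (for the latent level) and \(v_i\) (for scale use), and \(K\) response categories, \[ y_i^{*} = x_i'\beta + \eta_i + \varepsilon_i, \qquad \eta_i \sim N(0, \omega^2), \quad \varepsilon_i \sim N(0, 1), \] \[ z_{ij}^{*} = \theta_j + u_{ij}, \qquad u_{ij} \sim N(0, \sigma^2), \quad j = 1, \dots, J, \] \[ \tau_i^{1} = v_i'\gamma^{1}, \qquad \tau_i^{k} = \tau_i^{k-1} + \exp\bigl(v_i'\gamma^{k}\bigr), \quad k = 2, \dots, K - 1, \] with the observed self-assessment equal to \(k\) when \(\tau_i^{k-1} < y_i^{*} \le \tau_i^{k}\), and likewise for each vignette rating using \(z_{ij}^{*}\). The two identifying assumptions appear directly in the notation: \(\theta_j\) carries no \(i\) subscript (vignette equivalence), and the same thresholds \(\tau_i^{k}\) govern both the self-assessment and the vignettes (response consistency). The exponential keeps the thresholds ordered.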

library(tidyverse)

# Three respondents with identical true ability differ only in rating strictness.
demo <- tribble(
  ~respondent, ~self, ~v1, ~v2, ~v3,
  "lenient",      5L,   3L,  4L,  5L,
  "moderate",     4L,   2L,  3L,  4L,
  "harsh",        3L,   1L,  2L,  3L
)

anchor_pos <- function(self, v) {
  v <- sort(v)
  paste0(sum(self > v), " above, ", sum(self == v), " tied")
}

demo %>%
  rowwise() %>%
  mutate(anchored = anchor_pos(self, c(v1, v2, v3))) %>%
  ungroup() %>%
  select(respondent, raw_self = self, v1, v2, v3, anchored)
#> # A tibble: 3 × 6
#>   respondent raw_self    v1    v2    v3 anchored       
#>   <chr>         <int> <int> <int> <int> <chr>          
#> 1 lenient           5     3     4     5 2 above, 1 tied
#> 2 moderate          4     2     3     4 2 above, 1 tied
#> 3 harsh             3     1     2     3 2 above, 1 tied

The raw self-ratings (5, 4, 3) look different, but every respondent rates themselves “2 above, 1 tied” relative to their own vignettes—anchoring reveals the apparent gap was pure rating strictness.

36.7 Experiments

Where a survey observes a latent state, an experiment manufactures variation in a cause and reads off its effect. The experiment’s payoff is causal identification: by randomly assigning the treatment, the analyst makes the treatment statistically independent of every confounder, observed or not, so the difference in mean outcomes estimates the average treatment effect rather than a confounded association. The cost is that the analyst must defend four distinct kinds of validity, manage the heterogeneity that averages conceal, choose a design appropriate to the research question, and respect the ethics of intervening on people.
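The identification argument can be checked by simulation. In the sketch below every number is hypothetical: an unobserved confounder `u` drives the outcome, the true treatment effect is fixed at 2, and the only thing that changes between the two data sets is how treatment is assigned.

```r
set.seed(7)
n   <- 50000
u   <- rnorm(n)     # confounder the analyst never observes
tau <- 2            # true average treatment effect (hypothetical)

# Observational data: units with high u select into treatment.
d_obs <- rbinom(n, 1, plogis(1.5 * u))
y_obs <- 1 + tau * d_obs + 3 * u + rnorm(n)

# Experimental data: a coin flip decides treatment, independent of u.
d_rct <- rbinom(n, 1, 0.5)
y_rct <- 1 + tau * d_rct + 3 * u + rnorm(n)

diff_means <- function(y, d) mean(y[d == 1]) - mean(y[d == 0])
est_obs <- diff_means(y_obs, d_obs)
est_rct <- diff_means(y_rct, d_rct)
round(c(truth = tau, observational = est_obs, randomized = est_rct), 3)
```

The confounder is equally strong in both data sets. Randomization does not remove it; it only breaks its correlation with treatment, which is enough for the difference in means to recover the effect.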

36.7.1 Four Validities

A credible experiment clears four hurdles, summarized in Table 36.2. They are ordered roughly from “is there an effect at all” to “does it mean what we claim” to “does it travel.”

Table 36.2: The four validities an experiment must defend, in roughly increasing order of generality.
| Validity | Question it answers | Characteristic threat |
|---|---|---|
| Statistical conclusion | Is the treatment–outcome covariation real, not noise? | Low power; fishing; violated test assumptions |
| Internal | Is the covariation causal within the study? | Confounding; attrition; failed randomization |
| Construct | Does the operationalization capture the intended concept? | Mono-operation bias; demand effects |
| External | Does the effect generalize beyond this sample and setting? | Non-representative subjects; artificial context |

The four trade off against one another. Tight laboratory control buys internal validity at the expense of external validity (realism), while a field experiment buys realism at the expense of control. The same tension recurs in the practical design levers—cost, control, realism, and ethics—that no single design optimizes simultaneously.

36.7.2 Lab, Field, and Survey Experiments

Three broad settings host experimental research in marketing, each occupying a distinct position in the internal-vs.-external-validity trade-off.

Laboratory experiments (including online survey experiments) offer maximum control. The researcher assigns stimuli, controls timing, eliminates distractors, and can probe mechanism directly with manipulation checks and mediator measures. The price is reduced realism: stimuli are often hypothetical, settings unfamiliar, and participant pools (student samples, crowdsourcing panels) may not represent the population of interest. Most published causal claims in marketing rest on lab or online experiments, which makes their external validity an ongoing empirical question rather than an assumption.

Field experiments embed the treatment in an actual market. A/B tests on email campaigns, retail price promotions deployed to random zip codes, and randomized product launches are field experiments. Internal validity is preserved by randomization; external validity is substantially stronger because the setting, the stakes, and the population are real. The drawbacks are cost, limited measurement (typically only behavioral outcomes, not psychological processes), ethical exposure (firm and consumer stakes are real), and the difficulty of isolating mechanisms. The complementarity of lab and field is explicit: lab experiments establish that an effect can exist and identify the mechanism; field experiments establish that it does exist in the target environment (Gerber and Green 2012).

Survey experiments embed random assignment within an otherwise-standard survey questionnaire, assigning respondents to see different question wordings, orderings, or vignette descriptions. They are cheap and fast but inherit the survey’s weaknesses: the outcome is self-reported, respondents may recognize the experimental structure (demand effects), and stakes are hypothetical. Factorial survey experiments (vignette studies) present each respondent with a profile constructed by randomly varying several attributes simultaneously, allowing estimation of all main effects and interactions within a single study.
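A factorial survey experiment can be sketched in a few lines. The attributes, their levels, the effect sizes, and the single brand-by-warranty interaction below are hypothetical, and each simulated respondent rates one randomly composed profile. Because every attribute is randomized independently, a fully interacted linear model recovers the main effects and interactions.

```r
set.seed(8)
n <- 8000
two_level <- function(levels) {
  factor(sample(levels, n, replace = TRUE), levels = levels)  # first level = reference
}
profiles <- data.frame(
  price    = two_level(c("low", "high")),
  brand    = two_level(c("store", "national")),
  warranty = two_level(c("none", "2yr"))
)

# Assumed rating model with one brand-by-warranty interaction.
profiles$rating <- with(profiles,
  4 - 1.0 * (price == "high") + 0.5 * (brand == "national") +
    0.8 * (warranty == "2yr") +
    0.6 * (brand == "national") * (warranty == "2yr") + rnorm(n))

fit <- lm(rating ~ price * brand * warranty, data = profiles)
est <- coef(fit)
round(est, 2)
```

In the fully interacted model each "main effect" coefficient is the effect at the reference levels of the other attributes, so it should be read together with the interaction terms rather than on its own.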

36.7.3 Demand Effects and Deception

Experimenter demand effects arise when participants infer the researcher’s hypothesis and adjust their responses accordingly—either cooperating (out of helpfulness) or doing the opposite (reactance). Demand effects constitute a construct validity threat: the observed treatment effect partly reflects participants responding to perceived social expectations rather than to the construct the treatment was designed to manipulate.

Diagnosing demand effects requires measuring participants’ awareness of the hypothesis. The Awareness Questionnaire approach (often called the funnel debriefing) asks participants, at the end of the study and in progressively more explicit terms, (a) what they thought the study was about, (b) what they thought the researcher’s hypothesis was, and (c) whether that suspicion influenced their responses. Effect-size estimates from participants who report hypothesis awareness can then be compared to those who do not; a large discrepancy is evidence of demand.

Three design strategies reduce demand effects:

  1. Cover stories. Present the study under a plausible but innocuous description that conceals the hypothesis. The cover story must be convincing enough not to trigger suspicion, and debriefing is required afterward if the cover constitutes deception.
  2. Between-subjects designs. Participants who see only one level of the independent variable cannot infer what the manipulation is comparing, making hypothesis detection harder. This advantage comes at a power cost (see Section 36.7.7).
  3. Behavioral outcomes. Demand effects are primarily a self-report pathology: participants cannot as easily adjust their actual behavior (how many M&Ms they take from a bowl, how much they bid at auction) as they can their stated opinions. When a behavioral outcome is available, it is preferable.

Deception is the deliberate use of false information—a cover story, a confederate, a fictitious scenario—to suppress demand or achieve experimental realism that cannot otherwise be manufactured. Deception is ethically permissible under standard IRB/ethics-review frameworks when (a) the research question cannot be answered by non-deceptive means, (b) the deception causes no lasting harm, and (c) participants are debriefed promptly and fully after data collection. The debriefing must not merely reveal the deception passively: it should explain why the deception was necessary, allow participants to withdraw their data with no penalty, and verify that they understand and are not disturbed. Some ethics frameworks (especially in Europe under GDPR-adjacent guidance) prohibit certain forms of deception outright regardless of debriefing.

36.7.4 Within-Person versus Between-Person Designs

The most fundamental structural decision in an experiment is whether each participant sees one condition (between-subjects or independent-groups design) or multiple conditions (within-subjects or repeated-measures design).

Between-subjects designs assign each participant to exactly one treatment cell. They are clean, immune to carryover, and eliminate demand effects that arise from participants contrasting conditions. Their weakness is statistical: to detect the same effect size, they require substantially more participants, because each person’s response is used exactly once. Variance between people (individual differences) is not isolated and becomes part of the error term.

Within-subjects designs have each participant experience multiple conditions (in counterbalanced or randomized order). By measuring the same person under each condition, the analyst can difference out individual-level constants, isolating the within-person treatment effect from stable individual differences. The statistical gain is large: within-person designs can detect the same effect with far fewer participants because the person serves as their own control. The costs are:

  • Carryover effects. Exposure to condition A may affect responses to condition B (e.g., fatigue, sensitization, contrast effects). Counterbalancing across participants balances simple order effects but does not remove differential carryover, where the effect of A on B differs from the effect of B on A.
  • Demand effects. Participants who experience multiple conditions can deduce the comparison, increasing hypothesis awareness.
  • Practice and fatigue. Performance may change systematically across trials regardless of the treatment.

The trade-off dictates: use within-subjects designs when the effect is expected to be small (power is paramount) and carryover is unlikely; use between-subjects when carryover or demand is a plausible confound.
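The size of the power gain can be made concrete with base R's `power.t.test`. The effect size, outcome SD, and especially the within-person correlation `rho` below are assumed values chosen for illustration; the gain depends heavily on `rho`, which should be estimated from pilot data where possible.

```r
# Assumed inputs (illustrative only)
d     <- 0.30   # standardized effect to detect
sigma <- 1      # SD of the outcome within each condition
rho   <- 0.60   # assumed correlation between a person's two measurements

# Between-subjects: individual differences stay in the error term
n_between <- power.t.test(delta = d, sd = sigma, power = 0.80,
                          type = "two.sample")$n

# Within-subjects: for type = "paired", sd is the SD of difference scores,
# which shrinks as rho grows
sd_diff  <- sigma * sqrt(2 * (1 - rho))
n_within <- power.t.test(delta = d, sd = sd_diff, power = 0.80,
                         type = "paired")$n

print(c(per_group_between = ceiling(n_between),
        total_between     = 2 * ceiling(n_between),
        participants_within = ceiling(n_within)))
```

The comparison ignores carryover and demand: it quantifies only the statistical side of the trade-off.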

Experience sampling methods (ESM), also called ecological momentary assessment (EMA), extend within-person measurement to the natural environment. Participants complete brief surveys multiple times per day, often triggered by a smartphone prompt, capturing states (mood, hunger, context) and behaviors as they occur. ESM is ideal for constructs that are highly state-dependent or that would be biased by retrospective recall (e.g., in-the-moment affect, impulse purchase triggers, media consumption). The design produces a multilevel data structure, with many observations nested within each participant. Analyzing it requires mixed-effects models that partition variance into within-person and between-person components; these models are treated in Chapter 30.
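As a rough sketch of the variance partition, the following simulates a balanced ESM data set and recovers the between-person share of variance (the intraclass correlation) from a one-way ANOVA. The number of participants, the number of prompts, and both variance components are assumptions; a mixed-effects model would estimate the same components and handle unbalanced data, which this shortcut does not.

```r
set.seed(7)

n_person <- 60   # participants (assumed)
k        <- 20   # prompts answered per participant (assumed, balanced)
person   <- factor(rep(seq_len(n_person), each = k))

trait <- rnorm(n_person, sd = 1)[as.integer(person)]  # stable between-person part
state <- rnorm(n_person * k, sd = 1.5)                # momentary within-person part
mood  <- trait + state

# Method-of-moments variance components from the one-way ANOVA table
tab <- anova(lm(mood ~ person))
msb <- tab["person", "Mean Sq"]
msw <- tab["Residuals", "Mean Sq"]

var_between <- (msb - msw) / k
var_within  <- msw
icc <- var_between / (var_between + var_within)

print(round(c(between = var_between, within = var_within, icc = icc), 3))
```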

36.7.5 Conjoint and Discrete-Choice Experiments

When the research question concerns how multiple product attributes jointly determine consumer preference, the workhorse tool is the conjoint experiment. Respondents are shown a set of product profiles described by combinations of attributes (brand, price, warranty, green certification, etc.) and either rate each profile or choose among them. Because the attribute combinations are researcher-designed and orthogonally or near-orthogonally varied, the analyst can recover each attribute’s part-worth utility—its marginal contribution to overall preference—as if running a separate experiment on each attribute while holding others constant.

Two formulations dominate:

  1. Rating-based conjoint (traditional full-profile conjoint). Each profile is rated on a preference scale. Part-worths are recovered by regressing ratings on attribute dummies. Simple to implement but subject to rating-scale inconsistencies and demand effects if the attribute structure is transparent.

  2. Choice-based conjoint (CBC). Respondents choose their preferred option from a set of competing profiles, often including an “opt-out” option. Choice is more realistic than rating because it mirrors actual purchase decisions, and the opt-out lets respondents decline every profile, as they can in a real market. Part-worths are estimated by conditional logit or mixed logit; the latter allows heterogeneous preferences across respondents. The market simulation step maps part-worths to market-share predictions under alternative product configurations, making CBC the standard tool for new-product development and pricing research.
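A minimal sketch of choice-based estimation follows, under simplifying assumptions: two profiles per task, no opt-out, homogeneous preferences, and made-up attribute names and part-worths. With exactly two alternatives, the conditional logit reduces to a binary logit on attribute differences without an intercept, which base R can fit; real CBC studies with larger choice sets or an opt-out need a dedicated conditional or mixed logit routine.

```r
set.seed(11)
n_tasks <- 3000   # choice tasks pooled over respondents (assumed)

# Two competing profiles per task, attributes randomized independently
draw_profiles <- function(n) data.frame(
  price    = sample(c(0, 1), n, replace = TRUE),  # 1 = high price
  warranty = sample(c(0, 1), n, replace = TRUE),  # 1 = extended warranty
  green    = sample(c(0, 1), n, replace = TRUE))  # 1 = green certification
A <- draw_profiles(n_tasks)
B <- draw_profiles(n_tasks)

true_pw  <- c(price = -1.0, warranty = 0.5, green = 0.8)  # assumed part-worths
x_diff   <- as.matrix(A - B)                              # attribute differences
choose_A <- rbinom(n_tasks, 1, plogis(drop(x_diff %*% true_pw)))

# Binary logit on differences, no intercept (no position bias assumed)
fit <- glm(choose_A ~ 0 + x_diff, family = binomial)
part_worths <- setNames(coef(fit), names(true_pw))
print(round(part_worths, 2))

# Market simulation: predicted share of a hypothetical new profile
# (high price, warranty, green) against a bare low-price competitor
new_minus_old <- c(price = 1, warranty = 1, green = 1)
share_new <- plogis(sum(part_worths * new_minus_old))
print(round(share_new, 3))
```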

A critical design decision in any conjoint study is orthogonality: the attribute levels shown to any respondent should be uncorrelated across profiles, so that part-worths are identified without confounding. Fractional factorial designs achieve near-orthogonality with far fewer profiles than full-factorial enumeration would require, trading off some high-order interaction estimation for feasibility.
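The trade-off can be seen in the smallest possible case: three two-level attributes coded as -1/+1. The half fraction below is a textbook construction (defining relation I = ABC) rather than a design taken from this chapter. It keeps the three main effects mutually orthogonal in four profiles instead of eight, at the price of aliasing each main effect with a two-way interaction.

```r
# Full factorial: 2 x 2 x 2 = 8 profiles, attributes coded -1 / +1
full <- expand.grid(a = c(-1, 1), b = c(-1, 1), c = c(-1, 1))

# Half fraction: keep the 4 profiles where c equals a * b
half <- full[full$c == full$a * full$b, ]
print(half)

# Cross-product matrix: zeros off the diagonal mean the main-effect
# columns are orthogonal in the reduced design
xtx <- crossprod(as.matrix(half))
print(xtx)

# The cost: the column for c is identical to the a:b interaction column,
# so the two effects cannot be separated in this fraction
aliased <- all(half$c == half$a * half$b)
print(aliased)
```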

Best–worst scaling (BWS), or MaxDiff, is a simpler cousin: respondents choose the best and worst item from a set, producing a pair of anchored choices that is more discriminating than ratings and avoids the extreme response style biases of Likert-type scales. BWS produces individual-level utility scores and is increasingly preferred in applied research because it forces genuine differentiation.
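The simplest BWS summary is a count score: the number of times an item is chosen as best minus the number of times it is chosen as worst. The toy responses and item names below are invented for illustration, and the sketch assumes every item appears equally often across choice sets; otherwise the counts should be divided by the number of appearances.

```r
# Hypothetical responses: one row per completed choice set
bws <- data.frame(
  best  = c("price", "brand", "price", "warranty", "price", "brand"),
  worst = c("color", "color", "warranty", "color", "brand", "color"))

items   <- c("brand", "color", "price", "warranty")
best_n  <- table(factor(bws$best,  levels = items))
worst_n <- table(factor(bws$worst, levels = items))

# Best-minus-worst count score per item
score <- setNames(as.vector(best_n - worst_n), items)
print(sort(score, decreasing = TRUE))
#>    price    brand warranty    color
#>        3        1        0       -4
```

Individual-level utilities are usually obtained by fitting a logit model to the same choices; the count score is a quick first look.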

36.7.6 Heterogeneity and Mechanisms

A single number, the average treatment effect (ATE), can be an artifact. If the effect is positive for one segment and negative for another, the average may be near zero, near either extreme, or anywhere between, depending only on segment mix. Writing the conditional average treatment effect as \(\tau(x) = \mathbb{E}[Y(1) - Y(0)\mid X=x]\), the reported ATE is the mix-weighted average \(\int \tau(x)\,dF(x)\), which is silent about the sign and size of \(\tau(x)\) for any particular \(x\). The same caution that motivated segmentation in intention-based forecasting earlier in this chapter applies to experiments: heterogeneous effects demand that the analyst report the distribution of \(\tau(x)\), not only its mean, and that any claimed mechanism (the causal pathway through which the treatment acts) be argued rather than assumed.
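A two-segment example makes the dependence on mix explicit. The segment labels, effects, and shares are hypothetical; the computation is just the discrete version of \(\int \tau(x)\,dF(x)\).

```r
# Hypothetical segment-level treatment effects tau(x)
tau <- c(loyal = 0.40, switcher = -0.20)

# Three hypothetical populations that differ only in segment shares
mixes <- list(mostly_loyal    = c(0.8, 0.2),
              balanced        = c(1/3, 2/3),
              mostly_switcher = c(0.2, 0.8))

# ATE = sum over segments of share * tau
ate <- sapply(mixes, function(w) sum(w * tau))
print(round(ate, 2))
#>    mostly_loyal        balanced mostly_switcher
#>            0.28            0.00           -0.08
```

The segment effects never change, yet the ATE is positive, zero, or negative depending on who is in the sample.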

Mediation analyses are the standard vehicle for mechanistic claims, but they carry assumptions of their own. A significant indirect effect (\(a \times b\) in the Baron–Kenny path notation) identifies the mechanism only if (a) there is no unmeasured common cause of the mediator and the outcome, and (b) no confounder of the mediator–outcome relation is itself affected by the treatment. Randomizing the treatment secures neither condition, because the mediator is measured rather than assigned. These assumptions are untestable from data on treatments, mediators, and outcomes alone and motivate the use of sequential experimental designs that manipulate both the treatment and the putative mediator in separate studies to establish causal-chain evidence.
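The first assumption is easy to violate without noticing. In the simulated sketch below, all coefficients are assumed: the treatment is randomized and moves the mediator, the mediator has no effect on the outcome, and an unmeasured variable `u` drives both. The product-of-coefficients estimate nonetheless reports a clearly positive indirect effect.

```r
set.seed(5)
n <- 5000

treat <- rbinom(n, 1, 0.5)           # randomized treatment
u     <- rnorm(n)                    # unmeasured common cause of M and Y
m     <- 0.5 * treat + u + rnorm(n)  # treatment moves the mediator
y     <- 0.0 * m + u + rnorm(n)      # mediator has NO causal effect on outcome

# Product-of-coefficients estimate of the indirect effect
a <- coef(lm(m ~ treat))["treat"]
b <- coef(lm(y ~ m + treat))["m"]
indirect <- unname(a * b)

# True indirect effect is 0; the estimate is biased away from zero by u
print(round(c(a = unname(a), b = unname(b), indirect = indirect), 3))
```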

36.7.7 Power Analysis

Every experiment must answer the question: how many participants are needed to detect the effect of interest if it exists? Underpowered studies waste resources and expose participants to risk without a realistic chance of a conclusive result, a direct violation of the Reduce principle from the 3Rs framework (Section 36.7.9). They also produce inflated effect-size estimates when they do yield significant results (the “winner’s curse”), because only estimates that noise has pushed away from zero clear the significance threshold.
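The winner's curse can be reproduced by simulation. The true effect, group size, and number of simulated studies below are assumptions chosen to represent a badly underpowered design.

```r
set.seed(42)

true_d <- 0.20   # assumed true standardized effect
n      <- 30     # participants per group (underpowered for this effect)
nsim   <- 5000

sims <- replicate(nsim, {
  g0 <- rnorm(n, mean = 0)
  g1 <- rnorm(n, mean = true_d)
  c(est = mean(g1) - mean(g0),
    p   = t.test(g1, g0, var.equal = TRUE)$p.value)
})

sig <- sims["p", ] < 0.05

power_hat    <- mean(sig)               # share of studies reaching p < .05
mean_all     <- mean(sims["est", ])     # unbiased across ALL studies
mean_sig     <- mean(sims["est", sig])  # conditional on significance
print(round(c(power = power_hat, mean_all = mean_all, mean_sig = mean_sig), 3))
```

Averaged over all simulated studies the estimate is centered on the true effect; averaged over the significant ones alone it is far larger, which is what a reader of a selectively published literature sees.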

Conventional power analysis requires three inputs and produces a sample size:

  • \(\delta\): the effect size, usually Cohen’s \(d\) for means or \(f^2\) for regression,
  • \(\alpha\): the Type I error rate (usually 0.05), and
  • \(1-\beta\): the desired power (usually 0.80 or 0.90).

The mechanical calculation is available in R (pwr package) but is only as good as its inputs. The critical debate is about \(\delta\): what effect size should power calculations assume?

The smallest effect size of interest (SESOI) approach (Lakens, Scheel, and Isager 2018) replaces “what effect size will I find?” with “what is the smallest effect that would be theoretically or practically meaningful?” The SESOI is set on substantive grounds before data collection—for example, a standardized mean difference smaller than \(d = 0.2\) may be too small to justify the cost of a marketing intervention—and the study is powered to detect that threshold. This framing avoids the common error of powering studies on prior literature effect sizes that are themselves upward-biased by publication selection.

Simulation-based power analysis is more flexible than analytic formulas when the design is complex (multilevel, repeated-measures, multiple endpoints, or non-normal outcomes). The simulation encodes the assumed data-generating process, applies the planned analysis, and counts the fraction of simulated studies that return \(p < \alpha\).

Code
set.seed(2024)

# Simulation-based power for a between-subjects t-test
# SESOI: d = 0.30 (small but practically meaningful)

sim_power <- function(n_per_group, d, alpha = 0.05, nsim = 2000) {
  sig <- replicate(nsim, {
    group0 <- rnorm(n_per_group, mean = 0,   sd = 1)
    group1 <- rnorm(n_per_group, mean = d,   sd = 1)
    t.test(group1, group0, var.equal = TRUE)$p.value < alpha
  })
  mean(sig)
}

# Sweep over sample sizes
ns    <- c(50, 100, 150, 200, 300)
pows  <- sapply(ns, sim_power, d = 0.30)

results <- data.frame(n_per_group = ns, power = round(pows, 3))
print(results)
#>   n_per_group power
#> 1          50 0.312
#> 2         100 0.552
#> 3         150 0.745
#> 4         200 0.842
#> 5         300 0.954
cat("\nTarget: 80% power for d = 0.30 requires ~", ns[which(pows >= 0.80)[1]],
    "per group\n")
#> 
#> Target: 80% power for d = 0.30 requires ~ 200 per group

The simulation makes the researcher’s assumptions explicit and auditable. Standard practice is to pre-register the power analysis alongside the hypotheses, so the sample-size decision is on record before any data are seen. Reporting power analysis in published work has increased substantially in recent years—from roughly 9.5% of empirical psychology articles in 2015–2016 to 30% by 2020–2021 across 24 APA-listed journals (Adamkovičová, Čajča, and Vašíček 2024).
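For the simple two-group design, the simulated grid can be cross-checked against the closed-form calculation in base R's `power.t.test`. This is a sanity check under the same assumptions as the simulation (equal group sizes, unit SD, two-sided test), not a substitute for simulation in more complex designs.

```r
# Analytic sample size for d = 0.30, alpha = 0.05, power = 0.80
analytic <- power.t.test(delta = 0.30, sd = 1, sig.level = 0.05,
                         power = 0.80, type = "two.sample",
                         alternative = "two.sided")

n_required <- ceiling(analytic$n)   # round up to whole participants
cat("Analytic n per group:", n_required, "\n")

# Analytic power at the grid points used in the simulation
grid  <- c(50, 100, 150, 200, 300)
power_grid <- sapply(grid, function(n)
  power.t.test(n = n, delta = 0.30, sd = 1, sig.level = 0.05)$power)
print(data.frame(n_per_group = grid, power = round(power_grid, 3)))
```

Because the simulation grid moves in steps of 50, the first grid point above 80% power overstates the requirement; the analytic answer falls between the 150 and 200 grid points.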

36.7.8 Sequential Testing

A limitation of the conventional null-hypothesis testing framework is that the \(\alpha\) level is calibrated for a fixed sample: collecting data until significance is reached, then stopping, inflates the true Type I error rate far above \(\alpha\). Sequential testing addresses this by specifying in advance when interim analyses will be conducted and how much \(\alpha\) each interim analysis “spends.”
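The inflation is easy to demonstrate. The sketch below assumes a true null effect and an analyst who tests after every 20 participants per group, up to 200, stopping at the first \(p < 0.05\); the look schedule and number of simulations are illustrative choices.

```r
set.seed(123)

# One simulated study under the null with repeated, uncorrected looks
peek_once <- function(looks = seq(20, 200, by = 20), alpha = 0.05) {
  g0 <- rnorm(max(looks))
  g1 <- rnorm(max(looks))   # same distribution: the null is true
  # TRUE if ANY interim test is "significant"
  any(sapply(looks, function(n)
    t.test(g1[1:n], g0[1:n], var.equal = TRUE)$p.value < alpha))
}

false_positive_rate <- mean(replicate(2000, peek_once()))
cat("Type I error with 10 uncorrected looks:",
    round(false_positive_rate, 3), "\n")
```

Each individual test is valid at 5%; it is the option to stop at whichever look happens to be significant that pushes the overall rate well above the nominal level.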

Two approaches are in common use:

  1. Group sequential designs with formal \(\alpha\)-spending functions (e.g., O’Brien–Fleming, Pocock bounds). The analyst specifies the number of interim looks and a monotone spending function that allocates cumulative \(\alpha\) across looks; the critical value at each look is adjusted upward so that the overall Type I error remains at \(\alpha\). These designs are standard in clinical trials and increasingly used in A/B testing platforms.

  2. Sequential probability ratio tests (SPRT) and anytime-valid inference. The SPRT updates a likelihood ratio after every new observation and stops as soon as the ratio crosses a pre-specified boundary, at which point the evidence is sufficiently strong for one hypothesis or the other. Recent extensions provide e-values and anytime-valid \(p\)-values that remain valid under optional stopping without adjustment, enabling “keep collecting data until we decide” designs with formal error control (Ramdas et al. 2023).
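A minimal sketch of Wald's SPRT for a single stream of binary outcomes (for example, conversions in one arm) follows. The function name `sprt`, the two conversion rates, and the error rates are assumptions for illustration; it tests one simple hypothesis against another and is not the two-arm procedure a commercial A/B platform would run.

```r
# Wald SPRT for Bernoulli data: H0: p = p0 versus H1: p = p1
sprt <- function(x, p0, p1, alpha = 0.05, beta = 0.20) {
  upper <- log((1 - beta) / alpha)   # crossing above: accept H1
  lower <- log(beta / (1 - alpha))   # crossing below: accept H0

  # Cumulative log-likelihood ratio after each observation
  llr <- cumsum(x * log(p1 / p0) + (1 - x) * log((1 - p1) / (1 - p0)))

  hit <- which(llr >= upper | llr <= lower)[1]
  if (is.na(hit)) {
    return(list(decision = "continue sampling", n = length(x)))
  }
  list(decision = if (llr[hit] >= upper) "accept H1" else "accept H0",
       n = hit)
}

set.seed(9)
visits <- rbinom(5000, 1, 0.12)   # simulated stream, true rate equals p1
res <- sprt(visits, p0 = 0.10, p1 = 0.12)
cat("Decision:", res$decision, "after", res$n, "observations\n")
```

The boundaries use Wald's approximations, so the realized error rates are close to, rather than exactly, the nominal \(\alpha\) and \(\beta\).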

For marketing A/B testing with continuous data collection, the practical alternative is the always-valid inference framework, implemented in commercial platforms (e.g., Optimizely’s Stats Engine) and available in R. The analyst sets a minimum detectable effect and a maximum sample size, and the platform signals when the boundary is crossed in either direction or when the maximum is reached without crossing.

36.7.9 Ethics: The 3Rs

Experiments intervene on living subjects, and the governing ethical framework is the 3Rs of Russell and Burch (1959):2 Replace an invasive procedure with a less invasive (or non-animal, or natural) alternative; Refine procedures to reduce harm and distress; and Reduce the number of participants to the minimum consistent with a valid answer. The Reduce principle has a statistical edge: the minimum sample is set by the power calculation, so an underpowered study is not only wasteful but also unethical, exposing participants to risk without a credible chance of a conclusive result. The 3Rs underlie modern institutional review board (IRB) and research ethics committee requirements; any study involving human participants must secure approval before data collection and adhere to the approved protocol.

36.8 Open Science

The reproducibility crisis of the 2010s—in which a substantial fraction of landmark behavioral science findings failed to replicate in independent studies—forced a reckoning with the research practices that inflated false-positive rates. The result was an open-science infrastructure that is now the expected standard for credible empirical work.

36.8.1 Pre-Registration

Pre-registration time-stamps the hypotheses, design, and analysis plan before data collection begins, in a public registry (AsPredicted, OSF Registries). By separating the confirmatory analysis from the exploratory analysis, pre-registration prevents the two most damaging forms of analytical flexibility:

  • Hypothesizing after the results are known (HARKing): presenting an exploratory finding as if it were a pre-specified hypothesis, inflating the apparent prior probability that any given test was of theoretical interest.
  • Outcome switching: testing multiple outcomes and reporting only those that reached significance.

Pre-registration does not make studies immune to problems: it cannot ensure that the theory is correct, the manipulation is valid, or the measure is reliable. Simmons, Nelson, and Simonsohn (2021) argue, by analogy to random assignment, that pre-registration is “a game changer” precisely because it is narrow: like random assignment, which controls confounds but cannot guarantee external validity, pre-registration controls analytical flexibility but cannot substitute for good measurement and adequate power. Non-pre-registered studies conducted transparently can still be credible, and a pre-registered study with a flawed design does not become credible merely because it carries a time-stamp.

The pre-registration should specify at minimum: the primary hypothesis and its directionality, the key measures and their operationalizations, the sample size and how it was determined, the statistical test and any covariates included, and the exclusion criteria applied before analysis. Deviations from the plan in the published paper should be disclosed and justified.

36.8.2 The \(p\)-Curve

The \(p\)-curve is the distribution of statistically significant \(p\)-values across a set of studies testing the same effect (Simonsohn, Nelson, and Simmons 2014). Under a true effect, that distribution is right-skewed: very small \(p\)-values outnumber those near 0.05, because a true effect makes extreme test statistics likely. Under a null effect with selective reporting alone, it is flat, because \(p\)-values are uniform under the null and the significance filter merely truncates them at 0.05. Under a null effect with \(p\)-hacking, it is left-skewed, with a pile-up just below 0.05, because analysts stop adjusting the analysis as soon as the result clears the threshold. Inspecting the curve thus diagnoses whether a literature reflects real effects or selective analysis.

A right-skewed \(p\)-curve is the signature of a true effect; a flat curve signals a lack of evidential value, and a left-skewed curve is the signature of \(p\)-hacking. Neither of the latter two patterns means that no individual study is valid; they mean the literature as collected cannot be trusted at face value.
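The first two shapes can be reproduced with a short simulation. The effect size, group size, and number of studies are assumptions, and the sketch covers only a true effect and a pure null with a significance filter; it does not simulate \(p\)-hacking itself.

```r
set.seed(2)

# Simulate two-group studies and keep only the significant p-values
significant_p <- function(d, n = 50, nsim = 4000) {
  p <- replicate(nsim,
    t.test(rnorm(n, mean = d), rnorm(n), var.equal = TRUE)$p.value)
  p[p < 0.05]
}

# Share of significant p-values in each 0.01-wide bin
bins <- seq(0, 0.05, by = 0.01)
p_curve <- function(p) prop.table(table(cut(p, bins)))

curve_true <- p_curve(significant_p(d = 0.5))  # true effect: right-skewed
curve_null <- p_curve(significant_p(d = 0.0))  # null + filter: roughly flat

print(round(rbind(true_effect = curve_true, null_effect = curve_null), 2))
```

Under the true effect most of the mass sits in the lowest bin; under the null each bin holds roughly a fifth of the significant results, up to simulation noise.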

The \(p\)-curve also supports effect-size estimation: the underlying effect can be estimated from the distribution of significant \(p\)-values alone, without access to unpublished studies (the file drawer). Because the estimator models the selection on significance explicitly, it corrects the inflation that afflicts the naive mean of published estimates; it remains sensitive, however, to \(p\)-hacking and to heterogeneity in the underlying effects.

36.8.3 Specification Curve Analysis

Pre-registration disciplines a single analysis. Specification curve analysis (SCA) asks what happens across all defensible analyses (Simonsohn, Simmons, and Nelson 2020). The researcher enumerates every combination of analysis choices that could reasonably be defended (different covariate sets, outcome transformations, sample exclusions, model families) and estimates the effect for each specification. The output is a graphical “specification curve” that plots every estimate sorted by magnitude, annotated with the choices that produced it.

SCA has two uses. Descriptively, it reveals whether the focal result is robust (the effect has the same sign and is significant across most of the curve) or fragile (concentrated in a narrow slice of specifications). Inferentially, a permutation test shuffles treatment assignment to impose the null, recomputes the curve for each shuffle, and asks whether the observed median estimate is extreme relative to the permutation distribution; this tests whether the pattern of results is consistent with a null effect across the universe of reasonable analyses.
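The descriptive step can be sketched in base R without a dedicated package. The data, the covariate sets, and the exclusion rule (`fast`, a hypothetical speeder flag) are all invented for illustration; a real analysis would enumerate the choices that are defensible for the study at hand and add the permutation test.

```r
set.seed(8)
n <- 400

# Simulated study with an assumed true treatment effect of 0.3
dat <- data.frame(treat  = rbinom(n, 1, 0.5),
                  age    = rnorm(n, 40, 10),
                  income = rnorm(n, 50, 15),
                  fast   = rbinom(n, 1, 0.1))   # hypothetical speeder flag
dat$y <- 0.3 * dat$treat + 0.02 * dat$age + rnorm(n)

# Enumerate specifications: 4 covariate sets x 2 exclusion rules = 8
specs <- expand.grid(covars = c("1", "age", "income", "age + income"),
                     drop_speeders = c(FALSE, TRUE),
                     stringsAsFactors = FALSE)

# Estimate the treatment coefficient under each specification
specs$estimate <- unname(mapply(function(cv, drop) {
  d <- if (drop) dat[dat$fast == 0, ] else dat
  coef(lm(as.formula(paste("y ~ treat +", cv)), data = d))[["treat"]]
}, specs$covars, specs$drop_speeders))

# The specification curve: estimates sorted by magnitude
specs <- specs[order(specs$estimate), ]
print(specs, row.names = FALSE)
cat("Median estimate across specifications:",
    round(median(specs$estimate), 3), "\n")
```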

SCA is implemented in R (the specr package) and Stata and is increasingly expected for papers where specification choices are numerous or contentious. It is the multivariate generalization of a sensitivity analysis: instead of varying one choice at a time, it varies all choices simultaneously and characterizes the full consequence space.

36.8.4 Equivalence Testing

Standard null-hypothesis tests answer “is there any effect?” Equivalence testing answers “is the effect smaller than a meaningful threshold?” The two one-sided tests (TOST) procedure (Lakens 2017) tests the null hypothesis that the effect is at least as large as the SESOI in absolute value. Both one-sided tests must reject at \(\alpha\) for the analyst to conclude equivalence to zero within the equivalence bounds. Formally, for true effect \(\delta\) and equivalence bounds \([-\Delta, +\Delta]\):

\[ H_{0,1}: \delta \le -\Delta \qquad H_{0,2}: \delta \ge +\Delta. \]

Rejecting both \(H_{0,1}\) and \(H_{0,2}\) establishes that the true effect falls within \((-\Delta, +\Delta)\)—practically equivalent to zero by the researcher’s pre-specified standard. TOST is the appropriate analysis when a researcher wants to claim that a proposed replication failed or that an intervention has no practically meaningful effect. It is not the same as “failing to reject the null”: a non-significant standard \(t\)-test is uninformative about whether the effect is small; TOST is informative precisely because it has power against the specific alternative that the effect exceeds \(\Delta\).

Code
# TOST equivalence test for a between-subjects comparison
# H0: |effect| >= delta (effect at or beyond either bound)
# Equivalence bound: d = 0.20 (effect below this is "practically zero")
set.seed(99)

n      <- 200    # participants per group
group0 <- rnorm(n, mean = 0.00, sd = 1)
group1 <- rnorm(n, mean = 0.05, sd = 1)   # true d = 0.05, near zero

delta  <- 0.20   # equivalence bound in raw units (equals d because SD = 1)

# Two one-sided independent-samples t-tests on mean(group1) - mean(group0).
# The groups are separate people, so they are tested as two samples rather
# than as paired differences.
t_lower <- t.test(group1, group0, mu = -delta, alternative = "greater",
                  var.equal = TRUE)
t_upper <- t.test(group1, group0, mu =  delta, alternative = "less",
                  var.equal = TRUE)

cat("TOST lower bound p-value:", round(t_lower$p.value, 4), "\n")
cat("TOST upper bound p-value:", round(t_upper$p.value, 4), "\n")

# Equivalence requires BOTH one-sided tests to reject.
# With 200 per group the TOST has limited power against this bound, so
# failing to establish equivalence is a plausible outcome even though the
# true effect is small.
cat("Equivalence established (both p < .05):",
    t_lower$p.value < 0.05 && t_upper$p.value < 0.05, "\n")

36.8.5 Registered Reports

The most radical open-science reform is the registered report format, now offered by over 300 journals. In a registered report, the introduction, literature review, and methods are peer-reviewed before data collection; acceptance is provisional on the approved methodology. After data collection, the results and discussion are added and the paper is published regardless of whether the outcome is significant, null, or mixed—provided the registered methods were followed.

Registered reports eliminate publication bias at the source by decoupling acceptance from results. They also incentivize well-powered designs with valid measures, because reviewers evaluate only the pre-data sections. Meta-analytic evidence comparing registered versus non-registered studies in the same journals finds that registered reports show substantially smaller effect sizes on average, consistent with the hypothesis that the larger effects in the conventional literature are upward-biased by selective publication. For marketing researchers, a growing number of journals in the field (including the Journal of Marketing Research and Journal of Consumer Research) now accept registered reports.

36.8.6 Open Data, Materials, and Code

Pre-registration governs what was planned; open data, materials, and code govern what was done. The FAIR principles (Findable, Accessible, Interoperable, Reusable) provide an operational standard for data sharing. Platforms include:

  • Open Science Framework (OSF): the dominant repository for behavioral science, supporting pre-registration, data hosting, and materials sharing under versioned DOIs.
  • ResearchBox: a packaging system that bundles data, analysis code, and materials in a single citable artifact, making exact replication straightforward.
  • AsPredicted: focused specifically on pre-registration with a minimal interface.

Open materials mean that a reviewer or reader can audit whether the stimuli, measures, and procedures match the paper’s description. Open analysis code means that every reported number traces to a reproducible computation. Together, these practices address the “computational reproducibility” dimension of the crisis: the finding that a substantial fraction of published statistical results cannot be reproduced from the reported data even without any scientific dispute about the methods.

36.9 Key Takeaways

  • Surveys and experiments are both instruments for recovering an unobserved quantity; surveys fail through error (Equation 36.2) and experiments through threats to validity (Table 36.2).
  • Measurement is reactive: asking about purchase intentions strengthens the intention–behavior link by roughly 58%, so intentions surveys are mild interventions, not passive forecasts (Chandon, Morwitz, and Reinartz 2005).
  • Total survey error separates measurement error (the answers) from representation error (the people), and bias from variance; naming the dominant source dictates the fix (Groves and Lyberg 2010).
  • Probability designs license design-based inference through the Horvitz–Thompson estimator (Equation 36.3); non-probability and non-response designs shift the burden from design to modeling the inclusion mechanism.
  • Online panels are not interchangeable: Prolific and CloudResearch substantially outperform MTurk on composite quality metrics (67.9%/61.9% vs. 26.4% high-quality participants), and traditional MTurk filters—95% HIT acceptance, 100+ HITs—no longer recover historical quality (Douglas, Ewell, and Brauer 2023; Peer et al. 2022; Kennedy et al. 2021).
  • Satisficing (straight-lining, acquiescence, speeding) degrades measurement quality even on high-quality platforms; a multi-indicator quality pipeline (SD flags, time filters, open-ended checks) should be applied before analysis.
  • Response styles—social desirability, acquiescence bias, extreme response tendency—introduce systematic variance unrelated to the construct; balanced item wording and design-based remedies (anonymity, RRT, behavioral outcomes) are preferred over post-hoc statistical corrections.
  • Anchoring vignettes purge interpersonal incomparability by netting out differential item functioning, identified by response consistency and vignette equivalence (King et al. 2004).
  • Causal claims rest on four validities; demand effects threaten construct validity and are mitigated by cover stories, between-subjects designs, and behavioral outcomes.
  • Conjoint experiments (especially choice-based CBC) recover attribute part-worths under realistic competitive choice contexts; best–worst scaling avoids extreme response biases in preference measurement.
  • Within-subjects designs dramatically increase power by removing individual differences from the error term; experience sampling extends within-person measurement to naturalistic settings via repeated smartphone surveys.
  • Power analysis should target the smallest effect size of interest (SESOI), not the expected effect; simulation-based power makes distributional assumptions explicit (Lakens, Scheel, and Isager 2018).
  • Sequential testing with \(\alpha\)-spending functions or anytime-valid inference allows data collection to stop early when evidence is sufficient, without inflating Type I error.
  • Pre-registration eliminates HARKing and outcome switching; specification curve analysis characterizes robustness across all defensible specifications (Simonsohn, Simmons, and Nelson 2020); TOST establishes practical equivalence to zero (Lakens 2017); and registered reports eliminate publication bias at the acceptance stage.

36.10 Further Reading

For total survey error the canonical reference is Groves and Lyberg (2010); Krosnick (1991) is the foundational treatment of satisficing and cognitive aspects of survey response. Peer et al. (2022) and Douglas, Ewell, and Brauer (2023) are the current benchmarks for online panel quality comparisons; Kennedy et al. (2021) documents the MTurk fraud timeline. For experiment design, Gerber and Green (2012) is the definitive treatment of field experiments; Shadish, Cook, and Campbell (2002) covers the four-validity framework in depth. The open-science toolkit is developed in Simonsohn, Nelson, and Simmons (2014) (\(p\)-curve), Simonsohn, Simmons, and Nelson (2020) (specification curve), Lakens (2017) and Lakens, Scheel, and Isager (2018) (equivalence testing and SESOI), and Simmons, Nelson, and Simonsohn (2021) (pre-registration as game changer). Conjoint methods are treated comprehensively in Orme (2020) and in the academic review of Green and Srinivasan (1990). For within-person and ESM designs, Bolger and Laurenceau (2013) provides the multilevel framework. The measurement scales literature that underlies survey item development is treated in Chapter 35.

Adamkovičová, Marta, Viktor Čajča, and Jiří Vašíček. 2024. “Power Analysis Reporting in Psychological Research: A Systematic Review.” Advances in Methods and Practices in Psychological Science. https://doi.org/10.1177/25152459241240722.
Bolger, Niall, and Jean-Philippe Laurenceau. 2013. Intensive Longitudinal Methods: An Introduction to Experience Sampling and Diary Methods. New York: Guilford Press.
Chandon, Pierre, Vicki G. Morwitz, and Werner J. Reinartz. 2005. “Do Intentions Really Predict Behavior? Self-Generated Validity Effects in Survey Research.” Journal of Marketing 69 (2): 1–14. https://doi.org/10.1509/jmkg.69.2.1.60755.
Crowne, Douglas P., and David Marlowe. 1960. “A New Scale of Social Desirability Independent of Psychopathology.” Journal of Consulting Psychology 24 (4): 349–54. https://doi.org/10.1037/h0047358.
Douglas, Brent D., Patrick J. Ewell, and Markus Brauer. 2023. “Data Quality in Online Human-Subjects Research: Comparisons Between MTurk, Prolific, CloudResearch, Qualtrics, and SONA.” PLOS ONE 18 (3): e0279720. https://doi.org/10.1371/journal.pone.0279720.
Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: W. W. Norton & Company.
Green, Paul E., and V. Srinivasan. 1990. “Conjoint Analysis in Marketing: New Developments with Implications for Research and Practice.” Journal of Marketing 54 (4): 3–19. https://doi.org/10.2307/1251756.
Groves, Robert M., and Lars Lyberg. 2010. “Total Survey Error: Past, Present, and Future.” Public Opinion Quarterly 74 (5): 849–79. https://doi.org/10.1093/poq/nfq065.
Hopkins, Daniel J., and Gary King. 2010. “Improving Anchoring Vignettes: Designing Surveys to Correct Interpersonal Incomparability.” Public Opinion Quarterly 74 (2): 201–22. https://doi.org/10.1093/poq/nfq011.
Kennedy, Ryan, Scott Clifford, Tyler Burleigh, Philip D. Waggoner, Ryan Jewell, and John H. Aldrich. 2021. “The Shape of and Solutions to the MTurk Quality Crisis.” Political Science Research and Methods 9 (1): 38–54. https://doi.org/10.1017/psrm.2020.6.
King, Gary, Christopher J. L. Murray, Joshua A. Salomon, and Ajay Tandon. 2004. “Enhancing the Validity and Cross-Cultural Comparability of Measurement in Survey Research.” American Political Science Review 98 (1): 191–207. https://doi.org/10.1017/s000305540400108x.
King, Gary, and Jonathan Wand. 2007. “Comparing Incomparable Survey Responses: Evaluating and Selecting Anchoring Vignettes.” Political Analysis 15 (1): 46–66. https://doi.org/10.1093/pan/mpl011.
Krosnick, Jon A. 1991. “Response Strategies for Coping with the Cognitive Demands of Attitude Measures in Surveys.” Applied Cognitive Psychology 5 (3): 213–36. https://doi.org/10.1002/acp.2350050305.
Lakens, Daniël. 2017. “Equivalence Tests: A Practical Primer for \(t\) Tests, Correlations, and Meta-Analyses.” Social Psychological and Personality Science 8 (4): 355–62. https://doi.org/10.1177/1948550617697177.
Lakens, Daniël, Anne M. Scheel, and Peder M. Isager. 2018. “Equivalence Testing for Psychological Research: A Tutorial.” Advances in Methods and Practices in Psychological Science 1 (2): 259–69. https://doi.org/10.1177/2515245918770963.
Morwitz, Vicki G., and David Schmittlein. 1992. “Using Segmentation to Improve Sales Forecasts Based on Purchase Intent: Which ‘Intenders’ Actually Buy?” Journal of Marketing Research 29 (4): 391–405. https://doi.org/10.2307/3172706.
Orme, Bryan K. 2020. Getting Started with Conjoint Analysis: Strategies for Product Design and Pricing Research. 4th ed. Manhattan Beach, CA: Research Publishers.
Peer, Eyal, David Rothschild, Andrew Gordon, Zak Evernden, and Ekaterina Damer. 2022. “Data Quality of Platforms and Panels for Online Behavioral Research.” Behavior Research Methods 54 (4): 1643–62. https://doi.org/10.3758/s13428-021-01694-3.
Ramdas, Aaditya, Peter Grünwald, Vladimir Vovk, and Glenn Shafer. 2023. “Game-Theoretic Statistics and Safe Anytime-Valid Inference.” Statistical Science 38 (4): 576–601. https://doi.org/10.1214/23-STS894.
Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin.
Sheppard, Blair H., Jon Hartwick, and Paul R. Warshaw. 1988. “The Theory of Reasoned Action: A Meta-Analysis of Past Research with Recommendations for Modifications and Future Research.” Journal of Consumer Research 15 (3): 325–43. https://doi.org/10.1086/209170.
Simmons, Joseph P., Leif D. Nelson, and Uri Simonsohn. 2021. “Pre-Registration Is a Game Changer. But, Like Random Assignment, It Is Neither Necessary nor Sufficient for Credible Science.” Journal of Consumer Psychology 31 (1): 177–80. https://doi.org/10.1002/jcpy.1207.
Simonsohn, Uri, Leif D. Nelson, and Joseph P. Simmons. 2014. “P-Curve: A Key to the File-Drawer.” Journal of Experimental Psychology: General 143 (2): 534–47. https://doi.org/10.1037/a0033242.
Simonsohn, Uri, Joseph P. Simmons, and Leif D. Nelson. 2020. “Specification Curve Analysis.” Nature Human Behaviour 4 (11): 1208–14. https://doi.org/10.1038/s41562-020-0912-z.

  1. If \(\pi_i = 0\) for some unit, that unit can never enter the sample and no weight can resurrect it; the corresponding stratum is structurally invisible and no amount of reweighting recovers it. This is the formal statement of coverage error in Equation 36.2.↩︎

  2. Russell, W. M. S. and Burch, R. L. (1959), The Principles of Humane Experimental Technique. The 3Rs framework predates and underlies most modern research-ethics review.↩︎