68 The Review Process

Peer review is the institution through which a scholarly field decides what counts as knowledge. A manuscript becomes a published contribution only after anonymous experts have judged it interesting, valid, and important enough to add to the record, and the editor has aggregated those judgments into a decision. For the author the process is a gauntlet; for the field it is a screening technology that trades speed for reliability. This chapter treats reviewing from both sides of the desk. It explains what an editor and a set of reviewers actually do to a submission, how a competent review is structured, and—because reviewing is at bottom a measurement problem—how to think formally about the reliability of the verdicts that emerge.

The chapter is also practical. A marketing PhD will referee dozens of papers before ever publishing many, and learning to review well is the fastest route to understanding what separates a strong paper from a rejected one. We therefore move from the motivation for reviewing, to the mechanics of the editorial pipeline, to the anatomy of a good referee report, to a statistical account of reviewer agreement and the noise inherent in the system. Throughout, the standard exemplar is the Journal of Marketing (JM), whose published review policy is unusually explicit; the principles transfer to the rest of the top-tier marketing journals (Chapter 1).¹

68.1 Why Review

Refereeing is unpaid, anonymous, and invisible on a CV unless deliberately documented. Scholars nonetheless review for three self-interested reasons that align neatly with the field’s interest. The first is reciprocity: peer review is a commons. Every author consumes referee labor and is expected to supply it, and a field that under-supplies reviewing degrades the screening quality of every journal in it. The second is proximity to the frontier: a referee reads work months or years before it appears in print, and reviewing for a strong journal is the cheapest subscription to the research frontier that exists. The third is calibration: nothing teaches the difference between a publishable and an unpublishable paper faster than being forced to articulate, in writing and under an editor’s scrutiny, why a specific manuscript falls on one side of that line. The skills that make a good reviewer—identifying the contribution, stress-testing identification, separating fatal flaws from fixable ones—are exactly the skills that make a good author.

Because the labor is invisible, it must be made visible to count. Reviewing activity can be logged on services that verify referee reports with journals (for example by forwarding the journal’s acknowledgment email to a verification address), producing a citable record for promotion files and annual reviews. The mechanism matters less than the discipline of recording the work; an unrecorded review is a contribution the tenure committee never sees.

68.2 The Editorial Pipeline

A submission does not go straight to reviewers. It passes through a sequence of gates, each of which can terminate the process, and understanding that sequence demystifies both the long latencies authors experience and the points at which a paper is most likely to die. Figure 68.1 traces the path from submission to decision.

flowchart TD
    A[Author submits] --> B{Plagiarism &<br/>duplicate check}
    B -->|fail| X[Returned to author]
    B -->|pass| C{Editor screens<br/>for suitability}
    C -->|unsuitable| D[Desk reject]
    C -->|suitable| E[Assign area /<br/>associate editor]
    E --> F[AE assigns<br/>2-3 reviewers]
    F --> G[Reviewers submit<br/>independent reports]
    G --> H[AE synthesizes<br/>recommendation]
    H --> I{Editor decision}
    I -->|reject| J[Reject]
    I -->|major/minor| K[Revise &<br/>resubmit]
    I -->|accept| L[Accept]
    K --> M[Author revises] --> F

Figure 68.1: The editorial pipeline at a top marketing journal. Each diamond is a gate at which the manuscript can exit; most attrition occurs at desk screening and at the first round of review.

The first gate is administrative and bundles two checks. Submissions are screened by software for plagiarism and for duplicate or concurrent submission, the latter being a serious ethical violation because it wastes scarce referee capacity across journals. A manuscript that clears both checks is read by the editor (or editor-in-chief) for suitability: is the topic in scope, is the contribution plausibly large enough, is the execution credible on its face? A large share of submissions never survive this second gate: they are desk-rejected without external review, which is efficient for the system but makes desk rejection the single most common author experience at selective journals.

A suitable paper is assigned to an area editor or associate editor (AE), a domain specialist who owns the manuscript through the rest of the process. The AE selects two or three reviewers. Reviewer selection is not random: candidates are drawn from the paper’s own reference list (authors of cited work are, by construction, close to the topic), from the author’s suggested-reviewer list, from the AE’s personal knowledge of the subfield, from alternatives suggested by invitees who declined, from web search, and from the publisher’s reviewer database. This selection mechanism has a consequence worth stating plainly: reviewers are chosen for topical proximity, not for statistical representativeness, so the pool of opinions an author receives is a convenience sample of nearby experts, not a random draw from the field.

Reviewers return independent written reports, and the AE synthesizes them into a single recommendation for the editor. Synthesis is not vote counting. A paper is not rejected because two of three reviewers were negative; it is rejected because the AE and editor, reading the reports as arguments, judge the objections dispositive. We return to why vote counting is the wrong aggregation rule in Section 68.6.

68.2.1 Decisions and Turnaround

The editor’s decision falls into a small set of categories. An outright reject ends the relationship with the journal. An accept is rare on a first round. The common non-terminal outcome is a revise and resubmit (R&R), of which two flavors must be distinguished because authors routinely confuse them.

Table 68.1: First-round decision categories. The two R&R variants share a label but differ in whether the revision re-enters with the original referees or starts over.

Outcome	Treated as	Returns to	Author expectation
Accept	Final	Production	Rare on round 1
Reject	Final	—	Most common at selective journals
Revise & resubmit (revise)	Continuation	Same reviewers	Address numbered points; reasonable odds
Reject & resubmit	New submission	New reviewers	Fresh clock, fresh referees, no continuity

The distinction in Table 68.1 is consequential. A revise-and-resubmit returns to the same reviewers, who will check whether their numbered concerns were addressed; continuity of referees rewards a disciplined, point-by-point response. A reject-and-resubmit is administratively a new paper—it resets the clock and is sent to a new set of reviewers—so the prior reports are guidance, not a binding contract, and the author cannot assume the new panel shares the old panel’s priorities. Turnaround norms at top journals target roughly three weeks for a reviewer to return a report after accepting the invitation and on the order of two weeks for the AE to synthesize, with the full first round commonly completing within about three months. Critically, an invited reviewer should decline promptly if unable to serve: a slow non-answer is more damaging to an author than a fast decline, because it delays reassignment.

68.3 What a Manuscript Must Deliver

Before turning to how a review is written, it helps to fix the bar a manuscript is held to. The published criteria at JM reduce to three properties a publishable article must simultaneously possess.

A publishable article offers new knowledge, addresses real-world marketing topics and problems of relevance to identifiable stakeholders, and exhibits validity—a good approximation of the truth.

These three—novelty, relevance, validity—are not independently sufficient. A methodologically flawless paper that tells the field nothing new fails on novelty; a novel and rigorous paper about a phenomenon no stakeholder cares about fails on relevance; a novel and important paper whose inference does not survive scrutiny fails on validity, which is usually fatal because validity, unlike novelty or relevance, cannot be repaired by reframing. Much of what a reviewer does is to locate which of these three a paper is weakest on and judge whether the weakness is fixable within a revision cycle.

68.4 The Process Reviewers Are Held To

The author is not the only party under evaluation; the review itself is held to standards, and an AE who receives a careless or unfair report will discount it. Table 68.2 collects the properties a competent review must satisfy. These are obligations on the referee, not on the manuscript.

Table 68.2: Standards the review process imposes on reviewers (JM editorial practice).

Standard	What it requires of the reviewer
Fairness	Judge the paper the authors wrote, against its stated objective
Competence	Possess reasonable domain knowledge for the claims at issue
Independence	Disclose and recuse on conflicts of interest
Consistency	Weigh interest, empirical rigor, conceptual rigor, relevance, constructiveness, and tolerance for risk
Non-vote-counting	Supply arguments, not a verdict count, for the AE to weigh
Responsibility	Respect the original objective, explain fatal flaws explicitly, preserve confidentiality
Timeliness	Reviewers ~3 weeks post-acceptance; AEs ~2 weeks to synthesize
Roadmap	Tell the author what a successful revision would look like

Two of these deserve emphasis because they are the most often violated. Consistency means a reviewer should reward risk-taking: a paper that attempts something hard and important and partly succeeds can be a larger contribution than a paper that executes a small, safe question flawlessly, and a review that mechanically penalizes ambition starves the field of its most valuable submissions. Responsibility means that when a reviewer believes a flaw is fatal, the report must explain why, in enough detail that the author understands the objection rather than merely the verdict; an unexplained rejection is a failure of the reviewer, not a service to the field. The combination of these standards with a roadmap—an explicit statement of what a successful revision would contain—is what converts a review from a judgment into a constructive instrument.

68.5 Anatomy of a Referee Report

A good report is short, numbered, and organized so the AE can extract a recommendation and the author can extract a to-do list. The conventional length is about two single-spaced pages, and comments are numbered so that author and editor can reference them precisely in correspondence. Figure 68.2 shows the canonical structure, which mirrors the three-paragraph template many editors recommend: contextualize, evaluate, then enumerate issues.

flowchart LR
    A[Synopsis<br/>of paper &<br/>findings] --> B[Contribution<br/>assessment]
    B --> C[Major comments<br/>conceptual /<br/>empirical]
    C --> D[Readability]
    D --> E[Minor comments<br/>numbered]

Figure 68.2: Canonical structure of a referee report. The synopsis and contribution assessment frame the review; the numbered comments, separated into major and minor, carry the actionable content.

The opening paragraph contextualizes the research, conveys the key message of the paper in the reviewer’s own words, and notes what was done well. Restating the paper’s thesis is not a courtesy; it lets the author and editor verify that the reviewer understood the contribution before critiquing it, and a reviewer who cannot summarize the paper accurately has forfeited the authority to reject it. The middle of the report evaluates the work along the dimensions the field cares about: is the paper well written, interesting, and important; are the methods, data, and design appropriate; and—the decisive empirical question—does the data support the findings? The body then enumerates issues, and the most useful reports separate the major comments (factual errors, invalid arguments, identification problems, claims unsupported by the evidence) from minor ones (exposition, length relative to the journal’s norms, missing references), and pair each major objection with a path to improvement wherever one exists.

The distinction between major and minor is not cosmetic. An AE reading three reports needs to know which objections are dispositive and which are polish, and a report that buries a fatal identification problem in a list of typographical corrections has failed to communicate its own conclusion. The closing of a strong report is the roadmap from Table 68.2: a concrete statement of what the authors would have to demonstrate for the reviewer to change a negative recommendation to a positive one.

68.6 Reviewing as Measurement

Peer review is, formally, a measurement procedure: several raters observe the same object (the manuscript) and emit ordinal judgments, and the system aggregates those judgments into a decision. Viewing it this way makes precise why vote counting is the wrong aggregation rule and why even a well-run process is noisy.

Let a manuscript \(i\) have an unobserved latent quality \(q_i \in \mathbb{R}\). Reviewer \(j\) does not observe \(q_i\); she observes a noisy signal and reports a rating \[ r_{ij} = q_i + b_j + \varepsilon_{ij}, \tag{68.1}\] where \(b_j\) is reviewer \(j\)’s systematic leniency or severity (a rater effect that is fixed across the manuscripts she reads and has variance \(\sigma^2_b\) across reviewers) and \(\varepsilon_{ij}\) is idiosyncratic noise with variance \(\sigma^2_\varepsilon\). Equation 68.1 is the linear measurement model that underlies every reliability statistic below. Two features of real review panels map directly onto its terms. First, reviewers differ in severity \(b_j\) (some referees reject almost everything), so a paper’s fate depends partly on the luck of the reviewer draw in Figure 68.1. Second, because reviewers are selected for topical proximity rather than at random, their errors \(\varepsilon_{ij}\) need not be independent: two referees from the same methodological camp may share a blind spot, violating the independence that naive aggregation assumes.

This is precisely why synthesis is not vote counting. Under Equation 68.1, the AE’s task is to estimate \(q_i\) from the reports, and a count of “rejects” throws away exactly the information that distinguishes a severe reviewer’s reflexive negativity (\(b_j \ll 0\)) from a substantive, dispositive objection (a genuinely low \(q_i\)). A reasoned synthesis reads the arguments—the content of \(\varepsilon_{ij}\), in effect—and down-weights severity that is not backed by a defensible flaw. The formal model thus rationalizes the editorial insistence that recommendations carry arguments rather than verdicts.
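A toy calculation can make the contrast concrete. The sketch below is illustrative only: the three ratings, the severity values (which stand in for what an AE might infer from each referee's track record), and the verdict thresholds are all hypothetical numbers chosen for the example, and subtracting a known \(b_j\) is a crude stand-in for the argument-reading an AE actually does.

```r
# Hypothetical inputs: three reports on one manuscript, plus each reviewer's
# severity b_j (negative = severe). None of these values come from real data.
ratings  <- c(r1 = -0.6, r2 = -0.6, r3 = 0.9)
severity <- c(r1 = -1.2, r2 = -0.9, r3 = 0.1)

# Illustrative thresholds that turn a continuous rating into a verdict.
to_verdict <- function(x) {
  cut(x, breaks = c(-Inf, -0.5, 0.5, Inf),
      labels = c("reject", "RnR", "accept"))
}

# Rule 1: vote counting. Take the most frequent categorical verdict.
votes    <- to_verdict(ratings)
majority <- names(which.max(table(votes)))

# Rule 2: estimate q_i by removing each reviewer's severity before averaging,
# i.e. q_hat = mean(r_ij - b_j) under Equation 68.1.
quality_hat      <- mean(ratings - severity)
adjusted_verdict <- as.character(to_verdict(quality_hat))

cat("Majority vote:            ", majority, "\n")
cat("Severity-adjusted estimate:", round(quality_hat, 3), "->", adjusted_verdict, "\n")
```

With these made-up numbers, two of three votes are "reject", yet the severity-adjusted estimate is about 0.567 and clears the accept threshold: the two negative verdicts come from referees who are harsh on everything.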

68.6.1 Quantifying Agreement

How much do reviewers actually agree? For continuous or quasi-continuous ratings, the natural summary is the intraclass correlation coefficient (ICC), the share of total rating variance attributable to true between-manuscript differences rather than to rater effects and noise (Shrout and Fleiss 1979). Under Equation 68.1 with manuscript variance \(\sigma^2_q\), the single-rater reliability is \[ \text{ICC} = \frac{\sigma^2_q}{\sigma^2_q + \sigma^2_b + \sigma^2_\varepsilon}, \tag{68.2}\] which falls toward zero as variation in rater severity (\(\sigma^2_b\)) and idiosyncratic noise (\(\sigma^2_\varepsilon\)) swamp the genuine quality signal. Equation 68.2 also explains why journals assign multiple reviewers: averaging \(m\) independent reports shrinks the noise term to \(\sigma^2_\varepsilon/m\) (and the severity term to \(\sigma^2_b/m\) when each manuscript draws its own reviewers), so the reliability of the mean rating rises with panel size. This is the Spearman–Brown logic, and it holds only if the errors really are independent, which the selection mechanism above puts in doubt.
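The panel-size argument can be checked numerically. The following is a minimal sketch with assumed variance components (the three values are illustrative, not estimates from any journal); it assumes each manuscript draws its own reviewers with independent errors, and the function names `panel_reliability` and `spearman_brown` are hypothetical helpers introduced here.

```r
# Illustrative variance components (assumed values, not empirical estimates).
var_q <- 1.0   # between-manuscript variance of true quality
var_b <- 0.5   # variance of reviewer severity
var_e <- 1.5   # variance of idiosyncratic noise

# Single-rater reliability, Equation 68.2.
icc_single <- var_q / (var_q + var_b + var_e)

# Reliability of the mean of m reports: both error terms shrink by 1/m
# when reviewers are drawn independently for each manuscript.
panel_reliability <- function(m) var_q / (var_q + (var_b + var_e) / m)

# The same quantity via the Spearman-Brown formula.
spearman_brown <- function(icc, m) m * icc / (1 + (m - 1) * icc)

m <- 1:5
reliability <- data.frame(
  reviewers      = m,
  direct         = panel_reliability(m),
  spearman_brown = spearman_brown(icc_single, m)
)
print(round(reliability, 3))
```

Under these assumptions a single report has reliability 1/3, two reports reach 0.5, and three reach 0.6; the two columns agree because Spearman–Brown is an algebraic rearrangement of the same variance ratio. Correlated reviewer errors would make the gain smaller than the table suggests.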

For categorical verdicts (accept / R&R / reject), agreement is measured by Cohen’s \(\kappa\), which corrects the raw proportion of agreement for the agreement expected by chance (Cohen 1960). With observed agreement \(p_o\) and chance agreement \(p_e\), \[ \kappa = \frac{p_o - p_e}{1 - p_e}, \tag{68.3}\] so \(\kappa = 1\) denotes perfect agreement, \(\kappa = 0\) denotes agreement no better than chance, and negative values denote systematic disagreement. The conventional benchmarks for interpreting \(\kappa\) come from Landis and Koch (1977): values in \([0.00, 0.20]\) are “slight,” \([0.21, 0.40]\) “fair,” \([0.41, 0.60]\) “moderate,” \([0.61, 0.80]\) “substantial,” and \([0.81, 1.00]\) “almost perfect.” Empirical studies of peer review across fields routinely land no higher than the fair-to-moderate range, which is sobering: it means the categorical verdict on a borderline paper carries real stochastic content, and the decision an author receives is one draw from a distribution rather than a deterministic reading of merit.
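Equation 68.3 is easy to verify by hand on a small table. The cross-tabulation below is hypothetical (100 manuscripts, two reviewers, counts invented so the arithmetic is clean); it is a sketch of the calculation, not data from any journal.

```r
# Hypothetical verdicts on 100 manuscripts.
# Rows = reviewer A, columns = reviewer B; counts are invented for illustration.
verdicts <- c("reject", "RnR", "accept")
tab <- matrix(c(30, 10,  0,
                10, 20, 10,
                 0, 10, 10),
              nrow = 3, byrow = TRUE,
              dimnames = list(A = verdicts, B = verdicts))

n   <- sum(tab)
p_o <- sum(diag(tab)) / n                       # observed agreement: 60/100
p_e <- sum(rowSums(tab) * colSums(tab)) / n^2   # chance agreement: 3600/10000
kappa_hat <- (p_o - p_e) / (1 - p_e)            # (0.60 - 0.36) / 0.64

cat(sprintf("p_o = %.2f, p_e = %.2f, kappa = %.3f\n", p_o, p_e, kappa_hat))
```

The two reviewers agree on 60 percent of manuscripts, but because both reject and R&R are common verdicts, 36 percent agreement would arise by chance alone. The resulting \(\kappa = 0.375\) sits in the “fair” band despite a raw agreement rate that sounds respectable.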

68.6.2 A Worked Example

The code below simulates the measurement model in Equation 68.1 and reports both reliability statistics, making concrete how rater severity and noise erode agreement. We draw latent qualities for a set of manuscripts, add reviewer-specific severity and idiosyncratic noise, threshold the continuous ratings into accept/R&R/reject verdicts, and compute the ICC of the continuous scores and Cohen’s \(\kappa\) of the categorical verdicts for a pair of reviewers.

set.seed(42)

n_papers    <- 200          # manuscripts
n_reviewers <- 3            # reviewers per paper
sigma_q     <- 1.0          # SD of true manuscript quality
sigma_b     <- 0.6          # SD of reviewer severity (fixed rater effect)
sigma_e     <- 0.8          # SD of idiosyncratic rating noise

# True latent quality of each manuscript, and each reviewer's severity.
q <- rnorm(n_papers, mean = 0, sd = sigma_q)
b <- rnorm(n_reviewers, mean = 0, sd = sigma_b)

# Ratings r_ij = q_i + b_j + e_ij  (rows = papers, cols = reviewers)
ratings <- sapply(seq_len(n_reviewers), function(j) {
  q + b[j] + rnorm(n_papers, mean = 0, sd = sigma_e)
})

# --- ICC via a one-way variance decomposition (reliability of a single rater) ---
grand_mean  <- mean(ratings)
paper_means <- rowMeans(ratings)
ms_between  <- n_reviewers * sum((paper_means - grand_mean)^2) / (n_papers - 1)
ms_within   <- sum((ratings - paper_means)^2) / (n_papers * (n_reviewers - 1))
icc <- (ms_between - ms_within) /
       (ms_between + (n_reviewers - 1) * ms_within)

# --- Cohen's kappa on categorical verdicts for reviewers 1 and 2 ---
# Threshold continuous ratings into reject / R&R / accept.
to_verdict <- function(x) cut(x, breaks = c(-Inf, -0.5, 0.5, Inf),
                              labels = c("reject", "RnR", "accept"))
v1 <- to_verdict(ratings[, 1])
v2 <- to_verdict(ratings[, 2])

tab  <- table(v1, v2)
p_o  <- sum(diag(tab)) / sum(tab)                       # observed agreement
p_e  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2   # chance agreement
kappa <- (p_o - p_e) / (1 - p_e)

cat(sprintf("Single-rater ICC: %.3f\n", icc))
#> Single-rater ICC: 0.301
cat(sprintf("Observed agreement (reviewers 1 & 2): %.3f\n", p_o))
#> Observed agreement (reviewers 1 & 2): 0.415
cat(sprintf("Cohen's kappa (reviewers 1 & 2):      %.3f\n", kappa))
#> Cohen's kappa (reviewers 1 & 2):      0.154

The simulation reproduces the empirical regularity: with realistic severity and noise, single-rater reliability is modest and categorical \(\kappa\) is low. In the run printed above, \(\kappa = 0.154\) for reviewers 1 and 2, which falls just below the “fair” band of Landis and Koch (1977), in the range they label “slight.” Raising sigma_b or sigma_e relative to sigma_q drives both statistics toward zero, which is the formal statement of a familiar complaint (reviewing is noisier than authors wish) and the justification for the institutional responses that surround it: multiple reviewers to average down noise, an AE to discount severity, and the editor’s argument-weighing synthesis to recover signal that any single verdict discards.

68.7 Replication and Transparency

A measurement procedure is only as credible as the evidence it screens, and the top marketing journals have moved decisively toward research transparency: authors of empirical papers are increasingly required to deposit data and analysis code so that published results can be reproduced. From the reviewer’s standpoint this adds a verification task (do the supplied materials regenerate the reported numbers?), and from the field’s standpoint it raises the bar set in Section 68.3 by making the third criterion, validity, externally checkable rather than taken on faith. Reviewers should treat a transparency package as part of the submission and weigh its completeness alongside the manuscript’s arguments.

68.8 The Credibility Revolution and Research Integrity

The transparency requirements above did not arise in a vacuum. They are the operational residue of roughly a decade of self-examination, often called the credibility revolution or the replication crisis, during which psychology, economics, and the quantitative social sciences (marketing among them) reconsidered how much of their published record they actually believed. For a reviewer this history is not a curiosity. It supplies the specific checklist of failure modes a referee is now expected to screen for, and it explains why the field added the institutional machinery (preregistration, registered reports, data deposit, forensic re-analysis) that surrounds the classical review process described above. This section maps that terrain. It begins with the evidence that the literature was less reliable than assumed, turns to the unresolved statistical debates the crisis reopened, then to the forensic methods and integrity cases that data transparency made possible, and closes with a neutral tour of the marketing-specific methodological disagreements a referee will encounter and should be able to frame fairly.

68.8.1 The Replication and Credibility Crisis

The intellectual spark is usually credited to Simmons, Nelson, and Simonsohn (2011), False-Positive Psychology, which showed by simulation and live demonstration that exploiting researcher degrees of freedom (the undisclosed choices about exclusions, conditions, covariates, and stopping rules that every analysis involves) can manufacture statistically significant support for essentially any hypothesis, driving the false-positive rate far above the nominal five percent even when no effect exists. The authors proposed disclosure remedies, which they later condensed into the well-known “twenty-one-word solution.” Gelman and Loken (2014) sharpened the point with the metaphor of the garden of forking paths: a researcher need not consciously run many analyses to inflate error rates, because the single analysis actually run is contingent on the data observed, so the p-value is uninterpretable without a pre-specified plan. The disagreement here is not about whether the mechanism is real (it provably is) but about how much of the published literature it has actually contaminated and how aggressively the field should respond, a distinction worth preserving when a reviewer raises the concern.
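The mechanism is simple enough to reproduce in a few lines. The sketch below is not the authors' simulation: it assumes just one degree of freedom (a choice between two outcome measures), treats the two outcomes as independent, and uses an arbitrary cell size and seed. Every simulated study has a true effect of zero.

```r
set.seed(2011)       # arbitrary seed, for reproducibility only

n_sims     <- 5000   # simulated studies, all with no true effect
n_per_cell <- 20     # participants per condition (hypothetical)

# Each study compares two conditions on two unrelated outcome measures.
p_values <- replicate(n_sims, {
  p_dv1 <- t.test(rnorm(n_per_cell), rnorm(n_per_cell))$p.value
  p_dv2 <- t.test(rnorm(n_per_cell), rnorm(n_per_cell))$p.value
  c(dv1 = p_dv1, dv2 = p_dv2)
})

# Disciplined analyst: commits to the first outcome before seeing data.
honest_rate <- mean(p_values["dv1", ] < 0.05)

# Flexible analyst: reports whichever outcome reached significance.
flexible_rate <- mean(pmin(p_values["dv1", ], p_values["dv2", ]) < 0.05)

# Analytic benchmark for two independent tests at alpha = .05.
analytic_rate <- 1 - (1 - 0.05)^2

cat(sprintf("Pre-specified outcome: %.3f\n", honest_rate))
cat(sprintf("Best of two outcomes:  %.3f\n", flexible_rate))
cat(sprintf("Analytic benchmark:    %.4f\n", analytic_rate))
```

A single undisclosed choice nearly doubles the nominal error rate (the benchmark is 0.0975), and further choices about exclusions, covariates, and stopping rules compound it. A referee who asks whether the reported outcome was the only one collected is probing exactly this gap.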

The empirical question, how replicable is the literature, was addressed by large coordinated projects. The Open Science Collaboration (2015) replicated 100 psychology studies and found that roughly one-third to one-half met various replication criteria, with replication effect sizes about half the size of the originals. Camerer et al. (2018) replicated social-science experiments published in Nature and Science and successfully reproduced about thirteen of twenty-one, again with attenuated effects. A non-replication is ambiguous, however: it can mean the original was a false positive, or that the replication differed in some hidden but consequential way. The Many Labs projects (Klein et al. 2014, 2018) addressed this by running identical protocols across many sites, finding that most effects were robust across samples while a minority varied, evidence that contextual heterogeneity is real but does not by itself account for most non-replications. Figure 68.3 arranges these landmarks. The honest synthesis, which a reviewer can state as genuine convergence rather than as one camp’s opinion, is that replication rates sit materially below 100 percent and effect sizes shrink on replication, while the mix of causes remains debated.

flowchart LR
    A[2011<br/>False-Positive<br/>Psychology] --> B[2014<br/>Garden of<br/>forking paths]
    B --> C[2015<br/>Open Science<br/>Collaboration]
    C --> D[2014-18<br/>Many Labs<br/>1 and 2]
    D --> E[2017<br/>Reproducibility<br/>manifesto]
    E --> F[2018+<br/>Preregistration &<br/>registered reports]

Figure 68.3: A timeline of the credibility revolution. The diagnostic work (forking paths, false positives) precedes the large replication audits, which in turn motivate the institutional reforms (preregistration, registered reports, open-data mandates) that now shape the review process.

If selective reporting and forking paths are the disease, preregistration (committing to hypotheses and analyses before seeing the data) and registered reports (peer review of the design before results are known, with in-principle acceptance) are the leading proposed cures (Nosek et al. 2018; Munafò et al. 2017). Registered reports relocate the review process itself: the referee evaluates the question and the design rather than the result, which removes the incentive to reverse-engineer a clean story from a messy dataset. Early outcome evidence is encouraging; Soderberg et al. (2021) found that registered reports scored higher on rigor and quality dimensions than comparison papers. The mature position is not that preregistration is mandatory for all work but that it disciplines confirmatory inference without displacing exploratory analysis, provided the exploratory portion is labeled honestly. The journal-level counterpart is the Transparency and Openness Promotion (TOP) framework (Nosek et al. 2015), which the data-deposit norms in the preceding section instantiate.

68.8.2 Statistical-Significance Debates

The crisis reopened a foundational and still-unresolved argument about statistical significance itself. The American Statistical Association took the unusual step of issuing a formal statement (Wasserstein and Lazar 2016) clarifying what a p-value is not: it is not the probability that the null hypothesis is true, and statistical significance is not practical importance. There is broad agreement on these negative claims and continued disagreement on what, if anything, should replace null-hypothesis significance testing; the ASA’s own follow-up (Wasserstein, Schirm, and Lazar 2019) surveyed the options and pointedly declined to endorse a single replacement.

Three reform proposals now compete, and a referee should recognize them as live rather than settled. The redefine camp (Benjamin et al. 2018) proposes lowering the default threshold for new discoveries from .05 to .005, treating p-values in the interval between as merely suggestive. The justify camp (Lakens et al. 2018) replies that no single threshold fits all contexts and that researchers should transparently justify the alpha they adopt given the costs of each kind of error. The abandon camp (Amrhein, Greenland, and McShane 2019; McShane et al. 2019), the latter led in part by a marketing and statistics scholar, argues for retiring the dichotomous significant-versus-not verdict entirely and treating p-values as one continuous piece of evidence among many. Table 68.3 summarizes the three positions. The accurate summary is near-universal agreement that a bright-line p < .05 is misused, alongside genuine disagreement on the remedy.

Table 68.3: Three competing responses to the misuse of statistical significance. The field has not converged on a winner.

Position	Core proposal	Representative work
Redefine	Lower the default discovery threshold to .005; .005–.05 is “suggestive”	Benjamin et al. (2018)
Justify	No universal threshold; justify alpha from error costs and study goals	Lakens et al. (2018)
Abandon	Retire the significant/not-significant dichotomy; treat p as continuous evidence	Amrhein, Greenland, and McShane (2019); McShane et al. (2019)

Running underneath this is the longer-standing Bayesian versus frequentist question: whether to quantify evidence with p-values and confidence intervals or with posterior probabilities and Bayes factors. The Bayesian side (Rouder et al. 2009; Wagenmakers et al. 2018) emphasizes that Bayes factors can quantify evidence for a null hypothesis and sidestep several p-value pathologies, while the frequentist-reform side is well represented by the ASA materials above. Modern practice increasingly treats the two as complementary and reports both; the chapters on Bayesian modeling develop the machinery a reviewer would check.

68.8.3 Forensic Data Analysis and Integrity

Data transparency did more than let reviewers regenerate reported numbers; it enabled a class of forensic methods that probe published results for arithmetic and statistical impossibilities, and it made possible the independent re-analyses associated with the blog Data Colada. The methods matter more than any individual case, so we lead with the toolkit, summarized in Table 68.4.

Table 68.4: The forensic toolkit for screening published quantitative results. SPRITE, a related raw-data reconstruction method, circulates as a preprint and is mentioned here only by its lineage to GRIM.

Tool	What it checks	Source
p-curve	Whether a set of significant findings reflects a real effect or selective reporting, from the distribution of significant p-values	Simonsohn, Nelson, and Simmons (2014)
Specification-curve / multiverse	Runs and displays all defensible analytic choices rather than one, exposing fragility	Steegen et al. (2016); Simonsohn, Simmons, and Nelson (2020)
GRIM	Whether a reported mean is arithmetically possible given the sample size and integer responses	Brown and Heathers (2017)
Excess-uniformity checks	Whether improbable similarity across conditions signals fabricated data	Simonsohn (2013)

P-curve (Simonsohn, Nelson, and Simmons 2014) infers whether a body of significant findings reflects a genuine effect or selective reporting by examining the shape of the distribution of significant p-values. Specification-curve or multiverse analysis (Steegen et al. 2016; Simonsohn, Simmons, and Nelson 2020) formalizes running and reporting all reasonable analytic specifications rather than a single favored one, so a reader can see whether a result depends on a particular defensible choice. GRIM (Brown and Heathers 2017) performs a deceptively simple arithmetic check: given a sample size and integer-valued responses, only certain means are possible, and a reported mean that is not on that list signals an error. And Simonsohn (2013) showed that excessive uniformity or improbable similarity across conditions can reveal fabricated data from the statistics alone, an argument for routine data posting. These are exactly the checks a transparency package now lets a referee, or a later reader, perform.
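Of the four tools, GRIM is the one a referee can run in seconds. The function below is a minimal sketch of the idea, not the published implementation: `grim_consistent` is a hypothetical name, the check assumes integer-valued responses and a mean reported to a fixed number of decimals, and it ignores edge cases such as alternative rounding conventions or multi-item scales.

```r
# GRIM-style check: can `reported_mean` arise from `n` integer responses?
# Hypothetical helper; assumes the mean was rounded to `digits` decimals.
grim_consistent <- function(reported_mean, n, digits = 2) {
  # The sum of n integers is an integer, so only the whole-number totals
  # adjacent to the implied sum can have produced the reported mean.
  implied_sum <- reported_mean * n
  candidates  <- c(floor(implied_sum), ceiling(implied_sum))
  # Compare at the reported precision, with a tolerance for floating point.
  tolerance <- 10^(-(digits + 2))
  any(abs(round(candidates / n, digits) - reported_mean) < tolerance)
}

# With n = 28, a total of 145 gives 5.18 and a total of 146 gives 5.21,
# so no set of 28 integer responses can average 5.19.
grim_consistent(5.18, n = 28)   # TRUE
grim_consistent(5.19, n = 28)   # FALSE
```

The test has power only when the sample is small relative to the reporting precision (roughly \(n < 100\) for two decimals), since larger samples can produce almost any two-decimal mean. A failed check signals an error somewhere in the reported numbers; it does not by itself distinguish a typo from anything more serious.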

The remainder of this subsection describes a small number of prominent integrity cases. It does so in a deliberately measured, court-reporter register: every factual claim is anchored only to public records, namely retraction notices, official investigation outcomes, and dated published analyses. Where matters remain contested or are the subject of litigation, that is stated as such, and no allegation is asserted as settled fact. The pedagogical point is what the forensic methods and data transparency can reveal, not a verdict on any person.

Consider first the field-experiment literature on honesty pledges. A 2012 article in PNAS (Shu et al. 2012) reported that signing an honesty declaration before rather than after reporting information reduced dishonesty, using auto-insurance mileage data among other studies. In 2020 a registered set of replications co-authored by several of the original authors did not reproduce the effect (Kristal et al. 2020). In 2021, Data Colada published a post (post number 98, dated August 2021) reporting anomalies in the field-experiment data underlying the 2012 paper, and PNAS subsequently published a retraction of the article (Proceedings of the National Academy of Sciences 2021). The accurate public-record summary is therefore: the 2012 article was retracted in 2021; a replication by several of the original authors had failed to reproduce the effect; and an independent analysis reported anomalies in the underlying field data. Public reporting has attributed the questioned data to a particular source, but because responsibility is contested and related matters have touched litigation, the book assigns no blame.

A second sequence concerns several papers co-authored by one researcher. A 2015 article in Psychological Science (Gino, Kouchaki, and Galinsky 2015) was retracted in 2023, a fact recorded in the journal’s published retraction notice (Psychological Science 2023). In 2023 Data Colada published a four-part series (“Data Falsificada,” posts 109 through 112) reporting anomalies across several co-authored papers; the relevant institution conducted an internal investigation, and several papers were retracted. Defamation litigation related to these allegations has been reported in public sources, and elements remain contested. The book states only that the matter has been the subject of litigation and that the allegations remain contested, without characterizing the merits. The durable lesson for a reviewer is structural rather than personal: arithmetic checks like GRIM, distributional checks like p-curve, and above all the routine posting of raw data are what make independent verification possible, which is precisely why the transparency requirements in the preceding section have teeth.

68.8.4 A Tour of Marketing’s Methodological Debates

A marketing referee evaluates work across an unusually wide methodological range, and several long-running debates recur often enough that a reviewer should be able to frame each one neutrally, recognizing both the contribution and its standard critique rather than mechanically penalizing a method some camp dislikes. Each debate below is stated in two or three sentences, with a pointer to the chapter where it properly lives; the home chapters develop the debates in full.

Structural versus reduced-form modeling. Should a study estimate the primitives of a behavioral or economic model (structural), enabling counterfactual and welfare simulation, or prioritize clean identification of a specific causal effect through design and quasi-experiment (reduced-form)? The economics credibility revolution (Angrist and Pischke 2010) sharpened design-based inference, while structural advocates (Chintagunta et al. 2006) argue structural models answer policy questions reduced-form cannot, with internal cautions on both sides about weak instruments (Rossi 2014) and undisciplined structural assumptions (Mazzeo et al. 2006). The causal-inference and structural-modeling chapters develop the trade-off.

Formative versus reflective measurement. In reflective measurement the latent construct causes its indicators; in formative measurement the indicators compose the construct, so dropping one changes its meaning, and misspecifying the direction biases structural estimates (Bollen and Lennox 1991; Jarvis, MacKenzie, and Podsakoff 2003; Diamantopoulos, Riefler, and Roth 2008). A skeptical strand questions whether formative “constructs” are latent variables at all and proposes content-validity-first alternatives (Rossiter 2002). This debate is developed where constructs and variables are distinguished (Chapter 3) and in the measurement-scales chapter.

The PLS-SEM controversy. Partial Least Squares path modeling is popular in some marketing subfields and distrusted in others; critics argue it lacks a true latent-variable model and yields biased estimates with unreliable fit heuristics (Rönkkö and Evermann 2013), while proponents reply that the critiques target outdated versions and that PLS suits prediction-oriented, formative, and small-sample work (Henseler et al. 2014). This is an unusually clean critique-plus-rejoinder exchange in one journal, developed in the structural-models chapter.

Measuring advertising returns. Can observational attribution methods recover the causal effect of advertising on sales, or are large randomized experiments effectively required? Evidence that ad effects are tiny relative to sales variance (Lewis and Rao 2015), that brand-keyword paid search delivered near-zero incremental value in a large field experiment (Blake, Nosko, and Tadelis 2015), and that observational methods diverge substantially from experimental benchmarks (Gordon et al. 2019) pushes toward experiments for incrementality, while attribution modeling (Li and Kannan 2014; Berman 2018) remains useful for allocation under stated assumptions. The advertising and causal-inference chapters develop the convergence.

The nudge-effectiveness debate. How large and reliable are choice-architecture “nudge” effects in aggregate? A large meta-analysis reported a moderate positive average effect (Mertens et al. 2022), while reanalyses argued that the effect is not distinguishable from zero once publication bias is modeled (Maier et al. 2022) and that heterogeneity precludes expecting uniform effects (Szaszi et al. 2022). This is as much a dispute about meta-analytic method as about nudges, and it is developed in the nudges chapter.

Mediation-analysis critiques. Bootstrap and Baron-Kenny-style mediation from observational designs is ubiquitous in consumer research, but critics argue that without experimental manipulation of the mediator a mediation claim is causally unidentified (Bullock, Green, and Ha 2010), while a constructive-reform position urges better practice (manipulated mediators, moderation-of-process, sensitivity analysis) rather than abandonment (Pieters 2017). The causal-inference chapter develops the identification requirements.

The common-method-bias debate. When predictor and outcome come from the same respondent at one time, shared method can inflate observed correlations; one position catalogs this as a serious, pervasive threat requiring design and statistical remedies (Podsakoff et al. 2003), while another argues the blanket assumption is overstated and that some popular post-hoc corrections do more harm than good (Spector 2006). The surveys and measurement-scales chapters develop the practical safeguards.

Online-panel and MTurk data quality. Crowdsourced samples democratized data collection but raised concerns about non-naivete, inattentive responding, and bots; early validations argued the data are reliable and diverse (Buhrmester, Kwang, and Gosling 2011; Paolacci and Chandler 2014), later work documented worker non-naivete (Chandler, Mueller, and Paolacci 2014), and a recent exchange disputes whether alarming invalid-response rates reflect inherent platform failure (Webb and Tangney 2022) or remediable design and screening choices (Keith and McKay 2024). The surveys and data chapters develop the concrete safeguards (attention checks, geolocation and bot screening, preregistered exclusion rules).

68.9 Key Takeaways

Reviewing is reciprocal labor that also buys proximity to the frontier and calibration about what makes a paper publishable; record it so it counts.
The editorial pipeline (Figure 68.1) is a sequence of gates—plagiarism and duplicate checks, editor suitability screening, AE assignment, reviewer synthesis—and most attrition happens at desk screening and the first round.
A publishable paper must simultaneously offer novelty, relevance, and validity; validity failures are usually fatal because they cannot be reframed away.
A good report is short, numbered, separates major from minor comments, and closes with a roadmap; it supplies arguments, never a verdict count.
Review is a measurement problem (Equation 68.1): rater severity and noise make verdicts stochastic, single-rater reliability (Equation 68.2) modest, and categorical agreement (Equation 68.3) typically only fair-to-moderate—which is exactly why journals use multiple reviewers and an argument-weighing editor rather than a vote.
The credibility revolution (Section 68.8) supplies the modern referee’s failure-mode checklist: forking paths and researcher degrees of freedom inflate false positives, replication rates sit below 100 percent (Figure 68.3), the significance debate (Table 68.3) remains unresolved, and forensic tools plus open data are what make independent verification, and the integrity record, possible.

Amrhein, Valentin, Sander Greenland, and Blake McShane. 2019. “Scientists Rise up Against Statistical Significance.” Nature 567 (7748): 305–7. https://doi.org/10.1038/d41586-019-00857-9.

Angrist, Joshua D., and Jörn-Steffen Pischke. 2010. “The Credibility Revolution in Empirical Economics: How Better Research Design Is Taking the Con Out of Econometrics.” Journal of Economic Perspectives 24 (2): 3–30. https://doi.org/10.1257/jep.24.2.3.

Benjamin, Daniel J., James O. Berger, Magnus Johannesson, Brian A. Nosek, E.-J. Wagenmakers, Richard Berk, et al. 2018. “Redefine Statistical Significance.” Nature Human Behaviour 2 (1): 6–10. https://doi.org/10.1038/s41562-017-0189-z.

Berman, Ron. 2018. “Beyond the Last Touch: Attribution in Online Advertising.” Marketing Science 37 (5): 771–92. https://doi.org/10.1287/mksc.2018.1104.

Blake, Thomas, Chris Nosko, and Steven Tadelis. 2015. “Consumer Heterogeneity and Paid Search Effectiveness: A Large-Scale Field Experiment.” Econometrica 83 (1): 155–74. https://doi.org/10.3982/ecta12423.

Bollen, Kenneth, and Richard Lennox. 1991. “Conventional Wisdom on Measurement: A Structural Equation Perspective.” Psychological Bulletin 110 (2): 305–14. https://doi.org/10.1037/0033-2909.110.2.305.

Brown, Nicholas J. L., and James A. J. Heathers. 2017. “The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology.” Social Psychological and Personality Science 8 (4): 363–69. https://doi.org/10.1177/1948550616673876.

Buhrmester, Michael, Tracy Kwang, and Samuel D. Gosling. 2011. “Amazon’s Mechanical Turk: A New Source of Inexpensive, yet High-Quality, Data?” Perspectives on Psychological Science 6 (1): 3–5. https://doi.org/10.1177/1745691610393980.

Bullock, John G., Donald P. Green, and Shang E. Ha. 2010. “Yes, but What’s the Mechanism? (Don’t Expect an Easy Answer).” Journal of Personality and Social Psychology 98 (4): 550–58. https://doi.org/10.1037/a0018933.

Camerer, Colin F., Anna Dreber, Felix Holzmeister, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, et al. 2018. “Evaluating the Replicability of Social Science Experiments in Nature and Science Between 2010 and 2015.” Nature Human Behaviour 2 (9): 637–44. https://doi.org/10.1038/s41562-018-0399-z.

Chandler, Jesse, Pam Mueller, and Gabriele Paolacci. 2014. “Nonnaïveté Among Amazon Mechanical Turk Workers: Consequences and Solutions for Behavioral Researchers.” Behavior Research Methods 46 (1): 112–30. https://doi.org/10.3758/s13428-013-0365-7.

Chintagunta, Pradeep, Tülin Erdem, Peter E. Rossi, and Michel Wedel. 2006. “Structural Modeling in Marketing: Review and Assessment.” Marketing Science 25 (6): 604–16.

Cohen, Jacob. 1960. “A Coefficient of Agreement for Nominal Scales.” Educational and Psychological Measurement 20 (1): 37–46. https://doi.org/10.1177/001316446002000104.

Diamantopoulos, Adamantios, Petra Riefler, and Katharina P. Roth. 2008. “Advancing Formative Measurement Models.” Journal of Business Research 61 (12): 1203–18. https://doi.org/10.1016/j.jbusres.2008.01.009.

Gelman, Andrew, and Eric Loken. 2014. “The Statistical Crisis in Science.” American Scientist 102 (6): 460–65. https://doi.org/10.1511/2014.111.460.

Gino, Francesca, Maryam Kouchaki, and Adam D. Galinsky. 2015. “The Moral Virtue of Authenticity: How Inauthenticity Produces Feelings of Immorality and Impurity.” Psychological Science 26 (7): 983–96. https://doi.org/10.1177/0956797615575277.

Gordon, Brett R., Florian Zettelmeyer, Neha Bhargava, and Dan Chapsky. 2019. “A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook.” Marketing Science 38 (2): 193–225. https://doi.org/10.1287/mksc.2018.1135.

Henseler, Jörg, Theo K. Dijkstra, Marko Sarstedt, Christian M. Ringle, Adamantios Diamantopoulos, Detmar W. Straub, David J. Ketchen Jr., Joseph F. Hair, G. Tomas M. Hult, and Roger J. Calantone. 2014. “Common Beliefs and Reality about PLS: Comments on Rönkkö and Evermann (2013).” Organizational Research Methods 17 (2): 182–209. https://doi.org/10.1177/1094428114526928.

Jarvis, Cheryl Burke, Scott B. MacKenzie, and Philip M. Podsakoff. 2003. “A Critical Review of Construct Indicators and Measurement Model Misspecification in Marketing and Consumer Research.” Journal of Consumer Research 30 (2): 199–218. https://doi.org/10.1086/376806.

Keith, Melissa G., and Alexander S. McKay. 2024. “Too Anecdotal to Be True? Data Quality Through the Lens of Two MTurk Studies.” Perspectives on Psychological Science. https://doi.org/10.1177/17456916241234328.

Klein, Richard A., Kate A. Ratliff, Michelangelo Vianello, Reginald B. Adams Jr., Štěpán Bahník, Michael J. Bernstein, et al. 2014. “Investigating Variation in Replicability: A ‘Many Labs’ Replication Project.” Social Psychology 45 (3): 142–52. https://doi.org/10.1027/1864-9335/a000178.

Klein, Richard A., Michelangelo Vianello, Fred Hasselman, Byron G. Adams, Reginald B. Adams Jr., Sinan Alper, et al. 2018. “Many Labs 2: Investigating Variation in Replicability Across Samples and Settings.” Advances in Methods and Practices in Psychological Science 1 (4): 443–90. https://doi.org/10.1177/2515245918810225.

Kristal, Ariella S., Ashley V. Whillans, Max H. Bazerman, Francesca Gino, Lisa L. Shu, Nina Mazar, and Dan Ariely. 2020. “Signing at the Beginning Versus at the End Does Not Decrease Dishonesty.” Proceedings of the National Academy of Sciences 117 (13): 7103–7. https://doi.org/10.1073/pnas.1911695117.

Lakens, Daniel, Federico G. Adolfi, Casper J. Albers, Farid Anvari, Matthew A. J. Apps, Shlomo E. Argamon, et al. 2018. “Justify Your Alpha.” Nature Human Behaviour 2 (3): 168–71. https://doi.org/10.1038/s41562-018-0311-x.

Landis, J. Richard, and Gary G. Koch. 1977. “The Measurement of Observer Agreement for Categorical Data.” Biometrics 33 (1): 159–74. https://doi.org/10.2307/2529310.

Lewis, Randall A., and Justin M. Rao. 2015. “The Unfavorable Economics of Measuring the Returns to Advertising.” The Quarterly Journal of Economics 130 (4): 1941–73. https://doi.org/10.1093/qje/qjv023.

Li, Hongshuang (Alice), and P. K. Kannan. 2014. “Attributing Conversions in a Multichannel Online Marketing Environment: An Empirical Model and a Field Experiment.” Journal of Marketing Research 51 (1): 40–56. https://doi.org/10.1509/jmr.13.0050.

Maier, Maximilian, František Bartoš, T. D. Stanley, David R. Shanks, Adam J. L. Harris, and Eric-Jan Wagenmakers. 2022. “No Evidence for Nudging After Adjusting for Publication Bias.” Proceedings of the National Academy of Sciences 119 (31): e2200300119. https://doi.org/10.1073/pnas.2200300119.

Mazzeo, Michael J., Katja Seim, Mauricio Varela, Sridhar Narayanan, Mark D. Manuszak, et al. 2006. “Marketing Structural Models: ‘Keep It Real’.” Marketing Science 25 (6): 629–32. https://doi.org/10.1287/mksc.1060.0235.

McShane, Blakeley B., David Gal, Andrew Gelman, Christian Robert, and Jennifer L. Tackett. 2019. “Abandon Statistical Significance.” The American Statistician 73 (sup1): 235–45. https://doi.org/10.1080/00031305.2018.1527253.

Mertens, Stephanie, Mario Herberz, Ulf J. J. Hahnel, and Tobias Brosch. 2022. “The Effectiveness of Nudging: A Meta-Analysis of Choice Architecture Interventions Across Behavioral Domains.” Proceedings of the National Academy of Sciences 119 (1): e2107346118. https://doi.org/10.1073/pnas.2107346118.

Munafò, Marcus R., Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 1 (1): 0021. https://doi.org/10.1038/s41562-016-0021.

Nosek, Brian A., George Alter, George C. Banks, Denny Borsboom, Sara D. Bowman, Steven J. Breckler, et al. 2015. “Promoting an Open Research Culture.” Science 348 (6242): 1422–25. https://doi.org/10.1126/science.aab2374.

Nosek, Brian A., Charles R. Ebersole, Alexander C. DeHaven, and David T. Mellor. 2018. “The Preregistration Revolution.” Proceedings of the National Academy of Sciences 115 (11): 2600–2606. https://doi.org/10.1073/pnas.1708274114.

Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” Science 349 (6251): aac4716. https://doi.org/10.1126/science.aac4716.

Paolacci, Gabriele, and Jesse Chandler. 2014. “Inside the Turk: Understanding Mechanical Turk as a Participant Pool.” Current Directions in Psychological Science 23 (3): 184–88. https://doi.org/10.1177/0963721414531598.

Pieters, Rik. 2017. “Meaningful Mediation Analysis: Plausible Causal Inference and Informative Communication.” Journal of Consumer Research 44 (3): 692–716. https://doi.org/10.1093/jcr/ucx081.

Podsakoff, Philip M., Scott B. MacKenzie, Jeong-Yeon Lee, and Nathan P. Podsakoff. 2003. “Common Method Biases in Behavioral Research: A Critical Review of the Literature and Recommended Remedies.” Journal of Applied Psychology 88 (5): 879–903. https://doi.org/10.1037/0021-9010.88.5.879.

Proceedings of the National Academy of Sciences. 2021. “Retraction for Shu Et Al., Signing at the Beginning Makes Ethics Salient and Decreases Dishonest Self-Reports in Comparison to Signing at the End.” Proceedings of the National Academy of Sciences 118 (38): e2115397118. https://doi.org/10.1073/pnas.2115397118.

Psychological Science. 2023. “Retraction Notice: The Moral Virtue of Authenticity.” Psychological Science 34 (9): 1063. https://doi.org/10.1177/09567976231187596.

Rönkkö, Mikko, and Joerg Evermann. 2013. “A Critical Examination of Common Beliefs about Partial Least Squares Path Modeling.” Organizational Research Methods 16 (3): 425–48. https://doi.org/10.1177/1094428112474693.

Rossi, Peter E. 2014. “Even the Rich Can Make Themselves Poor: A Critical Examination of IV Methods in Marketing Applications.” Marketing Science 33 (5): 655–72. https://doi.org/10.1287/mksc.2014.0860.

Rossiter, John R. 2002. “The C-OAR-SE Procedure for Scale Development in Marketing.” International Journal of Research in Marketing 19 (4): 305–35. https://doi.org/10.1016/S0167-8116(02)00097-6.

Rouder, Jeffrey N., Paul L. Speckman, Dongchu Sun, Richard D. Morey, and Geoffrey Iverson. 2009. “Bayesian t Tests for Accepting and Rejecting the Null Hypothesis.” Psychonomic Bulletin & Review 16 (2): 225–37. https://doi.org/10.3758/PBR.16.2.225.

Shrout, Patrick E., and Joseph L. Fleiss. 1979. “Intraclass Correlations: Uses in Assessing Rater Reliability.” Psychological Bulletin 86 (2): 420–28. https://doi.org/10.1037/0033-2909.86.2.420.

Shu, Lisa L., Nina Mazar, Francesca Gino, Dan Ariely, and Max H. Bazerman. 2012. “Signing at the Beginning Makes Ethics Salient and Decreases Dishonest Self-Reports in Comparison to Signing at the End.” Proceedings of the National Academy of Sciences 109 (38): 15197–200. https://doi.org/10.1073/pnas.1209746109.

Simmons, Joseph P., Leif D. Nelson, and Uri Simonsohn. 2011. “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.” Psychological Science 22 (11): 1359–66. https://doi.org/10.1177/0956797611417632.

Simonsohn, Uri. 2013. “Just Post It: The Lesson from Two Cases of Fabricated Data Detected by Statistics Alone.” Psychological Science 24 (10): 1875–88. https://doi.org/10.1177/0956797613480366.

Simonsohn, Uri, Leif D. Nelson, and Joseph P. Simmons. 2014. “P-Curve: A Key to the File-Drawer.” Journal of Experimental Psychology: General 143 (2): 534–47. https://doi.org/10.1037/a0033242.

Simonsohn, Uri, Joseph P. Simmons, and Leif D. Nelson. 2020. “Specification Curve Analysis.” Nature Human Behaviour 4 (11): 1208–14. https://doi.org/10.1038/s41562-020-0912-z.

Soderberg, Courtney K., Timothy M. Errington, Sarah R. Schiavone, Julia Bottesini, Felix Singleton Thorn, Simine Vazire, Kevin M. Esterling, and Brian A. Nosek. 2021. “Initial Evidence of Research Quality of Registered Reports Compared with the Standard Publishing Model.” Nature Human Behaviour 5 (8): 990–97. https://doi.org/10.1038/s41562-021-01142-4.

Spector, Paul E. 2006. “Method Variance in Organizational Research: Truth or Urban Legend?” Organizational Research Methods 9 (2): 221–32. https://doi.org/10.1177/1094428105284955.

Steegen, Sara, Francis Tuerlinckx, Andrew Gelman, and Wolf Vanpaemel. 2016. “Increasing Transparency Through a Multiverse Analysis.” Perspectives on Psychological Science 11 (5): 702–12. https://doi.org/10.1177/1745691616658637.

Szaszi, Barnabas, Anthony Higney, Aaron Charlton, Andrew Gelman, Ignazio Ziano, Balazs Aczel, Daniel G. Goldstein, David S. Yeager, and Elizabeth Tipton. 2022. “No Reason to Expect Large and Consistent Effects of Nudge Interventions.” Proceedings of the National Academy of Sciences 119 (31): e2200732119. https://doi.org/10.1073/pnas.2200732119.

Wagenmakers, Eric-Jan, Maarten Marsman, Tahira Jamil, Alexander Ly, Josine Verhagen, Jonathon Love, et al. 2018. “Bayesian Inference for Psychology. Part I: Theoretical Advantages and Practical Ramifications.” Psychonomic Bulletin & Review 25 (1): 35–57. https://doi.org/10.3758/s13423-017-1343-3.

Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.

Wasserstein, Ronald L., Allen L. Schirm, and Nicole A. Lazar. 2019. “Moving to a World Beyond ‘p < 0.05’.” The American Statistician 73 (sup1): 1–19. https://doi.org/10.1080/00031305.2019.1583913.

Webb, Margaret A., and June P. Tangney. 2022. “Too Good to Be True: Bots and Bad Data from Mechanical Turk.” Perspectives on Psychological Science. https://doi.org/10.1177/17456916221120027.

¹ The Journal of Marketing documents its review philosophy and its data-transparency requirements in publicly posted editorial statements. The norms described here (turnaround targets, the criteria reviewers weigh, the structure of a report) are paraphrased from those statements and from the broader editorial practice of the marketing top-4 (JM, JMR, Marketing Science, Journal of Consumer Research).