44 Sarcasm and Figurative Language
Consumers do not write literally. A one-star review that reads “Absolutely fantastic, broke on day two, exactly what I wanted” is unambiguous to a human and catastrophic for a sentiment classifier that scores tokens at face value. Sarcasm and the broader family of figurative language (irony, hyperbole, rhetorical questions, understatement) are pervasive in the consumer text that marketing researchers now mine at scale: product reviews, social-media posts, forum threads, and customer-service transcripts. This chapter treats figurative language not as a curiosity but as a measurement problem. In the terms of Chapter 3, sarcasm is not itself a construct; it is a threat to the construct validity of sentiment. The latent construct is the consumer’s true valence, the text is its observable proxy, and figurative language is a systematic distortion in the mapping between them. When the surface form of a sentence systematically inverts or distorts its intended meaning, any pipeline that maps words to valence inherits a bias whose sign and magnitude depend on the prevalence and detectability of the figure. The chapter’s purpose is to make that bias precise, to show where it does and does not threaten the inferences marketers draw from text, and to give the reader a defensible, reproducible way to handle it.
The stakes are concrete. Firms increasingly treat the valence of online word of mouth as a real-time signal of product quality, brand health, and demand (Tirunillai and Tellis 2012; Netzer, Lattin, and Srinivasan 2008), and review valence is correlated with sales (Chevalier and Goolsbee 2003). If sarcasm is more common in negative experiences than positive ones—which the data suggest—then a naïve sentiment model does not merely add noise; it adds signed error that is correlated with the very construct the firm wants to measure. The result is an attenuated or even sign-flipped estimate of how sentiment relates to outcomes. Honest text analytics requires knowing when that happens.
The chapter is deliberately scoped and candid about its limits. Sarcasm detection is an open research problem; no method, human or machine, achieves high reliability on isolated short texts, because sarcasm is fundamentally contextual—it depends on shared knowledge, prosody, and speaker intent that the text alone may not carry. We therefore proceed from intuition to formalism: first defining figurative language and the construct of verbal irony, then modeling sarcasm as a hidden variable that corrupts sentiment measurement, then surveying detection methods with their estimators and failure modes, and finally giving practical guidance—often, the right move is to quantify and bound the bias rather than to chase a brittle classifier.
44.1 What Figurative Language Is
Figurative language is any use of words whose intended meaning departs from their literal, compositional meaning. The departure is not an error: speaker and listener both understand that the literal reading is to be overridden, and the gap between literal and intended meaning is itself the vehicle of communication—it conveys attitude, humor, social bonding, or emphasis that the literal statement could not.
Verbal irony is the use of an utterance whose intended meaning is opposed to, or markedly different from, its literal meaning, with the speaker intending the listener to recognize the opposition. Sarcasm is verbal irony deployed with a critical or contemptuous edge—irony aimed, usually, at a target.
The distinction matters less for measurement than the shared mechanism: a valence reversal between surface and intent. The constructs relevant to consumer text form a small family, summarized in Table 44.1.
Table 44.1: A taxonomy of figurative devices in consumer text, ordered roughly by how badly each corrupts a token-level sentiment score.
| Figure | Surface→intent relation | Consumer-text example | Threat to sentiment |
|---|---|---|---|
| Verbal irony / sarcasm | Inversion (often) | “Great, another update that breaks the app.” | Severe: flips sign |
| Hyperbole | Amplification | “Worst purchase in the history of mankind.” | Moderate: inflates magnitude |
| Understatement (litotes) | Attenuation | “Not exactly a bargain.” | Moderate: hides magnitude |
| Rhetorical question | Assertion as question | “Who designed this, a toddler?” | Mild–moderate |
| Metaphor / idiom | Non-literal mapping | “This blender is a tank.” | Mild: lexicon miss |
Two properties make sarcasm the hardest case. First, it is context-dependent: the same sentence (“Love waiting on hold for an hour”) is sincere or sarcastic depending on world knowledge the reader supplies. Second, it is incongruity-based: sarcasm typically juxtaposes a positive surface against a negative situation (or vice versa), so its signature is an internal contradiction rather than any particular vocabulary (Riloff et al. 2013; Joshi, Sharma, and Bhattacharyya 2015). Both properties defeat bag-of-words methods, which discard exactly the contextual and structural information sarcasm encodes.
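A minimal sketch of why order-free scoring fails, assuming a made-up four-word lexicon and two invented reviews (none of these names or texts come from a validated resource). A bag-of-words scorer gives a sincere mixed review and a sarcastic one the same score, because it sees only which polar words occur, not how they relate.

```r
# Hypothetical lexicon, for illustration only.
pos_words <- c("great", "love")
neg_words <- c("broke", "hate")

# Bag-of-words score: (# positive tokens) - (# negative tokens); order is ignored.
bow_score <- function(text) {
  toks <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(toks %in% pos_words) - sum(toks %in% neg_words)
}

sincere   <- "Great screen, but the hinge broke."   # genuinely mixed
sarcastic <- "Great, the hinge broke on day one."   # negative intent

scores <- c(sincere = bow_score(sincere), sarcastic = bow_score(sarcastic))
print(scores)
#>   sincere sarcastic 
#>         0         0
```

Both documents score zero, yet a reader assigns them different valence. The information that separates them lives in structure and context, which is what the scorer throws away.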
44.2 Prevalence in Consumer Text
How common is sarcasm in the corpora marketers actually analyze? The honest answer is that prevalence is not known with any precision and is heterogeneous across platforms, so any single number should be treated with suspicion. Reported rates vary by an order of magnitude depending on the platform, the labeling protocol, and, crucially, how sarcasm is operationalized (Joshi, Bhattacharyya, and Carman 2017).
A few regularities are robust enough to state, with the appropriate hedges. Sarcasm is more frequent on social platforms than in product reviews: short, public, performative text (microblog posts, comments) rewards wit, whereas a review written to inform a purchase decision is, on average, more literal (Joshi, Bhattacharyya, and Carman 2017). Within reviews, sarcasm concentrates in the tails of the rating distribution and especially in negative experiences, where irony is a common rhetorical strategy for expressing disappointment (Riloff et al. 2013). And sarcasm is bursty: it clusters around product failures, service breakdowns, and brand controversies, the same events that drive firestorms of negative word of mouth (Herhausen et al. 2022). This last point is the one most consequential for marketing: the moments when a firm most needs an accurate read of sentiment are precisely the moments when figurative language is densest.
Why prevalence is so hard to pin down
Three forces conspire. (i) Annotation disagreement: even trained humans agree only moderately on whether an isolated text is sarcastic, so the “ground truth” is itself noisy (see Section 44.5). (ii) Selection in labeled corpora: many benchmark datasets are built by harvesting self-labeled sarcasm (e.g., posts tagged #sarcasm), which over-represents signaled sarcasm and under-represents the deadpan variety that actually fools classifiers (Joshi, Bhattacharyya, and Carman 2017). (iii) Definition drift: studies fold irony, sarcasm, and rhetorical questions together or apart inconsistently. Reported prevalence is therefore a property of the measurement procedure as much as of the population.
44.3 Why Sarcasm Breaks Sentiment Analysis
We now formalize the threat. The goal is to show exactly how figurative language maps into bias in a downstream estimate, so the reader can reason about when it matters.
44.3.1 Sentiment as a measurement model
Let a document \(i\) carry a latent sentiment \(s_i \in \{-1, +1\}\) (negative, positive): the consumer’s true evaluative stance, the quantity of interest. A sentiment classifier produces an estimate \(\hat{s}_i = f(\mathbf{w}_i)\) from the observed token vector \(\mathbf{w}_i\). A literal classifier reads the surface polarity of the words. Introduce a hidden indicator \(z_i \in \{0,1\}\) for whether the document is sarcastic. The core problem is that sarcasm makes the surface polarity an inverted signal of the intended polarity:
\[
\text{surface polarity}(\mathbf{w}_i) =
\begin{cases}
s_i, & z_i = 0 \quad (\text{literal}) \\
-\,s_i, & z_i = 1 \quad (\text{sarcastic, inverting})
\end{cases}
\tag{44.1}
\]
Equation 44.1 is the crux: under sarcasm the most informative surface feature points the wrong way. A literal classifier that achieves accuracy \(1-\epsilon\) on non-sarcastic text and naïvely trusts surface polarity will systematically misclassify sarcastic documents, not merely err at random on them. This is exactly the regime where the choice of sentiment method matters most: benchmarking across methods and datasets shows that transformer-based classifiers substantially outperform lexicons on hard, context-dependent text, but that no method is immune to figurative inversion (Hartmann et al. 2023), and that much consumer sentiment is implicit—carried by discourse and stance rather than polar words—so it eludes surface scoring entirely (Villarroel Ordenes et al. 2017).
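The inversion in Equation 44.1 can be written in one line of code. The sketch below assumes pure inversion and a literal classifier with no other error; `surface_polarity` and the four-row `docs` table are hypothetical names and data introduced only for illustration.

```r
# s: latent sentiment in {-1, +1}; z: sarcasm indicator in {0, 1}.
# Surface polarity equals s when z = 0 and -s when z = 1,
# which is the same as s * (1 - 2 * z).
surface_polarity <- function(s, z) {
  stopifnot(all(s %in% c(-1, 1)), all(z %in% c(0, 1)))
  s * (1 - 2 * z)
}

# Four hypothetical documents covering every (s, z) combination.
docs <- data.frame(s = c(+1, -1, +1, -1), z = c(0, 0, 1, 1))
docs$surface <- surface_polarity(docs$s, docs$z)

# A literal classifier with no other error returns the surface polarity,
# so it is right exactly when the document is sincere.
docs$literal_correct <- docs$surface == docs$s
print(docs)
```

The classifier is wrong on every sarcastic row and right on every sincere one: the error is a deterministic function of \(z_i\), not noise.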
44.3.2 The bias this induces
Consider the simplest downstream use: estimating the population share of positive documents, \(\pi = \Pr(s_i = +1)\), by the sample mean of a literal classifier’s labels. Let \(q = \Pr(z_i = 1)\) be the sarcasm prevalence and, capturing the key empirical regularity, suppose sarcasm is concentrated among negative-intent documents, so that a sarcastic document presents a positive surface. Holding the non-sarcastic error aside, the literal estimator’s expected value is
\[
\mathbb{E}[\hat{\pi}] \;=\; \pi \;+\; q\,\Pr(s_i = -1 \mid z_i = 1)\,\Pr(z_i=1\mid s_i=-1)\big/\!\cdots,
\tag{44.2}
\]
which, stripped to its intuition, says the bias in the estimated positive share is increasing in prevalence \(q\) and in the degree to which sarcasm co-occurs with negative intent. Two consequences follow. First, the bias does not vanish as the sample grows: it is systematic, not sampling error, so more data does not help. Second, because \(q\) is itself larger in negative and high-arousal contexts (Section 44.2), the bias is correlated with the regressors a marketer typically cares about—product, time, campaign—so it contaminates not just levels but comparisons.
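One way to make the bias concrete is to work it out under two simplifying assumptions that go beyond the text: sarcasm inverts polarity exactly as in Equation 44.1, and the literal classifier makes no errors on sincere documents. The shorthand \(a_-\) and \(a_+\) for the sarcasm rates among negative-intent and positive-intent documents is introduced here for illustration only.

```latex
% Assumptions (not from the chapter): pure inversion as in Equation 44.1,
% and no classification error on sincere documents.
% Hypothetical shorthand: a_- = Pr(z_i = 1 | s_i = -1), a_+ = Pr(z_i = 1 | s_i = +1).
\begin{align*}
\mathbb{E}[\hat{\pi}]
  &= \Pr(\text{surface polarity} = +1) \\
  &= \pi\,(1 - a_+) + (1 - \pi)\,a_- , \\
\mathbb{E}[\hat{\pi}] - \pi
  &= (1 - \pi)\,a_- - \pi\,a_+ \\
  &= q\,\bigl[\Pr(s_i = -1 \mid z_i = 1) - \Pr(s_i = +1 \mid z_i = 1)\bigr].
\end{align*}
% Illustration: \pi = 0.55, a_+ = 0.03, a_- = 0.18 gives
% 0.45 \times 0.18 - 0.55 \times 0.03 = 0.081 - 0.0165 = 0.0645.
```

Under these assumptions the bias is the prevalence \(q\) scaled by how lopsided sarcasm is toward negative intent. It is zero only when sarcastic documents are evenly split between positive and negative intent.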
The identification problem in one sentence
Sarcasm is a form of non-classical measurement error in the dependent variable that is correlated with the construct being measured. Classical measurement error in \(y\) inflates standard errors but leaves slope estimates unbiased; sarcasm-induced error is non-classical—correlated with true sentiment and often with covariates—so it biases slopes and can reverse signs. No amount of data fixes a measurement model that is wrong about Equation 44.1.
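A small deterministic sketch of how this contaminates comparisons. It assumes pure inversion, no other classifier error, and two hypothetical products whose numbers are invented for illustration; `literal_share` is a helper defined here, not a function from any package.

```r
# Expected positive share reported by a literal classifier.
# pi_true : true share of positive documents
# a_neg   : Pr(sarcastic | negative intent)
# a_pos   : Pr(sarcastic | positive intent)
literal_share <- function(pi_true, a_neg, a_pos = 0) {
  pi_true * (1 - a_pos) + (1 - pi_true) * a_neg
}

# Hypothetical products: B is truly worse than A and is mid-controversy,
# so sarcasm among its negative reviews is much more common.
products <- data.frame(
  product = c("A", "B"),
  pi_true = c(0.50, 0.45),
  a_neg   = c(0.05, 0.30)
)
products$pi_literal <- literal_share(products$pi_true, products$a_neg)

true_gap    <- products$pi_true[2]    - products$pi_true[1]     # B minus A
literal_gap <- products$pi_literal[2] - products$pi_literal[1]  # B minus A
cat(sprintf("True gap (B - A):    %+.3f\n", true_gap))
#> True gap (B - A):    -0.050
cat(sprintf("Literal gap (B - A): %+.3f\n", literal_gap))
#> Literal gap (B - A): +0.090
```

The literal pipeline ranks the worse product higher. Because the sarcasm rate moves with the covariate (here, product), the error does not cancel in the difference.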
44.3.3 A worked simulation
The following seeded example makes the bias visible. We generate documents with known true sentiment, let a fraction be sarcastic (concentrated in negative documents), and compare a literal classifier’s read of average sentiment against the truth.
    set.seed(7)
    n <- 20000

    # True sentiment: 55% positive, 45% negative.
    true_pos <- rbinom(n, 1, 0.55)  # 1 = positive, 0 = negative

    # Sarcasm prevalence is HIGHER among negative-intent documents.
    p_sarc <- ifelse(true_pos == 1, 0.03, 0.18)
    sarcastic <- rbinom(n, 1, p_sarc)

    # A literal classifier reads SURFACE polarity. For sincere docs the surface
    # matches intent (with small error); for sarcastic docs it inverts.
    base_err <- 0.05
    surface_pos <- ifelse(sarcastic == 1,
                          1 - true_pos,  # inversion
                          ifelse(runif(n) < base_err, 1 - true_pos, true_pos))

    truth <- mean(true_pos)
    literal <- mean(surface_pos)
    cat(sprintf("True positive share: %.3f\n", truth))
    #> True positive share: 0.550
    cat(sprintf("Literal estimate: %.3f\n", literal))
    #> Literal estimate: 0.607
    cat(sprintf("Bias (literal - truth): %+.3f\n", literal - truth))
    #> Bias (literal - truth): +0.057

    # The bias is concentrated where sarcasm is: among truly-negative documents.
    neg <- true_pos == 0
    cat(sprintf("Share of TRUE-negative docs the literal model calls positive: %.3f\n",
                mean(surface_pos[neg] == 1)))
    #> Share of TRUE-negative docs the literal model calls positive: 0.224
The literal classifier overstates the positive share, and the error lives almost entirely in the negative tail—exactly the region a brand monitors most closely. A 6-to-1 difference in sarcasm rates between negative and positive documents is enough to move the headline number by several points and to corrupt any regression that conditions on document polarity.
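As a cross-check on the simulation above, the same quantities can be computed in closed form from its four parameters. This is a sketch under the simulation's own assumptions (pure inversion for sarcastic documents, a symmetric error rate on sincere ones); the variable names are introduced here for illustration.

```r
pi_true  <- 0.55   # true positive share
a_pos    <- 0.03   # Pr(sarcastic | positive intent)
a_neg    <- 0.18   # Pr(sarcastic | negative intent)
base_err <- 0.05   # error rate on sincere documents

# Pr(surface reads positive | intent):
p_pos_given_pos <- (1 - a_pos) * (1 - base_err)    # sincere and read correctly
p_pos_given_neg <- a_neg + (1 - a_neg) * base_err  # sarcastic, or sincere and misread

expected_literal <- pi_true * p_pos_given_pos + (1 - pi_true) * p_pos_given_neg
expected_bias <- expected_literal - pi_true

cat(sprintf("Expected literal estimate: %.4f\n", expected_literal))
#> Expected literal estimate: 0.6063
cat(sprintf("Expected bias: %+.4f\n", expected_bias))
#> Expected bias: +0.0563
cat(sprintf("Pr(called positive | truly negative): %.3f\n", p_pos_given_neg))
#> Pr(called positive | truly negative): 0.221
```

The simulated values (0.607, +0.057, 0.224) sit within sampling error of these expectations, which confirms that the gap is a property of the measurement model rather than of the particular draw.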
44.4 Detecting Sarcasm
Detection methods fall on a ladder of increasing context. The pedagogical point is that each rung adds information that Equation 44.1 shows is necessary, and each adds its own assumptions and failure modes.
flowchart TB
A["Lexical / surface cues<br/>(polarity contrast, punctuation,<br/>interjections, emoji)"] --> B
B["Incongruity features<br/>(positive phrase in<br/>negative situation)"] --> C
C["Sequence models<br/>(word order, negation scope,<br/>RNN / attention)"] --> D
D["Context-aware neural models<br/>(thread, author history,<br/>pretrained transformers)"] --> E
E["Multimodal / behavioral<br/>(rating-text mismatch,<br/>conversational context)"]
Figure 44.1: A ladder of sarcasm-detection approaches, from context-free lexical cues to context-rich neural models. Each rung adds information the rung below discards; each adds assumptions and new failure modes.
44.4.1 Surface and incongruity features
The earliest and most interpretable approach engineers features that proxy for the internal contradiction sarcasm encodes: a positive sentiment phrase adjacent to a negative situation phrase, intensifiers, scare quotes, ellipses, exclamation patterns, and emoji that clash with the surrounding text (Joshi, Sharma, and Bhattacharyya 2015; Riloff et al. 2013). The estimator is a standard supervised classifier (logistic regression, SVM, or gradient boosting) on these features; the assumption that breaks identification is stationarity of cues—sarcasm markers drift across communities and time, so a model trained on one platform’s conventions degrades on another. Interpretable feature models are nonetheless valuable in marketing precisely because the analyst can audit which cue fired, a property opaque neural models lack.
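A minimal sketch of what such hand-built cues might look like in base R. The cue lists, the function name `cue_features`, and the example texts are all hypothetical; substring matching is deliberately crude (for instance, "broke" would also match "broken") and is not a validated feature set.

```r
# Hypothetical cue lists, for illustration only.
pos_phrases    <- c("love", "great", "fantastic", "thanks")
neg_situations <- c("on hold", "broke", "crashes", "frays", "waiting")

cue_features <- function(text) {
  x <- tolower(text)
  # TRUE if any pattern occurs as a literal substring of x.
  has_any <- function(patterns) {
    any(vapply(patterns, grepl, logical(1), x = x, fixed = TRUE))
  }
  pos_hit <- has_any(pos_phrases)
  neg_hit <- has_any(neg_situations)
  c(
    pos_phrase    = pos_hit,
    neg_situation = neg_hit,
    incongruity   = pos_hit && neg_hit,          # positive phrase + negative situation
    scare_quotes  = grepl("'[a-z]+'", x),        # e.g. 'premium'
    ellipsis      = grepl("...", x, fixed = TRUE),
    exclaim       = grepl("!", x, fixed = TRUE)
  )
}

texts <- c(
  "Great, another 'premium' cable that frays in a month.",
  "Love waiting on hold for an hour!",
  "Solid build, works as described."
)
# One row per text, one column per cue.
features <- do.call(rbind, lapply(texts, cue_features))
print(features)
```

The first two texts fire the incongruity cue and the third fires nothing. Each column can be read off directly, which is the auditability argument for feature models; the resulting matrix would then feed whatever supervised classifier the analyst prefers.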
A useful and lightweight signal specific to reviews is the rating–text mismatch: a five-star rating attached to scathing prose, or a one-star rating attached to glowing prose, is a strong prior for irony or for a mis-clicked rating. This is the “behavioral” rung of Figure 44.1, and it needs no sarcasm model at all: a star rating and a crude text-polarity score are enough.
    library(dplyr)
    library(stringr)

    # Rows 1 and 6: low stars, glowing surface. Row 2: high stars, scathing surface.
    reviews <- tibble::tibble(
      stars = c(1, 5, 5, 4, 1, 1),
      text = c(
        "Absolutely fantastic, broke on day two. Exactly what I wanted.",
        "Worst thing ever. I cannot live without it now.",
        "Solid build, works as described, would buy again.",
        "Does the job, no complaints.",
        "Terrible. Stopped charging after a week and support ignored me.",
        "Great, another 'premium' cable that frays in a month."
      )
    )

    # Tiny illustrative polarity lexicon (NOT production-grade).
    pos <- c("fantastic", "great", "solid", "works", "love", "premium", "wanted", "again")
    neg <- c("broke", "worst", "terrible", "stopped", "frays", "ignored", "cannot")

    # Surface polarity: (# positive tokens) - (# negative tokens).
    score_text <- function(x) {
      toks <- str_split(str_to_lower(x), "\\W+")[[1]]
      sum(toks %in% pos) - sum(toks %in% neg)
    }

    reviews <- reviews |>
      mutate(
        text_polarity = vapply(text, score_text, numeric(1)),
        star_sign = sign(stars - 3),  # +1 high, -1 low
        text_sign = sign(text_polarity),
        mismatch = star_sign != 0 & text_sign != 0 & star_sign != text_sign
      )

    reviews |> select(stars, text_polarity, mismatch)
    #> # A tibble: 6 × 3
    #>   stars text_polarity mismatch
    #>   <dbl>         <dbl> <lgl>   
    #> 1     1             1 TRUE    
    #> 2     5            -2 TRUE    
    #> 3     5             3 FALSE   
    #> 4     4             0 FALSE   
    #> 5     1            -3 FALSE   
    #> 6     1             1 TRUE    
Rows 1, 2, and 6 are flagged: their star rating and surface polarity point in opposite directions, which makes them candidates for irony (or for a sloppy rating). Mismatch is a screening device, not a classifier: it has high recall for signaled cases and many false positives (a genuinely mixed review also mismatches), so it routes documents to review rather than relabeling them automatically.
44.4.2 Sequence and context-aware models
Because sarcasm depends on word order (negation scope, the placement of the incongruous phrase) and on context beyond the sentence, the modern literature moves to sequence models—recurrent networks and, dominantly, attention-based transformers that read the whole utterance and, where available, its surrounding thread, the author’s history, and the conversational target (Tay et al. 2018). These models can represent the positive-surface/negative-context incongruity that defines sarcasm, and they consistently outperform feature-engineered baselines on benchmark corpora (Tay et al. 2018; Joshi, Bhattacharyya, and Carman 2017). Marketing has embraced deep text and image models for adjacent tasks—recovering brand perceptions from consumer images (Liu, Dzyabura, and Mizik 2020) and structuring unstructured review text (Büschken and Allenby 2016; Netzer, Lattin, and Srinivasan 2008)—so the tooling is familiar.
Two cautions temper the optimism. First, context is often unavailable at scoring time: a firm’s review corpus may lack the author history or thread structure that makes a benchmark tractable, so reported accuracies do not transfer. Second, the benchmarks themselves are built largely from self-labeled or distantly-supervised data (Section 44.2), which means the models learn to detect signaled sarcasm and remain weak on the deadpan cases that matter most for measurement. The estimator is strong; the identifying data are weak.
44.5 Annotation: The Ground-Truth Problem
Every supervised detector rests on labels, and sarcasm labels are unusually fragile. Because sarcasm is recognized rather than decoded, two competent annotators reading the same isolated text frequently disagree. Quantifying that disagreement is a prerequisite for trusting any downstream number.
The standard instrument is Cohen’s \(\kappa\), the chance-corrected agreement between two raters. With observed agreement \(p_o\) and chance agreement \(p_e\),
\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\tag{44.3}
\]
where \(p_e = \sum_k \hat p_{1k}\,\hat p_{2k}\) sums the product of the two raters’ marginal rates over label categories \(k\) (Cohen 1960). \(\kappa = 1\) is perfect agreement, \(0\) is chance; conventional (and contested) thresholds read \(0.41\)–\(0.60\) as “moderate” and \(0.61\)–\(0.80\) as “substantial” (Landis and Koch 1977). Sarcasm annotation on isolated short texts routinely lands in the low-to-moderate band, and rises only when annotators are given conversational context (Wallace et al. 2014). The implication is structural: a detector cannot be more reliable than its labels, so a benchmark “accuracy” of 0.85 against ground truth that itself carries \(\kappa \approx 0.6\) is reporting agreement with a noisy oracle, not with the truth. This human-validation step is not specific to sarcasm; it is the discipline the entire text-as-data and unstructured-data program insists on before any model output is treated as a measurement (Humphreys and Wang 2018; Berger et al. 2020; Balducci and Marinova 2018). When the construct can be captured by a validated dictionary, as Rocklage, Rucker, and Nordgren (2018) do for emotionality, extremity, and valence with the Evaluative Lexicon, the measure inherits that instrument’s reliability; figurative inversion is precisely the case where no fixed dictionary suffices and human-coded context becomes indispensable.
    # Cohen's kappa from two vectors of labels.
    cohen_kappa <- function(r1, r2) {
      tab <- table(r1, r2)
      n <- sum(tab)
      p_o <- sum(diag(tab)) / n                      # observed agreement
      p_e <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
      (p_o - p_e) / (1 - p_e)
    }

    set.seed(3)
    # 200 texts; "deadpan" sarcasm where raters genuinely disagree.
    truth <- rbinom(200, 1, 0.25)  # latent sarcasm

    # Each rater detects signaled sarcasm well, deadpan poorly.
    detect <- function(t) ifelse(t == 1,
                                 rbinom(length(t), 1, 0.65),  # catch ~65% when sarcastic
                                 rbinom(length(t), 1, 0.08))  # 8% false alarm
    rater1 <- detect(truth)
    rater2 <- detect(truth)
    cat(sprintf("Cohen's kappa between two annotators: %.2f\n",
                cohen_kappa(rater1, rater2)))
    #> Cohen's kappa between two annotators: 0.37
The modest \(\kappa\) is not a flaw in the simulated raters; it is the signature of a construct that is inherently underdetermined by text alone. Honest reporting carries this number forward into the error bars on any sentiment estimate built on top.
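A worked instance of the \(\kappa\) arithmetic shows how far raw agreement can overstate reliability when one label is rare. The 2 × 2 table below is hypothetical, chosen so that the two raters agree on 85% of texts.

```r
# Hypothetical agreement table for 200 texts (rows: rater 1, cols: rater 2).
tab <- matrix(c(150, 15,
                 15, 20),
              nrow = 2, byrow = TRUE,
              dimnames = list(rater1 = c("sincere", "sarcastic"),
                              rater2 = c("sincere", "sarcastic")))

n   <- sum(tab)
p_o <- sum(diag(tab)) / n                      # observed agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
kappa_hat <- (p_o - p_e) / (1 - p_e)

cat(sprintf("Raw agreement:    %.2f\n", p_o))
#> Raw agreement:    0.85
cat(sprintf("Chance agreement: %.3f\n", p_e))
#> Chance agreement: 0.711
cat(sprintf("Cohen's kappa:    %.2f\n", kappa_hat))
#> Cohen's kappa:    0.48
```

Because most texts are sincere, two raters who labeled at random with these marginals would already agree about 71% of the time. The 85% raw figure therefore corresponds to only moderate chance-corrected agreement.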
44.6 What to Do About It
The practical posture this chapter recommends is bound, don’t pretend. Three strategies, in rough order of cost and rigor.
Quantify the exposure. Before deploying a sentiment pipeline, estimate sarcasm prevalence \(q\) in a hand-labeled sample of the target corpus (not a benchmark), and propagate it into Equation 44.2 to bound how far the headline number can be off. If \(q\) is small and roughly balanced across the comparisons of interest, sarcasm is a footnote; if \(q\) is large and concentrated in the negative tail, it is a threat to the central claim. This costs a few hundred labels and is the single highest-value step.
Screen and route, don’t auto-correct. Use cheap signals such as rating–text mismatch and incongruity features to flag suspect documents and route them to human review or to exclusion, rather than trusting a classifier to flip their labels. Screening trades precision for recall: it over-flags, but it keeps the analyst in the loop, which matters because a wrongly “corrected” label is worse than a flagged one.
Use context when you have it, and report when you don’t. Where author history, thread structure, or conversational context exist, a context-aware model is worth its cost; where they do not, say so, and treat the resulting estimate as a bound. The worst outcome is a confident sentiment number whose figurative-language exposure was never measured.
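The first strategy can be sketched in a few lines. The audit counts below are hypothetical, and the bound assumes pure inversion with no other classifier error, so that in the worst case every sarcastic document flips the same way and the positive share is off by at most \(q\). `binom.test` is base R's exact binomial test; its `conf.int` element is the Clopper-Pearson interval.

```r
# Hypothetical audit: 300 hand-labeled documents from the target corpus,
# 24 of them judged sarcastic.
n_labeled   <- 300
n_sarcastic <- 24

q_hat <- n_sarcastic / n_labeled
# Exact (Clopper-Pearson) 95% interval for the prevalence q.
q_ci <- binom.test(n_sarcastic, n_labeled)$conf.int

# Worst-case bias in the literal positive share: at most q in absolute value.
# Using the upper end of the interval gives a conservative bound.
bias_bound <- c(lower = -q_ci[2], upper = q_ci[2])

cat(sprintf("Estimated prevalence: %.3f (95%% CI %.3f to %.3f)\n",
            q_hat, q_ci[1], q_ci[2]))
cat(sprintf("Headline positive share could be off by up to +/- %.3f\n",
            bias_bound[["upper"]]))
```

If the audit also records each sarcastic document's intended polarity, the bound tightens from \(q\) to the net imbalance between negative-intent and positive-intent sarcasm. Running the audit separately for each group being compared shows whether the exposure is balanced across them.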
Finally, recognize the scope conditions. Sarcasm matters most for fine-grained, document-level valence in negative, high-arousal, public text—brand firestorms, service failures, controversy (Herhausen et al. 2022; Schweidel and Moe 2014). It matters least for aggregate, long-horizon signals where idiosyncratic figurative error averages out and the quantity of interest is a trend, not a label (Tirunillai and Tellis 2012). Knowing which regime one is in is more important than owning the best classifier.
44.7 Key Takeaways
Sarcasm is a measurement problem, not a vocabulary problem. Its signature is a valence inversion between surface and intent (Equation 44.1); bag-of-words methods discard exactly the contextual and structural information needed to catch it.
The induced error is non-classical: correlated with true sentiment and often with covariates, so it biases comparisons and can reverse signs, and more data does not fix it (Equation 44.2).
Prevalence is heterogeneous and hard to measure, higher on social platforms than in reviews, concentrated in negative tails, and bursty around the failures firms most want to monitor.
Detection improves with context (Figure 44.1), but benchmarks rest on self-labeled data and on labels with only moderate inter-annotator agreement (Equation 44.3), so reported accuracies overstate field performance.
The defensible posture is to quantify exposure, screen rather than auto-correct, and report uncertainty—bounding the bias is usually worth more than chasing a brittle classifier.
44.8 Further Reading
The text-analytics foundations this chapter builds on are developed elsewhere in the book: extracting structure and meaning from consumer language (Netzer, Lattin, and Srinivasan 2008; Büschken and Allenby 2016), the relationship between word-of-mouth valence and sales (Chevalier and Goolsbee 2003), firm response to negative posts and firestorms (Herhausen et al. 2022; Proserpio and Zervas 2017), and social listening as a measurement enterprise (Schweidel and Moe 2014). Readers should pair this chapter with the broader treatment of user-generated content and sentiment so that figurative language is handled as one—important—source of measurement error among several.
Balducci, Bitty, and Detelina Marinova. 2018. “Unstructured Data in Marketing.” Journal of the Academy of Marketing Science 46 (4): 557–90. https://doi.org/10.1007/s11747-018-0581-x.
Berger, Jonah, Ashlee Humphreys, Stephan Ludwig, Wendy W. Moe, Oded Netzer, and David A. Schweidel. 2020. “Uniting the Tribes: Using Text for Marketing Insight.” Journal of Marketing 84 (1): 1–25. https://doi.org/10.1177/0022242919873106.
Büschken, Joachim, and Greg M. Allenby. 2016. “Sentence-Based Text Analysis for Customer Reviews.” Marketing Science 35 (6): 953–75.
Chevalier, Judith, and Austan Goolsbee. 2003. “Measuring Prices and Price Competition Online: Amazon.com and BarnesandNoble.com.” Quantitative Marketing and Economics 1 (2): 203–22.
Cohen, Jacob. 1960. “A Coefficient of Agreement for Nominal Scales.” Educational and Psychological Measurement 20 (1): 37–46. https://doi.org/10.1177/001316446002000104.
Hartmann, Jochen, Mark Heitmann, Christian Siebert, and Christina Schamp. 2023. “More Than a Feeling: Accuracy and Application of Sentiment Analysis.” International Journal of Research in Marketing 40 (1): 75–87. https://doi.org/10.1016/j.ijresmar.2022.05.005.
Herhausen, Dennis, Lauren Grewal, Krista Hill Cummings, Anne L. Roggeveen, Francisco Villarroel Ordenes, and Dhruv Grewal. 2022. “Complaint Deescalation Strategies on Social Media.” Journal of Marketing, August, 002224292211199. https://doi.org/10.1177/00222429221119977.
Humphreys, Ashlee, and Rebecca Jen-Hui Wang. 2018. “Automated Text Analysis for Consumer Research.” Journal of Consumer Research 44 (6): 1274–1306. https://doi.org/10.1093/jcr/ucx104.
Joshi, Aditya, Pushpak Bhattacharyya, and Mark J. Carman. 2017. “Automatic Sarcasm Detection: A Survey.” ACM Computing Surveys 50 (5): 1–22. https://doi.org/10.1145/3124420.
Joshi, Aditya, Vinita Sharma, and Pushpak Bhattacharyya. 2015. “Harnessing Context Incongruity for Sarcasm Detection.” In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 757–62. https://doi.org/10.3115/v1/p15-2124.
Landis, J. Richard, and Gary G. Koch. 1977. “The Measurement of Observer Agreement for Categorical Data.” Biometrics 33 (1): 159–74. https://doi.org/10.2307/2529310.
Liu, Liu, Daria Dzyabura, and Natalie Mizik. 2020. “Visual Listening In: Extracting Brand Image Portrayed on Social Media.” Marketing Science 39 (4): 669–86. https://doi.org/10.1287/mksc.2020.1226.
Netzer, Oded, James M. Lattin, and Vikram Srinivasan. 2008. “A Hidden Markov Model of Customer Relationship Dynamics.” Marketing Science 27 (2): 185–204.
Proserpio, Davide, and Georgios Zervas. 2017. “Online Reputation Management: Estimating the Impact of Management Responses on Consumer Reviews.” Marketing Science 36 (5): 645–65. https://doi.org/10.1287/mksc.2017.1043.
Riloff, Ellen, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, and Ruihong Huang. 2013. “Sarcasm as Contrast Between a Positive Sentiment and Negative Situation.” In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), 704–14. https://doi.org/10.18653/v1/d13-1066.
Rocklage, Matthew D., Derek D. Rucker, and Loran F. Nordgren. 2018. “The Evaluative Lexicon 2.0: The Measurement of Emotionality, Extremity, and Valence in Language.” Behavior Research Methods 50 (4): 1327–44. https://doi.org/10.3758/s13428-017-0975-6.
Schweidel, David A., and Wendy W. Moe. 2014. “Listening in on Social Media: A Joint Model of Sentiment and Venue Format Choice.” Journal of Marketing Research 51 (4): 387–402.
Tay, Yi, Anh Tuan Luu, Siu Cheung Hui, and Jian Su. 2018. “Reasoning with Sarcasm by Reading in-Between.” In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1010–20. https://doi.org/10.18653/v1/p18-1093.
Tirunillai, Seshadri, and Gerard J. Tellis. 2012. “Does Chatter Really Matter? Dynamics of User-Generated Content and Stock Performance.” Marketing Science 31 (2): 198–215. https://doi.org/10.1287/mksc.1110.0682.
Villarroel Ordenes, Francisco, Stephan Ludwig, Ko de Ruyter, Dhruv Grewal, and Martin Wetzels. 2017. “Unveiling What Is Written in the Stars: Analyzing Explicit, Implicit, and Discourse Patterns of Sentiment in Social Media.” Journal of Consumer Research 43 (6): 875–94. https://doi.org/10.1093/jcr/ucw070.
Wallace, Byron C., Do Kook Choe, Laura Kertz, and Eugene Charniak. 2014. “Humans Require Context to Infer Ironic Intent (so Computers Probably Do, Too).” In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 512–16. https://doi.org/10.3115/v1/p14-2084.
# Sarcasm and Figurative Language {#sec-sarcasm}Consumers do not write literally. A one-star review that reads "Absolutelyfantastic — broke on day two, exactly what I wanted" is unambiguous to a humanand catastrophic for a sentiment classifier that scores tokens at face value.*Sarcasm* and the broader family of *figurative language*—irony, hyperbole,rhetorical questions, understatement—are pervasive in the consumer text thatmarketing researchers now mine at scale: product reviews, social-media posts,forum threads, and customer-service transcripts. This chapter treats figurativelanguage not as a curiosity but as a *measurement problem*. In the terms of@sec-construct-vs-variable, sarcasm is not itself a construct; it is a threat tothe construct validity of sentiment. The latent construct is the consumer's truevalence, the text is its observable proxy, and figurative language is asystematic distortion in the mapping between them. When the surface formof a sentence systematically inverts or distorts its intended meaning, anypipeline that maps words to valence inherits a bias whose sign and magnitudedepend on the prevalence and detectability of the figure. The chapter's purposeis to make that bias precise, to show where it does and does not threaten theinferences marketers draw from text, and to give the reader a defensible,reproducible way to handle it.The stakes are concrete. Firms increasingly treat the valence of online word ofmouth as a real-time signal of product quality, brand health, and demand[@tirunillai2012; @netzer2008hidden], and review valence is correlated with sales[@chevalier2003measuring]. If sarcasm is more common in negative experiences thanpositive ones—which the data suggest—then a naïve sentiment model does not merelyadd noise; it adds *signed* error that is correlated with the very construct thefirm wants to measure. The result is an attenuated or even sign-flipped estimateof how sentiment relates to outcomes. 
Honest text analytics requires knowing whenthat happens.The chapter is deliberately scoped and candid about its limits. Sarcasm detectionis an open research problem; no method, human or machine, achieves high reliabilityon isolated short texts, because sarcasm is fundamentally *contextual*—it dependson shared knowledge, prosody, and speaker intent that the text alone may not carry.We therefore proceed from intuition to formalism: first defining figurativelanguage and the construct of verbal irony, then modeling sarcasm as a hiddenvariable that corrupts sentiment measurement, then surveying detection methods withtheir estimators and failure modes, and finally giving practical guidance—often,the right move is to *quantify and bound* the bias rather than to chase a brittleclassifier.## What Figurative Language IsFigurative language is any use of words whose intended meaning departs from theirliteral, compositional meaning. The departure is not an error: speaker and listenerboth understand that the literal reading is to be overridden, and the gap betweenliteral and intended meaning is itself the vehicle of communication—it conveysattitude, humor, social bonding, or emphasis that the literal statement could not.> **Verbal irony** is the use of an utterance whose intended meaning is opposed to,> or markedly different from, its literal meaning, with the speaker intending the> listener to recognize the opposition. **Sarcasm** is verbal irony deployed with a> critical or contemptuous edge—irony aimed, usually, at a target.The distinction matters less for measurement than the shared mechanism: a *valencereversal* between surface and intent. The constructs relevant to consumer text forma small family, summarized in @tbl-figures.| Figure | Surface→intent relation | Consumer-text example | Threat to sentiment ||---|---|---|---|| Verbal irony / sarcasm | Inversion (often) | "Great, another update that breaks the app." 
| Severe: flips sign || Hyperbole | Amplification | "Worst purchase in the history of mankind." | Moderate: inflates magnitude || Understatement (litotes) | Attenuation | "Not exactly a bargain." | Moderate: hides magnitude || Rhetorical question | Assertion as question | "Who designed this, a toddler?" | Mild–moderate || Metaphor / idiom | Non-literal mapping | "This blender is a tank." | Mild: lexicon miss |: A taxonomy of figurative devices in consumer text, ordered roughly by how badly each corrupts a token-level sentiment score. {#tbl-figures}Two properties make sarcasm the hardest case. First, it is **context-dependent**:the same sentence ("Love waiting on hold for an hour") is sincere or sarcasticdepending on world knowledge the reader supplies. Second, it is **incongruity-based**:sarcasm typically juxtaposes a positive surface against a negative situation (orvice versa), so its signature is an *internal contradiction* rather than anyparticular vocabulary [@riloff2013sarcasm; @joshi2015incongruity].Both properties defeat bag-of-words methods, which discard exactly the contextualand structural information sarcasm encodes.## Prevalence in Consumer Text {#sec-prevalence}How common is sarcasm in the corpora marketers actually analyze? The honest answeris that prevalence is *unknown with precision* and *heterogeneous across platforms*,and any single number should be treated with suspicion. Reported rates vary by anorder of magnitude depending on the platform, the labeling protocol, and—crucially—how sarcasm is operationalized [@joshi2017sarcasm].A few regularities are robust enough to state, with the appropriate hedges. Sarcasmis **more frequent on social platforms than in product reviews**: short, public,performative text (microblog posts, comments) rewards wit, whereas a review writtento inform a purchase decision is, on average, more literal[@joshi2017sarcasm]. 
Within reviews, sarcasm concentrates in the **tails of the rating distribution** and
especially in **negative** experiences, where irony is a common rhetorical strategy
for expressing disappointment [@riloff2013sarcasm]. And
sarcasm is **bursty**: it clusters around product failures, service breakdowns, and
brand controversies, the same events that drive firestorms of negative word of mouth
[@herhausen2022]. This last point is the one most consequential for marketing: the
moments when a firm most needs an accurate read of sentiment are precisely the
moments when figurative language is densest.

::: {.callout-note}
## Why prevalence is so hard to pin down

Three forces conspire. (i) *Annotation disagreement*: even trained humans agree only
moderately on whether an isolated text is sarcastic, so the "ground truth" is itself
noisy (see @sec-annotation). (ii) *Selection in labeled corpora*: many benchmark
datasets are built by harvesting self-labeled sarcasm (e.g., posts tagged
`#sarcasm`), which over-represents *signaled* sarcasm and under-represents the
deadpan variety that actually fools classifiers [@joshi2017sarcasm].
(iii) *Definition drift*: studies fold irony, sarcasm, and rhetorical questions
together or apart inconsistently. Reported prevalence is therefore a property of the
measurement procedure as much as of the population.
:::

## Why Sarcasm Breaks Sentiment Analysis

We now formalize the threat. The goal is to show *exactly* how figurative language
maps into bias in a downstream estimate, so the reader can reason about when it
matters.

### Sentiment as a measurement model

Let a document $i$ carry a latent sentiment $s_i \in \{-1, +1\}$ (negative,
positive): the consumer's true evaluative stance, the quantity of interest. A
sentiment classifier produces an estimate $\hat{s}_i = f(\mathbf{w}_i)$ from the
observed token vector $\mathbf{w}_i$. A *literal* classifier reads the surface
polarity of the words. Introduce a hidden indicator $z_i \in \{0,1\}$ for whether
the document is sarcastic.
The core problem is that sarcasm makes the surface
polarity an inverted signal of the intended polarity:

$$
\text{surface polarity}(\mathbf{w}_i) =
\begin{cases}
s_i, & z_i = 0 \quad (\text{literal}) \\
-\,s_i, & z_i = 1 \quad (\text{sarcastic, inverting})
\end{cases}
$$ {#eq-inversion}

@eq-inversion is the crux: under sarcasm the *most informative* surface feature
points the wrong way. A literal classifier that achieves accuracy $1-\epsilon$ on
non-sarcastic text and naïvely trusts surface polarity will *systematically
misclassify* sarcastic documents, not merely err at random on them. This is exactly
the regime where the choice of sentiment method matters most: benchmarking across
methods and datasets shows that transformer-based classifiers substantially outperform
lexicons on hard, context-dependent text, but that *no* method is immune to figurative
inversion [@hartmann2023feeling], and that much consumer sentiment is *implicit*
(carried by discourse and stance rather than polar words), so it eludes surface
scoring entirely [@villarroelordenes2017stars].

### The bias this induces

Consider the simplest downstream use: estimating the population share of positive
documents, $\pi = \Pr(s_i = +1)$, by the sample mean of a literal classifier's
labels. Let $q = \Pr(z_i = 1)$ be the sarcasm prevalence and, capturing the key
empirical regularity, suppose sarcasm is concentrated among negative-intent
documents, so that a sarcastic document typically presents a *positive* surface.
Holding the non-sarcastic error aside, @eq-inversion implies that a document is
labeled positive exactly when it is sincere and positive or sarcastic and negative,
so the literal estimator's expected value is

$$
\mathbb{E}[\hat{\pi}]
\;=\; \Pr(s_i = +1,\, z_i = 0) + \Pr(s_i = -1,\, z_i = 1)
\;=\; \pi \;+\; q\,\bigl[\,2\Pr(s_i = -1 \mid z_i = 1) - 1\,\bigr],
$$ {#eq-bias}

which, stripped to its intuition, says the bias in the estimated positive share is
**increasing in prevalence $q$ and in the degree to which sarcasm co-occurs with
negative intent** (the bias is positive whenever more than half of sarcastic
documents are negative in intent). Two consequences follow.
First, the bias does not vanish as the
sample grows: it is *systematic*, not sampling error, so more data does not help.
Second, because $q$ is itself larger in negative and high-arousal contexts
(@sec-prevalence), the bias is *correlated with the regressors* a marketer typically
cares about (product, time, campaign), so it contaminates not just levels but
*comparisons*.

::: {.callout-important}
## The identification problem in one sentence

Sarcasm is a form of **non-classical measurement error in the dependent variable
that is correlated with the construct being measured**. Classical measurement error
in $y$ inflates standard errors but leaves slope estimates unbiased; sarcasm-induced
error is non-classical (correlated with true sentiment and often with covariates), so
it biases slopes and can reverse signs. No amount of data fixes a measurement model
that is wrong about @eq-inversion.
:::

### A worked simulation

The following seeded example makes the bias visible. We generate documents with
known true sentiment, let a fraction be sarcastic (concentrated in negative
documents), and compare a literal classifier's read of average sentiment against the
truth.

```{r}
#| label: sarcasm-bias-sim
#| message: false
#| warning: false
set.seed(7)
n <- 20000

# True sentiment: 55% positive, 45% negative.
true_pos <- rbinom(n, 1, 0.55)  # 1 = positive, 0 = negative

# Sarcasm prevalence is HIGHER among negative-intent documents.
p_sarc <- ifelse(true_pos == 1, 0.03, 0.18)
sarcastic <- rbinom(n, 1, p_sarc)

# A literal classifier reads SURFACE polarity.
# For sincere docs the surface
# matches intent (with small error); for sarcastic docs it inverts.
base_err <- 0.05
surface_pos <- ifelse(
  sarcastic == 1,
  1 - true_pos,  # inversion
  ifelse(runif(n) < base_err, 1 - true_pos, true_pos)
)

truth   <- mean(true_pos)
literal <- mean(surface_pos)
cat(sprintf("True positive share:    %.3f\n", truth))
cat(sprintf("Literal estimate:       %.3f\n", literal))
cat(sprintf("Bias (literal - truth): %+.3f\n", literal - truth))

# The bias is concentrated where sarcasm is: among truly-negative documents.
neg <- true_pos == 0
cat(sprintf("Share of TRUE-negative docs the literal model calls positive: %.3f\n",
            mean(surface_pos[neg] == 1)))
```

The literal classifier overstates the positive share, and the error lives almost
entirely in the negative tail, exactly the region a brand monitors most closely. A
6-to-1 difference in sarcasm rates between negative and positive documents is enough
to move the headline number by several points and to corrupt any regression that
conditions on document polarity.

## Detecting Sarcasm

Detection methods fall on a ladder of increasing context. The pedagogical point is
that each rung adds information that @eq-inversion shows is necessary, and each adds
its own assumptions and failure modes.

```{mermaid}
%%| label: fig-detection-ladder
%%| fig-cap: "A ladder of sarcasm-detection approaches, from context-free lexical cues to context-rich neural models. Each rung adds information the rung below discards; each adds assumptions and new failure modes."
flowchart TB
  A["Lexical / surface cues<br/>(polarity contrast, punctuation,<br/>interjections, emoji)"] --> B
  B["Incongruity features<br/>(positive phrase in<br/>negative situation)"] --> C
  C["Sequence models<br/>(word order, negation scope,<br/>RNN / attention)"] --> D
  D["Context-aware neural models<br/>(thread, author history,<br/>pretrained transformers)"] --> E
  E["Multimodal / behavioral<br/>(rating-text mismatch,<br/>conversational context)"]
```

### Surface and incongruity features

The earliest and most interpretable approach engineers features that proxy for the
internal contradiction sarcasm encodes: a positive sentiment phrase adjacent to a
negative situation phrase, intensifiers, scare quotes, ellipses, exclamation
patterns, and emoji that clash with the surrounding text
[@joshi2015incongruity; @riloff2013sarcasm].
The estimator is a standard supervised classifier (logistic regression, SVM, or
gradient boosting) on these features; the assumption that breaks identification is
**stationarity of cues**: sarcasm markers drift across communities and time, so a
model trained on one platform's conventions degrades on another. Interpretable
feature models are nonetheless valuable in marketing precisely because the analyst
can audit *which* cue fired, a property opaque neural models lack.

A useful and lightweight signal specific to reviews is the **rating–text mismatch**:
a five-star rating attached to scathing prose, or a one-star rating attached to
glowing prose, is a strong prior for irony or for a mis-clicked rating. This is the
"behavioral" rung at the bottom of @fig-detection-ladder, and it requires no sarcasm
model at all: only the star rating and a crude polarity score for the text.

```{r}
#| label: mismatch-flag
#| message: false
#| warning: false
set.seed(11)
library(dplyr)
library(stringr)

# Toy data. Rows 1, 2, and 6 pair a rating with prose whose surface polarity
# points the other way, so they are the rows the screen should flag.
reviews <- tibble::tibble(
  stars = c(1, 5, 5, 4, 1, 1),
  text = c(
    "Absolutely fantastic, broke on day two. Exactly what I wanted.",
    "Worst thing ever. I cannot live without it now.",
    "Solid build, works as described, would buy again.",
    "Does the job, no complaints.",
    "Terrible. Stopped charging after a week and support ignored me.",
    "Great, another 'premium' cable that frays in a month."
  )
)

# Tiny illustrative polarity lexicon (NOT production-grade).
pos <- c("fantastic", "great", "solid", "works", "love", "premium", "wanted", "again")
neg <- c("broke", "worst", "terrible", "stopped", "frays", "ignored", "cannot")

# Surface polarity: count of positive tokens minus count of negative tokens.
score_text <- function(x) {
  toks <- str_split(str_to_lower(x), "\\W+")[[1]]
  sum(toks %in% pos) - sum(toks %in% neg)
}

reviews <- reviews |>
  mutate(
    text_polarity = vapply(text, score_text, numeric(1), USE.NAMES = FALSE),
    star_sign = sign(stars - 3),  # +1 high, -1 low, 0 for a 3-star rating
    text_sign = sign(text_polarity),
    mismatch = star_sign != 0 & text_sign != 0 & star_sign != text_sign
  )

reviews |> select(stars, text_polarity, mismatch)
```

The flagged rows are candidates for irony (or sloppy rating). Mismatch is a
*screening* device, not a classifier: it has high recall for signaled cases and many
false positives (a genuinely mixed review also mismatches), so it routes documents to
review rather than relabeling them automatically.

### Sequence and context-aware models

Because sarcasm depends on word order (negation scope, the placement of the
incongruous phrase) and on context beyond the sentence, the modern literature moves
to sequence models: recurrent networks and, dominantly, attention-based transformers
that read the whole utterance and, where available, its surrounding thread, the
author's history, and the conversational target
[@tay2018sarcasm].
These models can represent the positive-surface/negative-context incongruity that
defines sarcasm, and they consistently outperform feature-engineered baselines on
benchmark corpora [@tay2018sarcasm; @joshi2017sarcasm].
Marketing has embraced deep text and image models for adjacent tasks (recovering
brand perceptions from consumer images [@liu2020] and structuring unstructured
review text [@buschken2016sentence; @netzer2008hidden]), so the tooling is
familiar.

Two cautions temper the optimism. First, **context is often unavailable** at scoring
time: a firm's review corpus may lack the author history or thread structure that
makes a benchmark tractable, so reported accuracies do not transfer. Second, the
benchmarks themselves are built largely from **self-labeled or distantly-supervised**
data (@sec-prevalence), which means the models learn to detect *signaled* sarcasm and
remain weak on the deadpan cases that matter most for measurement. The estimator is
strong; the identifying data are weak.

## Annotation: The Ground-Truth Problem {#sec-annotation}

Every supervised detector rests on labels, and sarcasm labels are unusually fragile.
Because sarcasm is recognized rather than decoded, two competent annotators reading
the same isolated text frequently disagree. Quantifying that disagreement is a
prerequisite for trusting any downstream number.

The standard instrument is **Cohen's $\kappa$**, the chance-corrected agreement
between two raters. With observed agreement $p_o$ and chance agreement $p_e$,

$$
\kappa = \frac{p_o - p_e}{1 - p_e},
$$ {#eq-07-kappa}

where $p_e = \sum_k \hat p_{1k}\,\hat p_{2k}$ sums the product of the two raters'
marginal rates over label categories $k$ [@Cohen_1960]. $\kappa = 1$ is perfect
agreement, $0$ is chance; conventional (and contested) thresholds read $0.41$–$0.60$
as "moderate" and $0.61$–$0.80$ as "substantial" [@Landis_1977]. Sarcasm annotation
on isolated short texts routinely lands in the low-to-moderate band, and rises only
when annotators are given conversational context [@wallace2014context].

The implication is structural: a detector cannot be more reliable than its labels, so
a benchmark "accuracy" of 0.85 against ground truth that itself carries $\kappa
\approx 0.6$ is reporting agreement with a noisy oracle, not with the truth.
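
The arithmetic behind the noisy-oracle point can be sketched in a few lines. This is
a minimal illustration rather than a result from the chapter: the helper
`measured_accuracy()` and the 10% label-error rate are hypothetical, and the formula
assumes binary labels with detector errors independent of label errors.

```{r}
#| label: noisy-oracle-sketch
# Hypothetical helper: share of documents on which a detector agrees with
# noisy reference labels. The two agree when both are right or both are wrong
# (binary labels, errors assumed independent).
measured_accuracy <- function(true_acc, label_err) {
  true_acc * (1 - label_err) + (1 - true_acc) * label_err
}

label_err <- 0.10                       # assumed share of wrong reference labels
acc_grid  <- c(0.70, 0.85, 0.95, 1.00)  # assumed true detector accuracies

data.frame(
  true_accuracy     = acc_grid,
  measured_accuracy = measured_accuracy(acc_grid, label_err)
)
```

Under these assumptions a detector that is never wrong still scores 0.90 against the
labels, so a reported accuracy is informative only alongside the reliability of the
labels behind it.
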
Human validation of this kind is not specific to sarcasm; it is the discipline the
entire text-as-data and unstructured-data program insists on before any model output
is treated as a measurement
[@humphreys2018automated; @berger2020uniting; @balducci2018unstructured].
When the construct can be captured by a validated dictionary, as
@rocklage2018evaluative do for emotionality, extremity, and valence with the
Evaluative Lexicon, the measure inherits that instrument's reliability; figurative
inversion is precisely the case where no fixed dictionary suffices and human-coded
context becomes indispensable.

```{r}
#| label: kappa-demo
#| message: false
#| warning: false
cohen_kappa <- function(r1, r2) {
  tab <- table(r1, r2)
  n   <- sum(tab)
  p_o <- sum(diag(tab)) / n                      # observed agreement
  p_e <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
  (p_o - p_e) / (1 - p_e)
}

set.seed(3)
# 200 texts; "deadpan" sarcasm where raters genuinely disagree.
truth <- rbinom(200, 1, 0.25)  # latent sarcasm

# Each rater detects signaled sarcasm well, deadpan poorly.
detect <- function(t) ifelse(t == 1,
  rbinom(length(t), 1, 0.65),  # catch ~65% when sarcastic
  rbinom(length(t), 1, 0.08))  # 8% false alarm

rater1 <- detect(truth)
rater2 <- detect(truth)
cat(sprintf("Cohen's kappa between two annotators: %.2f\n",
            cohen_kappa(rater1, rater2)))
```

The modest $\kappa$ is not a flaw in the simulated raters; it is the signature of a
construct that is inherently underdetermined by text alone. Honest reporting carries
this number forward into the error bars on any sentiment estimate built on top.

## What to Do About It

The practical posture this chapter recommends is *bound, don't pretend*. Three
strategies, in rough order of cost and rigor.

**Quantify the exposure.** Before deploying a sentiment pipeline, estimate sarcasm
prevalence $q$ in a hand-labeled sample of the *target* corpus (not a benchmark), and
propagate it into @eq-bias to bound how far the headline number can be off.
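
One way this step could look in practice is sketched below. All numbers are
hypothetical (a 400-document audit with 31 sarcastic labels, and a literal headline
estimate of 0.62), and the bound rests on a simplifying assumption: under pure
inversion only sarcastic documents are mislabeled, so the estimated positive share
can be off by at most $q$ in either direction. Non-sarcastic classifier error and the
sampling error of the headline itself are ignored.

```{r}
#| label: exposure-bound-sketch
# Hypothetical audit: 400 hand-labeled documents from the target corpus,
# 31 of which annotators judged sarcastic. Both counts are illustrative.
n_labeled   <- 400
n_sarcastic <- 31

# Point estimate and exact (Clopper-Pearson) 95% interval for prevalence q.
q_hat <- n_sarcastic / n_labeled
q_ci  <- binom.test(n_sarcastic, n_labeled)$conf.int

# Worst case under pure inversion: every sarcastic document is mislabeled in
# the same direction, so |bias| <= q. Use the upper end of the interval for q
# to keep the bound conservative.
headline <- 0.62  # assumed literal estimate of the positive share
bound <- c(lower = max(0, headline - q_ci[2]),
           upper = min(1, headline + q_ci[2]))

cat(sprintf("q_hat = %.3f, 95%% CI [%.3f, %.3f]\n", q_hat, q_ci[1], q_ci[2]))
cat(sprintf("Bounded range for the true positive share: [%.3f, %.3f]\n",
            bound[["lower"]], bound[["upper"]]))
```

The width of the resulting range is a direct readout of exposure: a narrow range
means sarcasm cannot move the conclusion, while a wide one means the headline number
should not be reported without it.
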
If $q$ is
small and roughly balanced across the comparisons of interest, sarcasm is a footnote;
if $q$ is large and concentrated in the negative tail, it is a threat to the central
claim. This costs a few hundred labels and is the single highest-value step.

**Screen and route, don't auto-correct.** Use cheap signals (rating–text mismatch,
incongruity features) to flag suspect documents and route them to human review or to
exclusion, rather than trusting a classifier to flip their labels. Screening
deliberately accepts low precision in exchange for recall and keeps the analyst in
the loop, which matters because a wrongly "corrected" label is worse than a flagged
one.

**Use context when you have it, and report when you don't.** Where author history,
thread structure, or conversational context exist, a context-aware model is worth its
cost; where they do not, say so, and treat the resulting estimate as a *bound*. The
worst outcome is a confident sentiment number whose figurative-language exposure was
never measured.

Finally, recognize the **scope conditions**. Sarcasm matters most for fine-grained,
document-level valence in negative, high-arousal, public text: brand firestorms,
service failures, controversy [@herhausen2022; @schweidel2014listening]. It matters
least for aggregate, long-horizon signals where idiosyncratic figurative error
averages out and the quantity of interest is a trend, not a label
[@tirunillai2012].
Knowing which regime one is in is more important than owning the
best classifier.

## Key Takeaways

- **Sarcasm is a measurement problem, not a vocabulary problem.** Its signature is a
  valence inversion between surface and intent (@eq-inversion); bag-of-words methods
  discard exactly the contextual and structural information needed to catch it.
- The induced error is **non-classical**: correlated with true sentiment and often
  with covariates, so it biases comparisons and can reverse signs, and *more data
  does not fix it* (@eq-bias).
- **Prevalence is heterogeneous and hard to measure**, higher on social platforms
  than in reviews, concentrated in negative tails, and bursty around the failures
  firms most want to monitor.
- Detection improves with **context** (@fig-detection-ladder), but benchmarks rest on
  self-labeled data and on labels with only **moderate inter-annotator agreement**
  (@eq-07-kappa), so reported accuracies overstate field performance.
- The defensible posture is to **quantify exposure, screen rather than auto-correct,
  and report uncertainty**: bounding the bias is usually worth more than chasing a
  brittle classifier.

## Further Reading

The text-analytics foundations this chapter builds on are developed elsewhere in the
book: extracting structure and meaning from consumer language [@netzer2008hidden;
@buschken2016sentence], the relationship between word-of-mouth valence and sales
[@chevalier2003measuring], firm response to negative posts and firestorms
[@herhausen2022; @proserpio2017], and social listening as a measurement enterprise
[@schweidel2014listening]. Readers should pair this chapter with the broader
treatment of user-generated content and sentiment so that figurative language is
handled as one important source of measurement error among several.