Marketing has a soundtrack. A call-center recording, a podcast ad read, a voice-assistant exchange, a thirty-second spot scored to a swelling string line, the timbre of a spokesperson’s voice: each is an acoustic artifact that carries information about what a person said and, just as importantly, how they said it. For most of the discipline’s history that information was inaccessible at scale. Audio could be played back and judged by a human coder, but it could not be turned into the kind of numbers that enter a demand system, a choice model, or a regression. What has changed is that a waveform can now be transcribed, characterized, and embedded automatically across millions of recordings, so the acoustic layer of marketing becomes data in the same operational sense that text (Chapter 43) and images (Chapter 45) already are.
The intellectual move is the one this part has made repeatedly. An unstructured artifact, here a one-dimensional pressure signal sampled in time, is mapped into a feature vector that a marketing theory can speak about. As with text and images, the representation is lossy and learned, and the central econometric caution carries over without modification: an acoustic feature is a generated regressor whose measurement error can correlate with the outcome it is meant to explain, so a naive regression confounds measurement with effect. Audio adds one distinctive opportunity, which organizes much of this chapter. A spoken utterance has two separable channels: the lexical channel (the words, recoverable by transcription and then handed to the entire text-as-data pipeline) and the paralinguistic channel (pitch, loudness, rhythm, voice quality, the acoustic carrier of emotion and emphasis). The distinction between what is said and how it is said is the payoff of treating audio as its own modality rather than as a noisy path to a transcript.
The chapter proceeds from signal to construct to application. It first surveys where audio enters marketing. It then fixes what a digital sound is and develops the classic acoustic features that have powered speech and music analysis for decades, with a runnable demonstration that synthesizes a voice-like waveform and extracts a spectrogram, mel-frequency cepstral coefficients, a fundamental frequency contour, and a spectral centroid from it. With those primitives in hand it turns to automatic speech recognition and speech analytics, to vocal emotion and prosody, and to music and sonic branding. It closes with the realities of production practice and a short look at the frontier.
46.1 Where Audio Enters Marketing
Audio reaches marketing through several distinct streams, each with its own data characteristics and its own measurement question.
The largest and least exploited stream is the service and sales voice channel. Contact centers and outbound sales organizations record enormous volumes of calls, historically for compliance and quality assurance, and these recordings hold both a transcript and an acoustic signal. The transcript answers what was discussed: the reason for the call, the objections raised, whether a resolution or a sale occurred. The acoustic signal answers how the exchange unfolded: whether the customer’s voice tightened with frustration, whether the agent’s pace and warmth tracked or diverged from the customer’s, where in the call the emotional tenor turned. Vocal-tone analytics on this channel promises to predict churn, satisfaction, conversion, and agent effectiveness from signals a transcript cannot see. This is the channel the unstructured-data program was built to exploit (Balducci and Marinova 2018), and it is the explicit subject of Balducci et al. (2025), whose guide to understanding customer–firm conversations walks through the full path from raw call audio to validated measures—segmentation, transcription, speaker diarization, acoustic-prosodic feature extraction, and the joint modeling of the lexical and paralinguistic channels—and is the natural methodological anchor for this chapter. Even so, a clean top-tier marketing study analyzing raw call-center acoustic features at scale with a verifiable identification strategy remains comparatively rare, which is precisely why the channel is a first-class opportunity. That voice carries economically consequential signal is already established outside marketing: Mayew and Venkatachalam (2012) show that vocal markers of managerial affect on earnings calls predict future firm performance, and Hobson, Mayew, and Venkatachalam (2012) detect financial misreporting from acoustic cues of cognitive dissonance in executive speech—evidence that the paralinguistic channel encodes information markets pay to recover.
A second stream is voice-assistant and voice-commerce interaction. Smart speakers and phone assistants (Alexa, Siri, Google Assistant) turn shopping, search, and brand contact into spoken dialogue, which raises questions both about the consumer’s relationship with the device and about the persuasive properties of the assistant’s synthesized voice. Schweitzer, Belk, Jordan, and Ortner (2019, Journal of Marketing Management, doi:10.1080/0267257X.2019.1596970) frame the consumer side: users build relationships with voice-controlled devices that range from treating the device as a servant to treating it as a friend or even a master, and that relational frame shapes adoption and reliance. On the persuasion side, Flavián, Akdim, and Casaló (2023, Psychology & Marketing, doi:10.1002/mar.21765) show experimentally that voice-based recommendations from a virtual assistant can move consumer behavior more than text-based online reviews, with credibility and usefulness mediating the effect. Voice commerce makes the acoustic properties of a machine’s voice a marketing-mix variable. Adoption of in-home assistants is governed by perceived usefulness, enjoyment, and social presence (McLean and Osei-Frimpong 2019), and sustained use feeds back into brand engagement (McLean, Osei-Frimpong, and Barhorst 2021); the services-marketing agenda these devices open—from the convenience calculus of voice shopping to the design of an assistant’s persona—is mapped by Klaus and Zaichkowsky (2020) and Klaus and Zaichkowsky (2022). The frontier is affective and strategic: Mari, Mandelli, and Algesheimer (2024) find that prosodically empathic assistants lift consumer responses in voice commerce, while Mari, Mandelli, and Algesheimer (2025) document an efficiency–autonomy trade-off when consumers delegate shopping to a voice agent. And because the recommendation now issues from a machine, the “word-of-machine” effect applies directly: consumers weight human versus algorithmic recommenders differently for utilitarian versus hedonic goals (Longoni and Cian 2022), a contingency every voice-commerce deployment inherits.
A third stream is advertising and branded audio: radio and television ad audio, podcast host-read and programmatic insertions, and the music and sound design of spots across media. Here the questions are classic advertising questions (recall, attitude, persuasion) asked of an acoustic stimulus, and the literature on music in advertising, reviewed below, is the oldest and most developed thread in the modality.
A fourth stream is sonic branding: the deliberate design of a brand’s auditory identity, from the multi-second sound logo (the Intel chime, the Netflix “ta-dum,” a startup tone) to the chosen voice of a brand’s spokes-character or assistant. Sonic branding treats sound the way visual branding treats a logo or color palette, as a recognizable, ownable, affect-laden asset.
Across all four streams the same architecture recurs and is worth adopting as a default. Run automatic speech recognition to obtain a transcript and feed it to the text pipeline; separately extract acoustic-prosodic features as their own signal; and model the two channels jointly, never collapsing how-it-was-said into what-was-said. The remainder of the chapter equips each half of that architecture.
46.2 What a Sound Is, and Its Classic Features
Formally, a digital audio signal is a function of time sampled on a uniform grid. A continuous sound pressure wave \(x(t)\) is recorded as a sequence
where \(f_s\) is the sampling rate in samples per second (hertz) and the number of samples \(N = f_s \cdot T\) for a recording of duration \(T\). The Nyquist-Shannon sampling theorem requires \(f_s\) to be at least twice the highest frequency of interest, which is why telephone speech is adequately captured at 8 kHz (human speech intelligibility lives mostly below 4 kHz) while music is sampled at 44.1 kHz to preserve content up to roughly 22 kHz. Each sample \(x[n]\) is a single amplitude value, the analogue of a pixel, and like a pixel it is almost meaningless in isolation. What carries information is structure across many samples: periodicity, spectral content, and how both evolve in time.
Because the informative structure is spectral and time-varying, the raw waveform is almost never modeled directly. The foundational transform is the short-time Fourier transform, which slides a short analysis window across the signal and computes, in each window, the distribution of energy across frequency. The squared magnitude of this transform, displayed as time on the horizontal axis, frequency on the vertical, and energy as intensity, is the spectrogram, the single most important visualization in audio analysis. A spectrogram of speech reveals the horizontal bands of energy (formants) that distinguish vowels, the broadband bursts of consonants, and the vertical striations of the glottal pulse train whose spacing encodes pitch.
From the spectrogram, a small set of classic features has organized speech and music analysis for four decades.
The fundamental frequency\(f_0\) is the rate at which the vocal folds vibrate, the acoustic correlate of perceived pitch. Its average level, its range, and the shape of its contour over an utterance carry speaker identity, emphasis, question-versus statement intonation, and a large share of vocal emotion.
Mel-frequency cepstral coefficients (MFCCs) are the workhorse representation of the spectral envelope, that is, of the overall shape of the spectrum that distinguishes one vowel or timbre from another independent of pitch. MFCCs were introduced by Davis and Mermelstein (1980, IEEE Transactions on Acoustics, Speech, and Signal Processing, doi:10.1109/TASSP.1980.1163420) and remain ubiquitous. They are computed by warping the spectrum onto the perceptual mel scale (which spaces frequencies the way the human ear resolves them, finely at low frequencies and coarsely at high), taking logarithms of the energy in a bank of mel filters, and applying a discrete cosine transform to decorrelate the result. The first dozen or so coefficients compactly summarize timbre and are the standard input to speaker, emotion, and (historically) speech recognition.
Spectral shape features summarize the distribution of energy across frequency in a single number each. The most common is the spectral centroid, the amplitude-weighted mean frequency, which corresponds to perceived brightness: a bright, sharp voice or a cymbal has a high centroid, a dull or muffled sound a low one. Related measures include spectral spread, rolloff, and flatness.
Energy and rhythm features (the short-time energy or loudness contour, the speaking rate, the distribution of pause durations) capture intensity and timing, which carry arousal, emphasis, and conversational dynamics.
Voice-quality features (jitter, the cycle-to-cycle variability in \(f_0\); shimmer, the variability in amplitude; and the harmonics-to-noise ratio) quantify the roughness, breathiness, or strain of a voice, and are sensitive markers of stress and emotion.
These features are interpretable, cheap, and defensible, the audio analogue of the hand-engineered color and composition features of Chapter 45. They were designed by acoustic phoneticians with explicit perceptual motivation, and they remain the right tool when the construct is well understood, the sample is modest, or interpretability is paramount. Table 46.1 maps each family of acoustic features to the marketing construct it most directly measures and to a representative study, so the demonstration that follows can be read as instrumenting constructs, not merely transforming a signal.
Table 46.1: Classic acoustic feature families, the perceptual and marketing constructs they carry, and representative work. The features are the audio analogue of hand-engineered visual features.
Feature family
Perceptual correlate
Marketing construct
Fundamental frequency (f0)
Pitch, intonation
Emotion, persuasion, gender
MFCCs (spectral envelope)
Timbre, vowel/voice identity
Speaker/emotion ID
Spectral shape (centroid, rolloff)
Brightness
Persuasiveness, affect
Energy & rhythm (loudness, rate, pauses)
Arousal, emphasis, pacing
Engagement, conversational dynamics
Voice quality (jitter, shimmer, HNR)
Roughness, breathiness, strain
Stress, sincerity, emotion
The move from these hand-engineered descriptors to a reproducible measurement workflow is the subject of a small methods literature aimed squarely at business and marketing researchers. Hildebrand et al. (2020) lay the conceptual foundations of voice analytics for business research—what the acoustic channel can and cannot measure, which features to extract, and how to validate them—and Busquet, Efthymiou, and Hildebrand (2024) stress-test that pipeline “in the wild,” showing that the validity and predictive accuracy of common acoustic features depend materially on the recording device, a direct caution for any study that mixes telephone, smart-speaker, and studio audio. Crucially, this line ships usable tooling rather than prescriptions alone.
Replication resources: voice analytics
The acoustic-feature pipeline of Hildebrand et al. (2020) and Busquet, Efthymiou, and Hildebrand (2024) is released as the open R package voiceR(Busquet and Hildebrand 2023), available on CRAN, which automates batch extraction, normalization, and summary of the features in Table 46.1 and is a turnkey way to reproduce the kind of analysis the demonstration below builds by hand. For raw signal processing the maintained R packages tuneR and seewave (used below) and the Python librosa, pyAudioAnalysis, and the standardized openSMILE feature sets remain the reference toolchain.
46.2.1 A Runnable Demonstration: From Waveform to Features
The demonstration below synthesizes a short voice-like waveform with known properties and then extracts the four classic features just described, using the maintained R packages tuneR and seewave. The signal is deliberately simple and fully simulated: a glottal source modeled as a fundamental plus harmonics, with a slow vibrato so the pitch contour is non-trivial, shaped by an amplitude envelope and contaminated with a little noise. Because the construction is known, the extracted features can be checked against ground truth, which is the point of a simulation. Everything here runs on synthetic data and uses only signal-processing functions, not deep models.
Code
# Maintained CRAN packages for audio analysis in R.library(tuneR)# Wave objects, MFCCs (melfcc)library(seewave)# spectrogram, fundamental frequency, spectral measuresset.seed(52)
Code
# ---- Synthesize a voice-like waveform with KNOWN properties ----fs<-16000# sampling rate (Hz); 16 kHz is standard for speechdur<-1.2# secondst<-seq(0, dur, by =1/fs)# time gridN<-length(t)f0_mean<-140# mean fundamental frequency (Hz), a low-ish voicevibrato<-6*sin(2*pi*4*t)# +/- 6 Hz pitch wobble at 4 Hz (a gentle vibrato)f0_t<-f0_mean+vibrato# instantaneous f0 over time# Instantaneous phase = 2*pi * integral of f0(t); cumulative sum approximates it.phase<-2*pi*cumsum(f0_t)/fs# A glottal-like source: fundamental + decaying harmonics (gives a vowel-ish timbre).harm_amp<-c(1.0, 0.6, 0.35, 0.2, 0.12)sig<-rep(0, N)for(hinseq_along(harm_amp)){sig<-sig+harm_amp[h]*sin(h*phase)}# Amplitude envelope (fade in/out) and a little additive noise (breathiness).env<-sin(pi*seq_len(N)/N)^0.5sig<-env*sig+rnorm(N, 0, 0.03)# Wrap as a normalized 16-bit mono Wave object (tuneR's core data structure).wav<-Wave(left =sig, samp.rate =fs, bit =16)wav<-normalize(wav, unit ="16")wav#> #> Wave Object#> Number of Samples: 19201#> Duration (seconds): 1.2#> Samplingrate (Hertz): 16000#> Channels (Mono/Stereo): Mono#> PCM (integer format): TRUE#> Bit (8/16/24/32/64): 16
The object wav is now an ordinary Wave, the same structure tuneR would produce from reading a .wav file off disk, so every step that follows applies identically to real recordings.
Code
# ---- Spectrogram: energy across time (x) and frequency (y) ----# wl is the analysis-window length in samples; ovlp the percent overlap.seewave::spectro(wav, f =fs, wl =512, ovlp =75, collevels =seq(-60, 0, 5), flim =c(0, 2), main ="Spectrogram (synthetic voice)")
Spectrogram of the synthesized voice-like signal. Horizontal bands are the harmonics of the fundamental; their gentle waviness is the simulated vibrato.
The spectrogram shows a stack of horizontal bands, the fundamental near 140 Hz and its harmonics above it, each rippling slightly with the 4 Hz vibrato that was built in. This is exactly the harmonic structure a sustained vowel produces.
Code
# ---- Fundamental frequency contour ----# seewave::fund returns a matrix of (time, f0-in-kHz); we convert to Hz.ff<-seewave::fund(wav, f =fs, wl =512, ovlp =75, fmax =600, plot =FALSE)f0_hz<-ff[, 2]*1000# kHz -> Hzf0_hz<-f0_hz[is.finite(f0_hz)&f0_hz>0]plot(f0_hz, type ="l", xlab ="Frame", ylab ="f0 (Hz)", main ="Estimated fundamental frequency")abline(h =f0_mean, col ="red", lty =2)# the true mean we synthesized
Estimated fundamental-frequency (pitch) contour. The estimate hovers near the 140 Hz that was synthesized and traces the vibrato.
The estimated pitch track sits close to the dashed line at the true mean of 140 Hz, recovering a property that was put into the signal by construction. In a real recording this same contour is what carries question intonation, emphasis, and much of the emotional signal developed below.
Code
# ---- Mel-frequency cepstral coefficients (timbre) ----# tuneR::melfcc returns a matrix: one row per analysis frame, one column per coefficient.mf<-melfcc(wav, sr =fs, numcep =13, wintime =0.025, hoptime =0.010)# 25 ms windows, 10 ms hopcat("MFCC matrix dimensions (frames x coefficients):",paste(dim(mf), collapse =" x "), "\n")#> MFCC matrix dimensions (frames x coefficients): 118 x 13# Per-utterance summary: the mean of each coefficient across frames.# Aggregating frame-level features to a fixed-length vector is the standard way# to turn a variable-length recording into one row of a design matrix.mfcc_means<-round(colMeans(mf), 3)names(mfcc_means)<-paste0("c", 0:12)mfcc_means#> c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 #> 94.973 4.071 15.124 8.855 1.362 -5.616 -8.119 -5.853 -0.553 3.384 4.121 #> c11 c12 #> 1.548 -2.113
The MFCC matrix has one row per short analysis frame and thirteen columns. A recording of any length therefore yields a variable number of frames, and the last lines show the standard device for handling that: summarize the frames (here by their mean) into a single fixed-length vector that can become one row of a regression or classifier design matrix. The first coefficient c0 reflects overall log-energy; the rest encode the spectral envelope, that is, the timbre.
Code
# ---- Spectral centroid (perceived brightness) ----# Compute the mean spectrum, then the amplitude-weighted mean frequency.ms<-seewave::meanspec(wav, f =fs, wl =512, plot =FALSE)freq_khz<-ms[, 1]# frequency axis in kHzamp<-ms[, 2]# relative amplitudecentroid_hz<-sum(freq_khz*amp)/sum(amp)*1000# seewave::specprop packages the same and related descriptors; 'cent' is the# centroid in Hz. We report both to show they agree.sp<-seewave::specprop(ms, f =fs)round(c(centroid_manual_hz =centroid_hz, centroid_specprop_hz =as.numeric(sp$cent)), 1)#> centroid_manual_hz centroid_specprop_hz #> 915.6 915.6
The spectral centroid, computed both by hand as the amplitude-weighted mean frequency and via seewave::specprop, lands in the low hundreds of hertz, consistent with a low-pitched voice whose energy is concentrated in the fundamental and its first few harmonics. A brighter sound (more high-frequency energy) would push this number up. With four lines each, this demonstration has turned a one-dimensional waveform into a pitch contour, a timbre vector, and a brightness scalar, the same primitives that feed every application that follows.
46.3 Automatic Speech Recognition and Speech Analytics
The lexical channel of audio is unlocked by automatic speech recognition (ASR), the task of transcribing speech to text. ASR is the bridge that lets the entire text-as-data apparatus of Chapter 43, that is, topic models, sentiment and stance classifiers, embeddings, and large-language-model extraction, operate on spoken marketing data. The history of ASR runs from the MFCC-plus-hidden-Markov systems of the 1980s and 1990s, through hybrid deep-neural-network systems in the 2010s, to today’s end-to-end sequence models trained on very large corpora. The contemporary reference point is OpenAI’s Whisper, an encoder-decoder transformer trained on a large multilingual, multitask corpus that produces robust transcripts across accents, noise, and domains and has become a common default for research pipelines. Wav2vec 2.0 represents the parallel self-supervised line: it learns speech representations from unlabeled audio that can then be fine-tuned for recognition with relatively little labeled data. These deep-ASR systems are best treated here as conceptual building blocks rather than something to run inline: they require substantial compute and large pretrained weights, and unlike the signal-processing demonstration above they are not lightweight enough to execute in a book’s render pipeline. The practical posture is to call them as a service or a preprocessing step, then verify their output, because transcription error is a measurement error that propagates into every downstream text feature.
Several capabilities sit alongside transcription and together constitute speech analytics. Speaker diarization segments a recording into “who spoke when,” essential for separating the agent from the customer in a service call or the host from the guest in a podcast. Speaker identification and verification match a voice to a known identity, underpinning voice biometrics and fraud detection. Keyword and intent spotting flags compliance-relevant phrases, product mentions, or competitor references without a full transcript. Language and accent identification route and segment multilingual corpora. In a production speech- analytics stack these run before or alongside ASR, and their errors, like ASR’s, are generated-feature errors that the downstream analysis must respect.
The strategic value of the lexical channel in marketing is that it converts the spoken word, previously locked in un-searchable audio, into the same representations that have made review text and social-media text so productive. A corpus of sales-call transcripts can be mined for the objections that precede a lost deal, the language that precedes a renewal, or the product features customers actually ask about, exactly as a corpus of reviews is mined for product attributes. And the style of the transcribed words, not only their content, carries outcome-relevant signal: Packard and Berger (2021) show that the concreteness of frontline-employee language raises customer satisfaction and purchase, recovered from the lexical channel of service interactions, while Melumad, Inman, and Pham (2019) show that the device a message is composed on systematically shifts its emotionality, a reminder that the transcript’s linguistic features are themselves shaped by the capture context. But transcription alone discards the paralinguistic channel, and that channel is where audio earns its place as a distinct modality.
46.4 Vocal Emotion and Prosody: How It Is Said
The acoustic-prosodic channel carries information that the words do not. The same sentence, “that’s just great,” can be sincere or sarcastic (Chapter 44), and the difference lives almost entirely in pitch, timing, and voice quality rather than in the lexical content. Prosody, the melody and rhythm of speech, comprises the \(f_0\) contour, the loudness contour, speaking rate and pausing, and voice quality. These are the features the demonstration above extracted, and they are the substrate of vocal emotion.
The scientific anchor is Scherer (2003, Speech Communication, doi:10.1016/S0167-6393(02)00084-5), whose review of vocal emotion communication lays out both the production side (how emotions systematically modulate acoustic parameters: anger raises and broadens pitch and energy, sadness lowers and narrows them, and so on) and the recognition side (how listeners decode those parameters). This is the theory that licenses treating acoustic features as measures of an emotional construct rather than as arbitrary numbers. On the engineering side, speech emotion recognition (SER) builds classifiers that map acoustic features to emotion labels. El Ayadi, Kamel, and Karray (2011, Pattern Recognition, doi:10.1016/j.patcog.2010.09.020) survey the classic pipeline (hand-engineered prosodic, spectral, and voice-quality features feeding a classifier) and the recurring difficulties: emotion is continuous and context-dependent, labeled datasets are small and often acted rather than spontaneous, and accuracy degrades sharply when a model trained on one corpus or language meets another. Modern SER replaces or augments the hand-engineered features with learned audio embeddings of the wav2vec family, but the validity problems the survey names (label quality, acted-versus-spontaneous mismatch, cross-corpus generalization) are exactly the construct-validity concerns (Chapter 3) that this book insists on, and they do not disappear when the feature extractor becomes a neural network.
Prosody also carries persuasion, not only emotion. Zoghaib (2019, Recherche et Applications en Marketing, doi:10.1177/2051570719828687) manipulates a speaker’s voice along acoustic dimensions and finds that lower-pitched, smoother (less rough), and duller (less bright) voices are the more persuasive, with effects that interact with speaker gender. This is a direct demonstration that the very features the demonstration computed, \(f_0\) level and spectral centroid and voice-quality roughness, are not bookkeeping quantities but marketing-relevant levers that shape consumer response to the same words. The managerial reading is that a brand choosing a spokesperson, a voice actor for an ad, or a timbre for a synthesized assistant is choosing a persuasion parameter, and that choice can now be measured acoustically rather than left to intuition.
The discipline’s standing caution applies with force here. Acoustic-emotion features are generated regressors, and the model that generates them was trained on data whose emotional labels, speaker demographics, and recording conditions may differ systematically from the marketing setting of interest. A vocal-frustration score that is more accurate for some accents than others, or for studio audio than for telephone audio, injects a bias that can correlate with the outcome. The remedy is the one used throughout this part: validate the generated feature against human judgment on a held-out sample from the target domain before trusting it in a downstream model.
46.5 Music and Sonic Branding
Music is the oldest and most developed thread in audio-as-marketing, and the research is unusually clear that music in advertising is not decoration but a processing variable. Hecker (1984, Psychology & Marketing, doi:10.1002/mar.4220010303) gives the early statement of music’s role in advertising effect. The empirical core comes from a sequence of careful studies. Milliman (1982, Journal of Marketing, doi:10.1177/002224298204600313) shows in a field setting that the tempo of background music changes the pace at which supermarket shoppers move and, with it, sales volume: slow music slows shoppers and raises spending. North, Hargreaves, and McKendrick (1999, Journal of Applied Psychology, doi:10.1037/0021-9010.84.2.271) demonstrate that the style of in-store music biases product choice, with French music lifting French wine sales and German music lifting German wine sales, an effect operating largely outside shoppers’ awareness. For advertising specifically, Kellaris and Cox (1989, Journal of Consumer Research, doi:10.1086/209199) reassess earlier claims about background music and persuasion, and Kellaris, Cox, and Cox (1993, Journal of Marketing, doi:10.1177/002224299305700409) provide the contingency account that organizes the field: music helps message reception when its attention-gaining properties and its congruency with the message work together, and can hurt when they conflict. The lesson is that music’s effect is moderated by fit, not uniformly positive, which is exactly why measuring musical content matters.
Measuring that content is the province of music information retrieval (MIR), which extracts tempo and beat, key and mode (major versus minor, a strong correlate of happy versus sad), timbre via the same MFCCs used for speech, and learned embeddings of valence and arousal. These features let a researcher characterize the music in thousands of ads or podcast segments and relate it to recall, attitude, and sales at a scale the classic studies could not reach. The bridge from the classic effects literature to MIR-based measurement at scale is one of the more tractable open opportunities in the modality.
Sonic branding extends music’s logic from the spot to the brand. Zoghaib, Luffarelli, and Feiereisen (2023, Psychology & Marketing, doi:10.1002/mar.21875) show that structural properties of a brand’s music, an irregular melodic contour and an unstable tonality, raise perceived brand innovativeness and brand evaluations, with processing difficulty as the mechanism. This connects sonic branding to the same aesthetic-complexity logic that governs visual branding in Chapter 45: a moderate, productive difficulty in processing a stimulus can enhance rather than diminish response. The sound logo, the brand voice, and the scoring of a brand’s content are, on this evidence, designable assets whose acoustic properties carry measurable equity.
46.6 Industry and Production Practice
Bringing audio analysis into a working marketing organization confronts a set of realities the academic framing can understate.
The first is data access and consent. Call recordings, voice-assistant logs, and any audio containing identifiable voices are sensitive personal data. The human voice is a biometric identifier, and many jurisdictions regulate it specifically; two-party-consent recording laws, the need for a lawful basis to process voice data, and rules on automated decision-making all constrain what can be collected and modeled. The privacy and governance considerations of Chapter 24 are not an afterthought for audio but a precondition for it, and they shape the dataset before any feature is extracted.
The second is audio quality and channel effects. Telephone audio is band-limited to roughly 300 to 3400 Hz and compressed by lossy codecs; far-field smart-speaker audio carries room reverberation and background noise; podcast audio is studio-clean but heavily processed. These channel differences alter the very features a model relies on, so a pitch or emotion model trained on one channel can fail on another. Robust practice fixes the sampling rate and channel, applies consistent preprocessing (resampling, normalization, optional noise reduction), and validates within the channel of deployment.
The third is the pipeline architecture that the chapter has advocated throughout, now stated as engineering. A production system typically runs voice-activity detection to find speech, diarization to separate speakers, ASR to transcribe, and a parallel acoustic-feature extractor for prosody and emotion, then fuses the lexical and paralinguistic streams for the downstream task. Each stage is a model with its own error rate, and those errors compound, so monitoring and human spot- checking at each stage are not optional.
The fourth is build-versus-buy. A mature ecosystem of commercial speech-to-text, diarization, and emotion APIs (from the major cloud providers and specialized vendors) makes it unnecessary to train ASR from scratch, while open-source models (Whisper, wav2vec 2.0) and toolkits (librosa and pyAudioAnalysis in Python, openSMILE for standardized acoustic feature sets, and tuneR and seewave in R for the signal-processing primitives demonstrated above) make a transparent, auditable in-house pipeline feasible. The choice turns on data sensitivity (a regulated voice corpus may not leave the premises), the need for auditability in a research setting, and cost at scale. For the construct-validity reasons this book stresses, a research pipeline benefits from the transparency of open tooling even when a commercial API would be more accurate out of the box.
46.7 Frontier and Expansion
Audio is the youngest of the well-developed marketing data modalities, and several frontiers are visible. The first is self-supervised audio foundation models. Just as wav2vec 2.0 learns speech representations from unlabeled audio and Whisper learns robust transcription from weak supervision at scale, general-purpose audio embeddings increasingly provide a single representation that supports emotion, speaker, and content tasks with light fine-tuning, lowering the barrier to acoustic measurement in marketing much as pretrained image and text models did for those modalities. The second is multimodal fusion, the natural endpoint of this part: a podcast ad, a TikTok video (Chapter 43 and the image and video threads), or a live-commerce stream is simultaneously audio, speech, on-screen text, and image, and the acoustic features developed here are one stream to be fused with the others rather than analyzed alone. The third is generative and synthetic voice: text-to-speech has become good enough that brands synthesize spokesperson voices and personalize audio at scale, which turns voice from a measured variable into a designed one and raises fresh questions of authenticity, disclosure, and consumer trust. The fourth is the still-open call-analytics opportunity flagged at the start: the service and sales voice channel remains comparatively under-studied in top-tier marketing research, and a credible identification strategy applied to acoustic call features at scale is among the clearer contributions the modality invites.
The synthesizing survey of machine learning and AI in marketing by Ma and Sun (2020, International Journal of Research in Marketing, doi:10.1016/j.ijresmar.2020.04.005) situates audio within the broader move to connect computational representations of unstructured data to human marketing insight, and it is the right place to read this chapter’s modality back into the whole. The consistent thread, from the MFCCs of 1980 to the foundation models of today, is the one this part repeats for every modality: a sound becomes a learned, lossy feature vector; that vector is a generated regressor; and the marketing payoff comes from separating what is said from how it is said, then defending both measurements before trusting either.
Balducci, Bitty, and Detelina Marinova. 2018. “Unstructured Data in Marketing.”Journal of the Academy of Marketing Science 46 (4): 557–90. https://doi.org/10.1007/s11747-018-0581-x.
Balducci, Bitty, Bin Pang, Lingshu Hu, Can Li, Wenbo Wang, Yi Shang, Detelina Marinova, and Matt Gordon. 2025. “Leveraging Audio Data: A Guide to Understanding Customer-Firm Conversations.”Marketing Letters 37 (1). https://doi.org/10.1007/s11002-025-09797-z.
Busquet, Francesc, Fotis Efthymiou, and Christian Hildebrand. 2024. “Voice Analytics in the Wild: Validity and Predictive Accuracy of Common Audio-Recording Devices.”Behavior Research Methods 56 (3): 2114–34. https://doi.org/10.3758/s13428-023-02139-9.
Hildebrand, Christian, Fotis Efthymiou, Francesc Busquet, William H. Hampton, Donna L. Hoffman, and Thomas P. Novak. 2020. “Voice Analytics in Business Research: Conceptual Foundations, Acoustic Feature Extraction, and Applications.”Journal of Business Research 121: 364–74. https://doi.org/10.1016/j.jbusres.2020.09.020.
Hobson, Jessen L., William J. Mayew, and Mohan Venkatachalam. 2012. “Analyzing Speech to Detect Financial Misreporting.”Journal of Accounting Research 50 (2): 349–92. https://doi.org/10.1111/j.1475-679X.2011.00433.x.
Klaus, Phil, and Judith Lynne Zaichkowsky. 2022. “The Convenience of Shopping via Voice AI: Introducing AIDM.”Journal of Retailing and Consumer Services 65: 102490. https://doi.org/10.1016/j.jretconser.2021.102490.
Klaus, Phil, and Judy Zaichkowsky. 2020. “AI Voice Bots: A Services Marketing Research Agenda.”Journal of Services Marketing 34 (3): 389–98. https://doi.org/10.1108/jsm-01-2019-0043.
Longoni, Chiara, and Luca Cian. 2022. “Artificial Intelligence in Utilitarian Vs. Hedonic Contexts: The Word-of-Machine Effect.”Journal of Marketing 86 (1): 91–108. https://doi.org/10.1177/0022242920957347.
Mari, Alex, Andreina Mandelli, and René Algesheimer. 2024. “Empathic Voice Assistants: Enhancing Consumer Responses in Voice Commerce.”Journal of Business Research 175: 114566. https://doi.org/10.1016/j.jbusres.2024.114566.
———. 2025. “AI-Assisted Shopping on Voice Assistants: The Efficiency–Autonomy Consumer Decision Trade-Off.”Journal of Interactive Marketing. https://doi.org/10.1177/10949968251400616.
Mayew, William J., and Mohan Venkatachalam. 2012. “The Power of Voice: Managerial Affective States and Future Firm Performance.”The Journal of Finance 67 (1): 1–43. https://doi.org/10.1111/j.1540-6261.2011.01705.x.
McLean, Graeme, and Kofi Osei-Frimpong. 2019. “Hey Alexa: Examine the Variables Influencing the Use of Artificial Intelligent in-Home Voice Assistants.”Computers in Human Behavior 99: 28–37. https://doi.org/10.1016/j.chb.2019.05.009.
McLean, Graeme, Kofi Osei-Frimpong, and Jennifer Barhorst. 2021. “Alexa, Do Voice Assistants Influence Consumer Brand Engagement? Examining the Role of AI Powered Voice Assistants in Influencing Consumer Brand Engagement.”Journal of Business Research 124: 312–28. https://doi.org/10.1016/j.jbusres.2020.11.045.
Melumad, Shiri, J. Jeffrey Inman, and Michel Tuan Pham. 2019. “Selectively Emotional: How Smartphone Use Changes User-Generated Content.”Journal of Marketing Research 56 (2): 259–75. https://doi.org/10.1177/0022243718815429.
Packard, Grant, and Jonah Berger. 2021. “How Concrete Language Shapes Customer Satisfaction.”Journal of Consumer Research 47 (5): 787–806. https://doi.org/10.1093/jcr/ucaa038.
# Audio, Voice, and Speech {#sec-audio-voice}Marketing has a soundtrack. A call-center recording, a podcast ad read, avoice-assistant exchange, a thirty-second spot scored to a swelling string line,the timbre of a spokesperson's voice: each is an acoustic artifact that carriesinformation about what a person said and, just as importantly, how they said it.For most of the discipline's history that information was inaccessible at scale.Audio could be played back and judged by a human coder, but it could not be turnedinto the kind of numbers that enter a demand system, a choice model, or aregression. What has changed is that a waveform can now be transcribed,characterized, and embedded automatically across millions of recordings, so theacoustic layer of marketing becomes data in the same operational sense that text(@sec-text-as-data) and images (@sec-image-processing) already are.The intellectual move is the one this part has made repeatedly. An unstructuredartifact, here a one-dimensional pressure signal sampled in time, is mapped into afeature vector that a marketing theory can speak about. As with text and images,the representation is lossy and learned, and the central econometric cautioncarries over without modification: an acoustic feature is a generated regressorwhose measurement error can correlate with the outcome it is meant to explain, soa naive regression confounds measurement with effect. Audio adds one distinctiveopportunity, which organizes much of this chapter. A spoken utterance has twoseparable channels: the lexical channel (the words, recoverable by transcriptionand then handed to the entire text-as-data pipeline) and the paralinguisticchannel (pitch, loudness, rhythm, voice quality, the acoustic carrier of emotionand emphasis). The distinction between *what is said* and *how it is said* is thepayoff of treating audio as its own modality rather than as a noisy path to atranscript.The chapter proceeds from signal to construct to application. It first surveyswhere audio enters marketing. It then fixes what a digital sound is and developsthe classic acoustic features that have powered speech and music analysis fordecades, with a runnable demonstration that synthesizes a voice-like waveform andextracts a spectrogram, mel-frequency cepstral coefficients, a fundamentalfrequency contour, and a spectral centroid from it. With those primitives in handit turns to automatic speech recognition and speech analytics, to vocal emotionand prosody, and to music and sonic branding. It closes with the realities ofproduction practice and a short look at the frontier.## Where Audio Enters Marketing {#sec-audio-applications}Audio reaches marketing through several distinct streams, each with its own datacharacteristics and its own measurement question.The largest and least exploited stream is the *service and sales voice channel*.Contact centers and outbound sales organizations record enormous volumes of calls,historically for compliance and quality assurance, and these recordings hold botha transcript and an acoustic signal. The transcript answers what was discussed:the reason for the call, the objections raised, whether a resolution or a saleoccurred. The acoustic signal answers how the exchange unfolded: whether thecustomer's voice tightened with frustration, whether the agent's pace and warmthtracked or diverged from the customer's, where in the call the emotional tenorturned. Vocal-tone analytics on this channel promises to predict churn,satisfaction, conversion, and agent effectiveness from signals a transcript cannotsee. This is the channel the unstructured-data program was built to exploit[@balducci2018unstructured], and it is the explicit subject of @balducci2025audio,whose guide to understanding customer–firm conversations walks through the fullpath from raw call audio to validated measures—segmentation, transcription, speakerdiarization, acoustic-prosodic feature extraction, and the joint modeling of thelexical and paralinguistic channels—and is the natural methodological anchor forthis chapter. Even so, a clean top-tier marketing study analyzing raw call-centeracoustic features at scale with a verifiable identification strategy remainscomparatively rare, which is precisely why the channel is a first-classopportunity. That voice carries economically consequential signal is alreadyestablished outside marketing: @mayew2012power show that vocal markers ofmanagerial affect on earnings calls predict future firm performance, and@hobson2012analyzing detect financial misreporting from acoustic cues of cognitivedissonance in executive speech—evidence that the paralinguistic channel encodesinformation markets pay to recover.A second stream is *voice-assistant and voice-commerce* interaction. Smartspeakers and phone assistants (Alexa, Siri, Google Assistant) turn shopping,search, and brand contact into spoken dialogue, which raises questions both aboutthe consumer's relationship with the device and about the persuasive properties ofthe assistant's synthesized voice. Schweitzer, Belk, Jordan, and Ortner (2019,*Journal of Marketing Management*, doi:10.1080/0267257X.2019.1596970) frame theconsumer side: users build relationships with voice-controlled devices that rangefrom treating the device as a servant to treating it as a friend or even a master,and that relational frame shapes adoption and reliance. On the persuasion side,Flavián, Akdim, and Casaló (2023, *Psychology & Marketing*,doi:10.1002/mar.21765) show experimentally that voice-based recommendations from avirtual assistant can move consumer behavior more than text-based online reviews,with credibility and usefulness mediating the effect. Voice commerce makes theacoustic properties of a machine's voice a marketing-mix variable. Adoption ofin-home assistants is governed by perceived usefulness, enjoyment, and socialpresence [@mclean2019heyalexa], and sustained use feeds back into brand engagement[@mclean2021alexa]; the services-marketing agenda these devices open—from theconvenience calculus of voice shopping to the design of an assistant's persona—ismapped by @klaus2020aivoicebots and @klaus2022aidm. The frontier is affective andstrategic: @mari2024empathic find that prosodically empathic assistants liftconsumer responses in voice commerce, while @mari2025aiassisted document anefficiency–autonomy trade-off when consumers delegate shopping to a voice agent.And because the recommendation now issues from a machine, the "word-of-machine"effect applies directly: consumers weight human versus algorithmic recommendersdifferently for utilitarian versus hedonic goals [@longoni2022wordofmachine], acontingency every voice-commerce deployment inherits.A third stream is *advertising and branded audio*: radio and television ad audio,podcast host-read and programmatic insertions, and the music and sound design ofspots across media. Here the questions are classic advertising questions (recall,attitude, persuasion) asked of an acoustic stimulus, and the literature on musicin advertising, reviewed below, is the oldest and most developed thread in themodality.A fourth stream is *sonic branding*: the deliberate design of a brand's auditoryidentity, from the multi-second sound logo (the Intel chime, the Netflix"ta-dum," a startup tone) to the chosen voice of a brand's spokes-character orassistant. Sonic branding treats sound the way visual branding treats a logo orcolor palette, as a recognizable, ownable, affect-laden asset.Across all four streams the same architecture recurs and is worth adopting as adefault. Run automatic speech recognition to obtain a transcript and feed it to thetext pipeline; separately extract acoustic-prosodic features as their own signal;and model the two channels jointly, never collapsing how-it-was-said intowhat-was-said. The remainder of the chapter equips each half of that architecture.## What a Sound Is, and Its Classic Features {#sec-audio-features}Formally, a digital audio signal is a function of time sampled on a uniform grid.A continuous sound pressure wave $x(t)$ is recorded as a sequence$$x[n] = x(n / f_s), \qquad n = 0, 1, \dots, N-1,$$ {#eq-audio-sampling}where $f_s$ is the sampling rate in samples per second (hertz) and the number ofsamples $N = f_s \cdot T$ for a recording of duration $T$. TheNyquist-Shannon sampling theorem requires $f_s$ to be at least twice the highestfrequency of interest, which is why telephone speech is adequately captured at 8kHz (human speech intelligibility lives mostly below 4 kHz) while music is sampledat 44.1 kHz to preserve content up to roughly 22 kHz. Each sample $x[n]$ is asingle amplitude value, the analogue of a pixel, and like a pixel it is almostmeaningless in isolation. What carries information is structure across manysamples: periodicity, spectral content, and how both evolve in time.Because the informative structure is spectral and time-varying, the raw waveformis almost never modeled directly. The foundational transform is theshort-time Fourier transform, which slides a short analysis window across thesignal and computes, in each window, the distribution of energy across frequency.The squared magnitude of this transform, displayed as time on the horizontal axis,frequency on the vertical, and energy as intensity, is the *spectrogram*, thesingle most important visualization in audio analysis. A spectrogram of speechreveals the horizontal bands of energy (formants) that distinguish vowels, thebroadband bursts of consonants, and the vertical striations of the glottal pulsetrain whose spacing encodes pitch.From the spectrogram, a small set of classic features has organized speech andmusic analysis for four decades.The *fundamental frequency* $f_0$ is the rate at which the vocal folds vibrate, theacoustic correlate of perceived pitch. Its average level, its range, and the shapeof its contour over an utterance carry speaker identity, emphasis, question-versusstatement intonation, and a large share of vocal emotion.*Mel-frequency cepstral coefficients* (MFCCs) are the workhorse representation ofthe spectral envelope, that is, of the overall shape of the spectrum thatdistinguishes one vowel or timbre from another independent of pitch. MFCCs wereintroduced by Davis and Mermelstein (1980, *IEEE Transactions on Acoustics, Speech,and Signal Processing*, doi:10.1109/TASSP.1980.1163420) and remain ubiquitous. Theyare computed by warping the spectrum onto the perceptual *mel* scale (which spacesfrequencies the way the human ear resolves them, finely at low frequencies andcoarsely at high), taking logarithms of the energy in a bank of mel filters, andapplying a discrete cosine transform to decorrelate the result. The first dozen orso coefficients compactly summarize timbre and are the standard input to speaker,emotion, and (historically) speech recognition.*Spectral shape* features summarize the distribution of energy across frequency ina single number each. The most common is the *spectral centroid*, theamplitude-weighted mean frequency, which corresponds to perceived brightness: abright, sharp voice or a cymbal has a high centroid, a dull or muffled sound a lowone. Related measures include spectral spread, rolloff, and flatness.*Energy and rhythm* features (the short-time energy or loudness contour, thespeaking rate, the distribution of pause durations) capture intensity and timing,which carry arousal, emphasis, and conversational dynamics.*Voice-quality* features (jitter, the cycle-to-cycle variability in $f_0$; shimmer,the variability in amplitude; and the harmonics-to-noise ratio) quantify theroughness, breathiness, or strain of a voice, and are sensitive markers of stressand emotion.These features are interpretable, cheap, and defensible, the audio analogue of thehand-engineered color and composition features of @sec-image-processing. They weredesigned by acoustic phoneticians with explicit perceptual motivation, and theyremain the right tool when the construct is well understood, the sample is modest,or interpretability is paramount. @tbl-audio-features maps each family of acousticfeatures to the marketing construct it most directly measures and to a representativestudy, so the demonstration that follows can be read as instrumenting constructs,not merely transforming a signal.```{r}#| label: tbl-audio-features#| tbl-cap: "Classic acoustic feature families, the perceptual and marketing constructs they carry, and representative work. The features are the audio analogue of hand-engineered visual features."#| message: false#| warning: falselibrary(knitr)af <-data.frame(`Feature family`=c("Fundamental frequency (f0)", "MFCCs (spectral envelope)","Spectral shape (centroid, rolloff)", "Energy & rhythm (loudness, rate, pauses)","Voice quality (jitter, shimmer, HNR)"),`Perceptual correlate`=c("Pitch, intonation", "Timbre, vowel/voice identity","Brightness", "Arousal, emphasis, pacing", "Roughness, breathiness, strain"),`Marketing construct`=c("Emotion, persuasion, gender", "Speaker/emotion ID","Persuasiveness, affect", "Engagement, conversational dynamics","Stress, sincerity, emotion"),check.names =FALSE)kable(af)```The move from these hand-engineered descriptors to a reproducible measurement*workflow* is the subject of a small methods literature aimed squarely at businessand marketing researchers. @hildebrand2020voice lay the conceptual foundations ofvoice analytics for business research—what the acoustic channel can and cannotmeasure, which features to extract, and how to validate them—and @busquet2024voicestress-test that pipeline "in the wild," showing that the validity and predictiveaccuracy of common acoustic features depend materially on the recording device, adirect caution for any study that mixes telephone, smart-speaker, and studio audio.Crucially, this line ships usable tooling rather than prescriptions alone.::: {.callout-tip}## Replication resources: voice analyticsThe acoustic-feature pipeline of @hildebrand2020voice and @busquet2024voice isreleased as the open R package **voiceR** [@busquet2023voicer], available on CRAN,which automates batch extraction, normalization, and summary of the features in@tbl-audio-features and is a turnkey way to reproduce the kind of analysis the demonstrationbelow builds by hand. For raw signal processing the maintained R packages `tuneR`and `seewave` (used below) and the Python `librosa`, `pyAudioAnalysis`, and thestandardized `openSMILE` feature sets remain the reference toolchain.:::### A Runnable Demonstration: From Waveform to Features {#sec-audio-demo}The demonstration below synthesizes a short voice-like waveform with knownproperties and then extracts the four classic features just described, using themaintained R packages `tuneR` and `seewave`. The signal is deliberately simple andfully simulated: a glottal source modeled as a fundamental plus harmonics, with aslow vibrato so the pitch contour is non-trivial, shaped by an amplitude envelopeand contaminated with a little noise. Because the construction is known, theextracted features can be checked against ground truth, which is the point of asimulation. Everything here runs on synthetic data and uses only signal-processingfunctions, not deep models.```{r audio-setup, message=FALSE, warning=FALSE}# Maintained CRAN packages for audio analysis in R.library(tuneR) # Wave objects, MFCCs (melfcc)library(seewave) # spectrogram, fundamental frequency, spectral measuresset.seed(52)``````{r audio-synthesize, message=FALSE, warning=FALSE}# ---- Synthesize a voice-like waveform with KNOWN properties ----fs <-16000# sampling rate (Hz); 16 kHz is standard for speechdur <-1.2# secondst <-seq(0, dur, by =1/ fs) # time gridN <-length(t)f0_mean <-140# mean fundamental frequency (Hz), a low-ish voicevibrato <-6*sin(2* pi *4* t) # +/- 6 Hz pitch wobble at 4 Hz (a gentle vibrato)f0_t <- f0_mean + vibrato # instantaneous f0 over time# Instantaneous phase = 2*pi * integral of f0(t); cumulative sum approximates it.phase <-2* pi *cumsum(f0_t) / fs# A glottal-like source: fundamental + decaying harmonics (gives a vowel-ish timbre).harm_amp <-c(1.0, 0.6, 0.35, 0.2, 0.12)sig <-rep(0, N)for (h inseq_along(harm_amp)) { sig <- sig + harm_amp[h] *sin(h * phase)}# Amplitude envelope (fade in/out) and a little additive noise (breathiness).env <-sin(pi *seq_len(N) / N)^0.5sig <- env * sig +rnorm(N, 0, 0.03)# Wrap as a normalized 16-bit mono Wave object (tuneR's core data structure).wav <-Wave(left = sig, samp.rate = fs, bit =16)wav <-normalize(wav, unit ="16")wav```The object `wav` is now an ordinary `Wave`, the same structure `tuneR` wouldproduce from reading a `.wav` file off disk, so every step that follows appliesidentically to real recordings.```{r audio-spectrogram, message=FALSE, warning=FALSE, fig.cap="Spectrogram of the synthesized voice-like signal. Horizontal bands are the harmonics of the fundamental; their gentle waviness is the simulated vibrato.", fig.width=6, fig.height=4}# ---- Spectrogram: energy across time (x) and frequency (y) ----# wl is the analysis-window length in samples; ovlp the percent overlap.seewave::spectro(wav, f = fs, wl =512, ovlp =75,collevels =seq(-60, 0, 5), flim =c(0, 2),main ="Spectrogram (synthetic voice)")```The spectrogram shows a stack of horizontal bands, the fundamental near 140 Hz andits harmonics above it, each rippling slightly with the 4 Hz vibrato that was builtin. This is exactly the harmonic structure a sustained vowel produces.```{r audio-f0, message=FALSE, warning=FALSE, fig.cap="Estimated fundamental-frequency (pitch) contour. The estimate hovers near the 140 Hz that was synthesized and traces the vibrato.", fig.width=6, fig.height=3.5}# ---- Fundamental frequency contour ----# seewave::fund returns a matrix of (time, f0-in-kHz); we convert to Hz.ff <- seewave::fund(wav, f = fs, wl =512, ovlp =75,fmax =600, plot =FALSE)f0_hz <- ff[, 2] *1000# kHz -> Hzf0_hz <- f0_hz[is.finite(f0_hz) & f0_hz >0]plot(f0_hz, type ="l", xlab ="Frame", ylab ="f0 (Hz)",main ="Estimated fundamental frequency")abline(h = f0_mean, col ="red", lty =2) # the true mean we synthesizedround(c(true_f0_mean = f0_mean,est_f0_median =median(f0_hz)), 1)```The estimated pitch track sits close to the dashed line at the true mean of 140 Hz,recovering a property that was put into the signal by construction. In a realrecording this same contour is what carries question intonation, emphasis, and muchof the emotional signal developed below.```{r audio-mfcc, message=FALSE, warning=FALSE}# ---- Mel-frequency cepstral coefficients (timbre) ----# tuneR::melfcc returns a matrix: one row per analysis frame, one column per coefficient.mf <-melfcc(wav, sr = fs, numcep =13,wintime =0.025, hoptime =0.010) # 25 ms windows, 10 ms hopcat("MFCC matrix dimensions (frames x coefficients):",paste(dim(mf), collapse =" x "), "\n")# Per-utterance summary: the mean of each coefficient across frames.# Aggregating frame-level features to a fixed-length vector is the standard way# to turn a variable-length recording into one row of a design matrix.mfcc_means <-round(colMeans(mf), 3)names(mfcc_means) <-paste0("c", 0:12)mfcc_means```The MFCC matrix has one row per short analysis frame and thirteen columns. Arecording of any length therefore yields a variable number of frames, and thelast lines show the standard device for handling that: summarize the frames (hereby their mean) into a single fixed-length vector that can become one row of aregression or classifier design matrix. The first coefficient `c0` reflectsoverall log-energy; the rest encode the spectral envelope, that is, the timbre.```{r audio-centroid, message=FALSE, warning=FALSE}# ---- Spectral centroid (perceived brightness) ----# Compute the mean spectrum, then the amplitude-weighted mean frequency.ms <- seewave::meanspec(wav, f = fs, wl =512, plot =FALSE)freq_khz <- ms[, 1] # frequency axis in kHzamp <- ms[, 2] # relative amplitudecentroid_hz <-sum(freq_khz * amp) /sum(amp) *1000# seewave::specprop packages the same and related descriptors; 'cent' is the# centroid in Hz. We report both to show they agree.sp <- seewave::specprop(ms, f = fs)round(c(centroid_manual_hz = centroid_hz,centroid_specprop_hz =as.numeric(sp$cent)), 1)```The spectral centroid, computed both by hand as the amplitude-weighted meanfrequency and via `seewave::specprop`, lands in the low hundreds of hertz,consistent with a low-pitched voice whose energy is concentrated in thefundamental and its first few harmonics. A brighter sound (more high-frequencyenergy) would push this number up. With four lines each, this demonstration hasturned a one-dimensional waveform into a pitch contour, a timbre vector, and abrightness scalar, the same primitives that feed every application that follows.## Automatic Speech Recognition and Speech Analytics {#sec-audio-asr}The lexical channel of audio is unlocked by *automatic speech recognition* (ASR),the task of transcribing speech to text. ASR is the bridge that lets the entiretext-as-data apparatus of @sec-text-as-data, that is, topic models, sentiment andstance classifiers, embeddings, and large-language-model extraction, operate onspoken marketing data. The history of ASR runs from the MFCC-plus-hidden-Markovsystems of the 1980s and 1990s, through hybrid deep-neural-network systems in the2010s, to today's end-to-end sequence models trained on very large corpora. Thecontemporary reference point is OpenAI's Whisper, an encoder-decoder transformertrained on a large multilingual, multitask corpus that produces robust transcriptsacross accents, noise, and domains and has become a common default for researchpipelines. Wav2vec 2.0 represents the parallel self-supervised line: it learnsspeech representations from unlabeled audio that can then be fine-tuned forrecognition with relatively little labeled data. These deep-ASR systems are besttreated here as conceptual building blocks rather than something to run inline:they require substantial compute and large pretrained weights, and unlike thesignal-processing demonstration above they are not lightweight enough to execute ina book's render pipeline. The practical posture is to call them as a service or apreprocessing step, then verify their output, because transcription error is ameasurement error that propagates into every downstream text feature.Several capabilities sit alongside transcription and together constitute *speechanalytics*. *Speaker diarization* segments a recording into "who spoke when,"essential for separating the agent from the customer in a service call or the hostfrom the guest in a podcast. *Speaker identification and verification* match avoice to a known identity, underpinning voice biometrics and fraud detection.*Keyword and intent spotting* flags compliance-relevant phrases, product mentions,or competitor references without a full transcript. *Language and accentidentification* route and segment multilingual corpora. In a production speech-analytics stack these run before or alongside ASR, and their errors, like ASR's,are generated-feature errors that the downstream analysis must respect.The strategic value of the lexical channel in marketing is that it converts thespoken word, previously locked in un-searchable audio, into the samerepresentations that have made review text and social-media text so productive. Acorpus of sales-call transcripts can be mined for the objections that precede alost deal, the language that precedes a renewal, or the product features customersactually ask about, exactly as a corpus of reviews is mined for productattributes. And the *style* of the transcribed words, not only their content,carries outcome-relevant signal: @packard2021concrete show that the concreteness offrontline-employee language raises customer satisfaction and purchase, recoveredfrom the lexical channel of service interactions, while @melumad2019selectively showthat the device a message is composed on systematically shifts its emotionality, areminder that the transcript's linguistic features are themselves shaped by thecapture context. But transcription alone discards the paralinguistic channel, and thatchannel is where audio earns its place as a distinct modality.## Vocal Emotion and Prosody: How It Is Said {#sec-audio-prosody}The acoustic-prosodic channel carries information that the words do not. The samesentence, "that's just great," can be sincere or sarcastic (@sec-sarcasm), and thedifference lives almost entirely in pitch, timing, and voice quality rather than inthe lexical content. *Prosody*, the melody and rhythm of speech, comprises the$f_0$ contour, the loudness contour, speaking rate and pausing, and voice quality.These are the features the demonstration above extracted, and they are thesubstrate of vocal emotion.The scientific anchor is Scherer (2003, *Speech Communication*,doi:10.1016/S0167-6393(02)00084-5), whose review of vocal emotion communicationlays out both the production side (how emotions systematically modulateacoustic parameters: anger raises and broadens pitch and energy, sadness lowers andnarrows them, and so on) and the recognition side (how listeners decode thoseparameters). This is the theory that licenses treating acoustic features asmeasures of an emotional construct rather than as arbitrary numbers. On theengineering side, *speech emotion recognition* (SER) builds classifiers that mapacoustic features to emotion labels. El Ayadi, Kamel, and Karray (2011,*Pattern Recognition*, doi:10.1016/j.patcog.2010.09.020) survey the classicpipeline (hand-engineered prosodic, spectral, and voice-quality features feeding aclassifier) and the recurring difficulties: emotion is continuous andcontext-dependent, labeled datasets are small and often acted rather thanspontaneous, and accuracy degrades sharply when a model trained on one corpus orlanguage meets another. Modern SER replaces or augments the hand-engineeredfeatures with learned audio embeddings of the wav2vec family, but the validityproblems the survey names (label quality, acted-versus-spontaneous mismatch,cross-corpus generalization) are exactly the construct-validity concerns(@sec-construct-vs-variable) that this book insists on, and they do not disappearwhen the feature extractor becomes a neural network.Prosody also carries persuasion, not only emotion. Zoghaib (2019, *Recherche etApplications en Marketing*, doi:10.1177/2051570719828687) manipulates a speaker'svoice along acoustic dimensions and finds that lower-pitched, smoother (lessrough), and duller (less bright) voices are the more persuasive, with effects thatinteract with speaker gender. This is a direct demonstration that the very featuresthe demonstration computed, $f_0$ level and spectral centroid and voice-qualityroughness, are not bookkeeping quantities but marketing-relevant levers that shapeconsumer response to the same words. The managerial reading is that a brandchoosing a spokesperson, a voice actor for an ad, or a timbre for a synthesizedassistant is choosing a persuasion parameter, and that choice can now be measuredacoustically rather than left to intuition.The discipline's standing caution applies with force here. Acoustic-emotionfeatures are generated regressors, and the model that generates them was trained ondata whose emotional labels, speaker demographics, and recording conditions maydiffer systematically from the marketing setting of interest. A vocal-frustrationscore that is more accurate for some accents than others, or for studio audio thanfor telephone audio, injects a bias that can correlate with the outcome. Theremedy is the one used throughout this part: validate the generated feature againsthuman judgment on a held-out sample from the target domain before trusting it in adownstream model.## Music and Sonic Branding {#sec-audio-music}Music is the oldest and most developed thread in audio-as-marketing, and theresearch is unusually clear that music in advertising is not decoration but aprocessing variable. Hecker (1984, *Psychology & Marketing*,doi:10.1002/mar.4220010303) gives the early statement of music's role inadvertising effect. The empirical core comes from a sequence of careful studies.Milliman (1982, *Journal of Marketing*, doi:10.1177/002224298204600313) shows in afield setting that the tempo of background music changes the pace at whichsupermarket shoppers move and, with it, sales volume: slow music slows shoppers andraises spending. North, Hargreaves, and McKendrick (1999, *Journal of AppliedPsychology*, doi:10.1037/0021-9010.84.2.271) demonstrate that the *style* ofin-store music biases product choice, with French music lifting French wine salesand German music lifting German wine sales, an effect operating largely outsideshoppers' awareness. For advertising specifically, Kellaris and Cox (1989,*Journal of Consumer Research*, doi:10.1086/209199) reassess earlier claims aboutbackground music and persuasion, and Kellaris, Cox, and Cox (1993, *Journal ofMarketing*, doi:10.1177/002224299305700409) provide the contingency account thatorganizes the field: music helps message reception when its attention-gainingproperties and its congruency with the message work together, and can hurt whenthey conflict. The lesson is that music's effect is moderated by fit, not uniformlypositive, which is exactly why measuring musical content matters.Measuring that content is the province of *music information retrieval* (MIR), whichextracts tempo and beat, key and mode (major versus minor, a strong correlate ofhappy versus sad), timbre via the same MFCCs used for speech, and learnedembeddings of valence and arousal. These features let a researcher characterize themusic in thousands of ads or podcast segments and relate it to recall, attitude,and sales at a scale the classic studies could not reach. The bridge from theclassic effects literature to MIR-based measurement at scale is one of the moretractable open opportunities in the modality.Sonic branding extends music's logic from the spot to the brand. Zoghaib,Luffarelli, and Feiereisen (2023, *Psychology & Marketing*,doi:10.1002/mar.21875) show that structural properties of a brand's music, anirregular melodic contour and an unstable tonality, raise perceived brandinnovativeness and brand evaluations, with processing difficulty as the mechanism.This connects sonic branding to the same aesthetic-complexity logic that governsvisual branding in @sec-image-processing: a moderate, productive difficulty inprocessing a stimulus can enhance rather than diminish response. The sound logo, thebrand voice, and the scoring of a brand's content are, on this evidence, designableassets whose acoustic properties carry measurable equity.## Industry and Production Practice {#sec-audio-practice}Bringing audio analysis into a working marketing organization confronts a set ofrealities the academic framing can understate.The first is *data access and consent*. Call recordings, voice-assistant logs, andany audio containing identifiable voices are sensitive personal data. The humanvoice is a biometric identifier, and many jurisdictions regulate it specifically;two-party-consent recording laws, the need for a lawful basis to process voicedata, and rules on automated decision-making all constrain what can be collectedand modeled. The privacy and governance considerations of @sec-privacy are not anafterthought for audio but a precondition for it, and they shape the dataset beforeany feature is extracted.The second is *audio quality and channel effects*. Telephone audio is band-limitedto roughly 300 to 3400 Hz and compressed by lossy codecs; far-field smart-speakeraudio carries room reverberation and background noise; podcast audio is studio-cleanbut heavily processed. These channel differences alter the very features a modelrelies on, so a pitch or emotion model trained on one channel can fail on another.Robust practice fixes the sampling rate and channel, applies consistentpreprocessing (resampling, normalization, optional noise reduction), and validateswithin the channel of deployment.The third is the *pipeline architecture* that the chapter has advocated throughout,now stated as engineering. A production system typically runs voice-activitydetection to find speech, diarization to separate speakers, ASR to transcribe, anda parallel acoustic-feature extractor for prosody and emotion, then fuses thelexical and paralinguistic streams for the downstream task. Each stage is a modelwith its own error rate, and those errors compound, so monitoring and human spot-checking at each stage are not optional.The fourth is *build-versus-buy*. A mature ecosystem of commercial speech-to-text,diarization, and emotion APIs (from the major cloud providers and specializedvendors) makes it unnecessary to train ASR from scratch, while open-source models(Whisper, wav2vec 2.0) and toolkits (`librosa` and `pyAudioAnalysis` in Python,`openSMILE` for standardized acoustic feature sets, and `tuneR` and `seewave` in Rfor the signal-processing primitives demonstrated above) make a transparent,auditable in-house pipeline feasible. The choice turns on data sensitivity (aregulated voice corpus may not leave the premises), the need for auditability in aresearch setting, and cost at scale. For the construct-validity reasons this bookstresses, a research pipeline benefits from the transparency of open tooling evenwhen a commercial API would be more accurate out of the box.## Frontier and Expansion {#sec-audio-frontier}Audio is the youngest of the well-developed marketing data modalities, and severalfrontiers are visible. The first is *self-supervised audio foundation models*. Justas wav2vec 2.0 learns speech representations from unlabeled audio and Whisperlearns robust transcription from weak supervision at scale, general-purpose audioembeddings increasingly provide a single representation that supports emotion,speaker, and content tasks with light fine-tuning, lowering the barrier toacoustic measurement in marketing much as pretrained image and text models did forthose modalities. The second is *multimodal fusion*, the natural endpoint of thispart: a podcast ad, a TikTok video (@sec-text-as-data and the image and videothreads), or a live-commerce stream is simultaneously audio, speech, on-screentext, and image, and the acoustic features developed here are one stream to befused with the others rather than analyzed alone. The third is *generative andsynthetic voice*: text-to-speech has become good enough that brands synthesizespokesperson voices and personalize audio at scale, which turns voice from ameasured variable into a designed one and raises fresh questions of authenticity,disclosure, and consumer trust. The fourth is the still-open *call-analyticsopportunity* flagged at the start: the service and sales voice channel remainscomparatively under-studied in top-tier marketing research, and a credibleidentification strategy applied to acoustic call features at scale is among theclearer contributions the modality invites.The synthesizing survey of machine learning and AI in marketing by Ma and Sun(2020, *International Journal of Research in Marketing*,doi:10.1016/j.ijresmar.2020.04.005) situates audio within the broader move toconnect computational representations of unstructured data to human marketinginsight, and it is the right place to read this chapter's modality back into thewhole. The consistent thread, from the MFCCs of 1980 to the foundation models oftoday, is the one this part repeats for every modality: a sound becomes a learned,lossy feature vector; that vector is a generated regressor; and the marketingpayoff comes from separating what is said from how it is said, then defending bothmeasurements before trusting either.