Marketing has always been a visual discipline. A logo, a package, an advertisement, an influencer’s selfie, a product photograph on a retail site—each is a deliberately constructed image meant to move attention, belief, and demand. Until recently those images were data only in a metaphorical sense: researchers looked at them, coded them by hand, and ran small experiments on a handful of manipulated stimuli. What has changed is that an image can now be turned into numbers—high-dimensional, machine-readable features—at the scale of millions of photographs, and those numbers can be entered into the same demand systems, choice models, and regressions that the rest of this book develops. This chapter is about how to do that responsibly: how to extract brand, aesthetic, and content features from images, what the deep-learning machinery underneath actually computes, and how to deploy the resulting features in advertising and social-media research without fooling oneself.
The intellectual move mirrors the one made for text (Chapter 43). There, unstructured language becomes a document–term matrix or a sequence of embeddings; here, an unstructured raster of pixels becomes a feature vector. In both cases the representation is lossy and learned, and in both cases the central empirical danger is the same: the features are generated by a model whose errors are correlated with the very outcomes the researcher wants to explain, so naive regression confounds measurement with effect. The payoff for getting it right is large. Image features let the analyst measure constructs that were previously locked inside qualitative judgment—a brand’s visual identity, an ad’s aesthetic appeal, the “warmth” of a product photo—and relate them to clicks, engagement, sales, and firm value.
The chapter proceeds from pixels to constructs to applications. It first fixes what an image is as a mathematical object and what we mean by an image feature. It then develops the workhorse of modern computer vision—the convolutional neural network—giving its estimator, its loss, and the assumptions under which a feature extracted from it is meaningful. With that in hand it turns to the three families of marketing-relevant features (brand, aesthetic, content) and to the econometrics of using generated image features as regressors. It closes with applications to advertising and social media, and with the identification pitfalls that separate a credible image-as-data study from a decorative one.
45.1 What an Image Is
Formally, a digital image is a function sampled on a grid. A color image of height \(H\) and width \(W\) is a third-order tensor
\[
\mathbf{X} \in [0,1]^{H \times W \times C}, \qquad C = 3,
\tag{45.1}\]
where the \(C = 3\) channels hold red, green, and blue intensities and each entry \(\mathbf{X}_{ijc}\) is a normalized pixel value. A modest \(224 \times 224\) RGB image (the canonical input size for many vision models) already lives in a space of dimension \(224 \times 224 \times 3 = 150{,}528 \approx 1.5 \times 10^{5}\). This is the curse of dimensionality in its rawest form: the number of pixel inputs vastly exceeds the number of labeled examples in a typical marketing dataset, and pixels are individually almost meaningless. The pixel at position \((112, 60)\) carries no stable interpretation; what matters is the arrangement of pixels into edges, textures, objects, and scenes.
Two properties of images dictate everything about how they are modeled. First, locality: meaningful structure (an edge, a corner) is built from nearby pixels, so a useful representation should aggregate local neighborhoods before global ones. Second, translation structure: a logo is the same logo whether it sits in the top-left or the center of the frame, so a useful representation should respond similarly to a pattern regardless of where it appears. A model that ignored these properties—say, a fully connected network treating each pixel as an unrelated input—would have to relearn “what an edge looks like” separately at every location and would need astronomically more data to do so. The architectures that dominate computer vision are precisely those that build locality and translation structure in by construction.
Note
An image feature is any function \(f: [0,1]^{H \times W \times C} \to \mathbb{R}^{d}\) that maps a raw image to a lower-dimensional vector intended to capture a construct of interest. Features range from hand-engineered and interpretable (mean saturation, number of detected faces, fraction of the frame occupied by a brand logo) to learned and opaque (the 2,048-dimensional penultimate-layer activations of a deep network). The art of images-as-data is choosing features whose dimensions a marketing theory can speak about.
45.2 Classical Features: Color, Composition, and Hand-Engineering
Before deep learning, vision in marketing relied on hand-engineered features: quantities a researcher computes with an explicit formula and can defend to a referee line by line. They remain valuable precisely because they are transparent, and several map directly onto long-standing aesthetic theory.
Color is the most tractable. RGB is poor for human-meaningful description because its axes (red, green, blue intensity) do not correspond to how people talk about color, so analysts transform to the HSV space (hue, saturation, value), in which hue is the dominant wavelength, saturation the colorfulness, and value the brightness. From an image one can compute the mean and dispersion of each channel, the share of warm versus cool hues, and the colorfulness index, and relate these to response. This connects to a classical account of aesthetic preference: Berlyne (1960) argued that hedonic value is an inverted-U function of arousal potential—stimuli that are too simple bore and too complex overwhelm, with moderate complexity, novelty, and contrast preferred. Color statistics, visual entropy, and edge density are all operationalizations of arousal potential, and the inverted-U is a recurring empirical shape in this literature.
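The HSV statistics described above can be sketched in a few lines of R. This is a minimal illustration on a made-up image, not a recommended pipeline: the random pixel data, the variable names, and in particular the hue cut-offs that define "warm" are assumptions introduced here. It uses `grDevices::rgb2hsv`, which ships with base R.

```r
# Hypothetical 32 x 32 RGB image in [0, 1] with a reddish cast (illustrative data).
set.seed(2)
img <- array(runif(32 * 32 * 3), dim = c(32, 32, 3))
img[, , 1] <- pmin(img[, , 1] + 0.4, 1)   # push the red channel up

# rgb2hsv() returns a 3 x n matrix whose rows are named "h", "s", "v".
hsv_mat <- grDevices::rgb2hsv(as.vector(img[, , 1]),
                              as.vector(img[, , 2]),
                              as.vector(img[, , 3]),
                              maxColorValue = 1)
hue <- hsv_mat["h", ]
sat <- hsv_mat["s", ]
val <- hsv_mat["v", ]

# Assumed convention: "warm" hues run from reds through yellows, i.e. hue on
# the 0-1 circle below 1/6 or above 11/12. Other cut-offs are defensible.
warm_share <- mean(hue < 1/6 | hue > 11/12)

color_features <- c(mean_sat = mean(sat), sd_sat = sd(sat),
                    mean_val = mean(val), sd_val = sd(val),
                    warm_share = warm_share)
print(round(color_features, 3))
```

Each entry of `color_features` is a scalar that can enter a regression directly, which is what makes this family of features easy to defend line by line.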
Composition features quantify where content sits and how much of it there is. Visual complexity—the amount and variety of detail in an image—has a long pedigree in advertising research, where Pieters, Wedel, and Batra (2010) distinguish feature complexity (irregular, dense visual elements) from design complexity (the elaborateness of the deliberate arrangement) and show the two have opposite effects on attention and attitude: feature complexity hurts brand attention while design complexity helps it. Earlier, Pieters, Wedel, and Zhang (2007) established that the eye-trackable structure of an ad—how gaze is distributed across the brand, the pictorial, and the text elements—predicts memory. Visual complexity is commonly proxied by file size after compression, by edge density from a Sobel or Canny filter, or by the entropy of the color histogram. The point of cataloguing these is not nostalgia: hand-engineered features are still the right tool when the construct is simple, the sample is small, or interpretability is paramount, and they make excellent controls alongside learned features.
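Two of the complexity proxies named above, histogram entropy and compressed size, are simple enough to sketch directly. The example below is illustrative only: the two grayscale images are synthetic, the 32-bin histogram is an arbitrary choice, and gzip (via base R's `memCompress`) stands in for the image codec a real study would use.

```r
# Two hypothetical 64 x 64 grayscale images in [0, 1]: flat versus noisy.
set.seed(3)
flat  <- matrix(0.5, 64, 64)
noisy <- matrix(runif(64 * 64), 64, 64)

# Shannon entropy (in bits) of the intensity histogram with equal-width bins.
hist_entropy <- function(im, bins = 32) {
  bin_id <- pmin(floor(im * bins) + 1, bins)   # map [0, 1] to 1..bins
  counts <- tabulate(bin_id, nbins = bins)
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

# Size in bytes after quantizing to 8 bits and gzip-compressing the pixels.
# gzip is an assumed stand-in for JPEG or PNG file size.
compressed_size <- function(im) {
  bytes <- as.raw(round(as.vector(im) * 255))
  length(memCompress(bytes, type = "gzip"))
}

complexity <- data.frame(
  image      = c("flat", "noisy"),
  entropy    = c(hist_entropy(flat), hist_entropy(noisy)),
  gzip_bytes = c(compressed_size(flat), compressed_size(noisy))
)
print(complexity)
```

The flat image has zero histogram entropy and compresses to a handful of bytes, while the noisy image scores high on both, which is the ordering a complexity proxy should produce.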
A compact illustration computes interpretable color and complexity features for a synthetic image and shows how they vary with content.
```r
set.seed(34)

# Build three synthetic 64x64 RGB images with known properties:
# (a) a calm, low-saturation gray scene; (b) a vivid warm scene;
# (c) a high-complexity noisy scene. Each is an array [H, W, C] in [0,1].
make_image <- function(base, noise_sd) {
  # rep(..., each = 64 * 64) fills channel c with base[c]. Passing `base`
  # straight to array() would recycle it across pixels, not channels.
  arr <- array(rep(base, each = 64 * 64), dim = c(64, 64, 3))
  arr <- arr + array(rnorm(64 * 64 * 3, 0, noise_sd), dim = dim(arr))
  pmin(pmax(arr, 0), 1)
}
img_calm <- make_image(c(0.55, 0.55, 0.55), 0.02)  # near-gray, low noise
img_warm <- make_image(c(0.85, 0.35, 0.15), 0.04)  # warm (red/orange)
img_busy <- make_image(c(0.50, 0.50, 0.50), 0.25)  # high-variance "busy"

# RGB -> HSV saturation and a colorfulness proxy (Hasler-Susstrunk style).
sat_value <- function(im) {
  mx <- pmax(im[, , 1], im[, , 2], im[, , 3])
  mn <- pmin(im[, , 1], im[, , 2], im[, , 3])
  mean(ifelse(mx > 0, (mx - mn) / mx, 0))  # mean saturation
}
colorfulness <- function(im) {
  rg <- im[, , 1] - im[, , 2]
  yb <- 0.5 * (im[, , 1] + im[, , 2]) - im[, , 3]
  sqrt(sd(rg)^2 + sd(yb)^2) + 0.3 * sqrt(mean(rg)^2 + mean(yb)^2)
}

# Edge density as a translation-invariant complexity proxy: mean absolute
# horizontal+vertical gradient of the luminance channel.
edge_density <- function(im) {
  lum <- 0.299 * im[, , 1] + 0.587 * im[, , 2] + 0.114 * im[, , 3]
  gx <- abs(lum[-1, ] - lum[-nrow(lum), ])
  gy <- abs(lum[, -1] - lum[, -ncol(lum)])
  mean(gx) + mean(gy)
}

features <- data.frame(
  image        = c("calm", "warm", "busy"),
  saturation   = c(sat_value(img_calm), sat_value(img_warm), sat_value(img_busy)),
  colorfulness = c(colorfulness(img_calm), colorfulness(img_warm), colorfulness(img_busy)),
  edge_density = c(edge_density(img_calm), edge_density(img_warm), edge_density(img_busy))
)
features[, -1] <- round(features[, -1], 2)
knitr::kable(features,
             caption = "Interpretable color and complexity features for three synthetic images.")
```

Interpretable color and complexity features for three synthetic images.

| image | saturation | colorfulness | edge_density |
|:------|-----------:|-------------:|-------------:|
| calm  | 0.06 | 0.04 | 0.03 |
| warm  | 0.82 | 0.28 | 0.06 |
| busy  | 0.58 | 0.45 | 0.36 |

The warm image scores highest on saturation and the busy image highest on edge density, the digital analogue of Berlyne’s (1960) arousal potential. The busy image also scores high on saturation and colorfulness, because independent noise in each channel produces spurious color: these statistics respond to noise as readily as to deliberate color choices. These features are cheap, reproducible, and interpretable, but they are blind to meaning: they cannot tell a logo from a face. For meaning, the field turns to learned representations.
45.3 Deep Representations: The Convolutional Neural Network
The central tool of modern computer vision is the convolutional neural network (CNN). Its design is a direct response to the two properties of images named above. Rather than connect every pixel to every hidden unit, a CNN slides small learnable filters across the image, so each unit sees only a local neighborhood (locality) and the same filter is applied at every position (translation structure, and an enormous reduction in parameters).
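The size of that parameter reduction is easy to make concrete. The arithmetic below is a sketch under assumed sizes (a \(224 \times 224 \times 3\) input, 64 hidden units or filters, and \(3 \times 3\) kernels); none of these numbers comes from a specific architecture.

```r
# Illustrative sizes (assumptions): a 224 x 224 x 3 input, 64 output units or
# filters, and 3 x 3 kernels.
H <- 224; W <- 224; C_in <- 3; C_out <- 64; k <- 3

# Fully connected: each of the 64 units has one weight per input value plus a bias.
params_dense <- (H * W * C_in + 1) * C_out

# Convolutional: each of the 64 filters has k * k * C_in weights plus a bias,
# reused at every spatial position.
params_conv <- (k^2 * C_in + 1) * C_out

cat("Dense layer parameters:", params_dense, "\n")
cat("Conv layer parameters: ", params_conv, "\n")
cat("Ratio (dense / conv):  ", round(params_dense / params_conv), "\n")
```

Under these assumptions the dense layer needs several thousand times as many parameters as the convolutional layer, before any deeper layers are added.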
45.3.1 The Convolution Operation
A convolutional layer applies a bank of small filters (kernels) to its input. Let the input be a feature map \(\mathbf{Z} \in \mathbb{R}^{H \times W \times C_{\text{in}}}\) and let a single filter be \(\mathbf{K} \in \mathbb{R}^{k \times k \times C_{\text{in}}}\) with bias \(b\). The output at spatial location \((i,j)\) is
\[
\mathbf{A}_{ij} = \sigma\!\left( b + \sum_{u=1}^{k} \sum_{v=1}^{k} \sum_{c=1}^{C_{\text{in}}} \mathbf{K}_{uvc}\, \mathbf{Z}_{i+u-1,\, j+v-1,\, c} \right),
\tag{45.2}\]
where \(\mathbf{A}_{ij}\) is the filter’s activation at \((i,j)\) and \(\sigma(\cdot)\) is a nonlinearity, almost always the rectified linear unit \(\sigma(z) = \max(0, z)\). A layer holds \(C_{\text{out}}\) such filters, producing an output tensor with \(C_{\text{out}}\) channels. Three properties make Equation 45.2 the right primitive. Parameter sharing: one filter, with \(k^2 C_{\text{in}} + 1\) parameters, is reused at every location, so a layer learns “detect this pattern anywhere” rather than memorizing locations. Local connectivity: each output depends only on a \(k \times k\) window, encoding locality. Translation equivariance: shifting the input shifts the output identically, so a logo detector fires wherever the logo appears.
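A single-filter convolutional layer is short enough to write out by hand, which also makes translation equivariance checkable. The sketch below is for illustration only: it uses explicit loops rather than an optimized library, applies no padding, and runs on a made-up \(8 \times 8\) input with one channel; the function name `conv_relu` is introduced here.

```r
# Valid (no-padding) convolution of one k x k x C_in filter K over a feature
# map Z, followed by ReLU. Loops are used for clarity, not speed.
conv_relu <- function(Z, K, b = 0) {
  k <- dim(K)[1]
  H_out <- dim(Z)[1] - k + 1
  W_out <- dim(Z)[2] - k + 1
  out <- matrix(0, H_out, W_out)
  for (i in seq_len(H_out)) {
    for (j in seq_len(W_out)) {
      patch <- Z[i:(i + k - 1), j:(j + k - 1), , drop = FALSE]
      out[i, j] <- max(0, sum(patch * K) + b)   # ReLU of the filter response
    }
  }
  out
}

# Hypothetical input: a bright 2 x 2 blob on a dark 8 x 8 x 1 background.
Z <- array(0, dim = c(8, 8, 1))
Z[2:3, 2:3, 1] <- 1
K <- array(1, dim = c(2, 2, 1))   # a filter that sums each 2 x 2 window

# The same blob shifted two rows down and three columns right.
Z_shift <- array(0, dim = c(8, 8, 1))
Z_shift[4:5, 5:6, 1] <- 1

A       <- conv_relu(Z, K)
A_shift <- conv_relu(Z_shift, K)
peak       <- which(A == max(A), arr.ind = TRUE)
peak_shift <- which(A_shift == max(A_shift), arr.ind = TRUE)
print(peak_shift - peak)   # displacement of the strongest response
```

The strongest response moves by exactly the shift applied to the input (two rows, three columns), which is what translation equivariance means in practice.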
Convolutional layers are interleaved with pooling layers that downsample—most commonly max pooling, which reports the maximum activation in each small window—building tolerance to small shifts and shrinking the spatial resolution while the channel dimension grows. Stacking these operations yields a representational hierarchy that has been verified empirically: early layers respond to oriented edges and color blobs, middle layers to textures and motifs, and late layers to object parts and whole objects (Zeiler and Fergus 2014). Figure 45.1 sketches the pipeline from raw pixels to a task head.
```mermaid
flowchart LR
  A["Raw image<br/>H × W × 3"] --> B["Conv + ReLU<br/>(edges, color)"]
  B --> C["Pool<br/>(downsample)"]
  C --> D["Conv + ReLU<br/>(textures, motifs)"]
  D --> E["Conv + ReLU<br/>(object parts)"]
  E --> F["Embedding z ∈ ℝ^d<br/>(penultimate layer)"]
  F --> G["Task head"]
  G --> H1["Classification<br/>(brand present?)"]
  G --> H2["Detection<br/>(where is the logo?)"]
  G --> H3["Regression<br/>(aesthetic score)"]
```
Figure 45.1: A convolutional network as a feature extractor with a task-specific head. Early layers encode generic low-level structure; deep layers encode semantic content. Marketing applications usually freeze the backbone and read off the penultimate-layer embedding.
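The pooling step in this pipeline can be sketched as follows, to show why it builds tolerance to small shifts. The function `max_pool` and the \(4 \times 4\) activation maps are illustrative assumptions; the sketch handles one channel and non-overlapping windows only.

```r
# Non-overlapping max pooling (stride equal to window size) on one channel.
max_pool <- function(A, size = 2) {
  H_out <- nrow(A) %/% size
  W_out <- ncol(A) %/% size
  out <- matrix(0, H_out, W_out)
  for (i in seq_len(H_out)) {
    for (j in seq_len(W_out)) {
      rows <- ((i - 1) * size + 1):(i * size)
      cols <- ((j - 1) * size + 1):(j * size)
      out[i, j] <- max(A[rows, cols])   # strongest activation in the window
    }
  }
  out
}

# A hypothetical activation at the top-left corner, and the same activation
# moved one pixel down and right (still inside the same 2 x 2 window).
A <- matrix(0, 4, 4);       A[1, 1] <- 5
A_moved <- matrix(0, 4, 4); A_moved[2, 2] <- 5

pooled_same <- identical(max_pool(A), max_pool(A_moved))
print(pooled_same)
```

The two inputs pool to the same \(2 \times 2\) map, so later layers cannot tell them apart. The tolerance holds only for shifts that stay within a pooling window, which is why it is described as tolerance to small shifts rather than full invariance.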
45.3.2 The Estimator and Its Loss
A CNN with parameters \(\boldsymbol{\theta}\) (all filter weights and biases) defines a map \(g_{\boldsymbol{\theta}}: \mathbf{X} \mapsto \hat{\mathbf{y}}\). For a classification task with \(L\) labels (e.g., “contains a car,” “contains a dog”), the final layer produces a probability vector via the softmax, \(\hat{p}_\ell = \exp(s_\ell)/\sum_{m}\exp(s_m)\), where \(s_\ell\) are the network’s output scores (logits). Given labeled training data \(\{(\mathbf{X}_n, \mathbf{y}_n)\}_{n=1}^{N}\), the parameters minimize the cross-entropy loss with weight penalty,
\[
\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \; -\frac{1}{N} \sum_{n=1}^{N} \sum_{\ell=1}^{L} y_{n\ell} \log \hat{p}_{\ell}(\mathbf{X}_n; \boldsymbol{\theta}) + \lambda \lVert \boldsymbol{\theta} \rVert_2^2,
\tag{45.3}\]
where \(y_{n\ell}\) indicates whether image \(n\) carries label \(\ell\). The problem is solved by stochastic gradient descent: gradients are computed on mini-batches by backpropagation and the parameters stepped against them. The penalty \(\lambda \lVert \boldsymbol{\theta}\rVert_2^2\) (weight decay) is one of several regularizers (dropout, data augmentation, and early stopping are others) that the over-parameterized regime makes essential. This is the same empirical-risk-minimization template developed for prediction in Chapter 65; what is special here is only the architecture of \(g_{\boldsymbol{\theta}}\).
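The softmax and the penalized cross-entropy objective can be evaluated in a few lines. The sketch below is a toy under stated assumptions: the "network" is a linear map `X %*% Theta` rather than a CNN, the three observations are made up, and the loss is averaged over observations; the function names are introduced here.

```r
# Numerically stable softmax over the rows of a score (logit) matrix S [N x L].
softmax <- function(S) {
  E <- exp(S - apply(S, 1, max))   # subtract each row's max before exponentiating
  E / rowSums(E)
}

# Average cross-entropy plus an L2 (weight-decay) penalty.
# Y is a one-hot [N x L] label matrix; theta is the parameter vector.
penalized_loss <- function(S, Y, theta, lambda) {
  P <- softmax(S)
  -mean(rowSums(Y * log(P))) + lambda * sum(theta^2)
}

# Hypothetical toy problem: 3 observations, 2 labels, linear scores.
X <- matrix(c(1, 0,
              0, 1,
              1, 1), nrow = 3, byrow = TRUE)
Y <- matrix(c(1, 0,
              0, 1,
              1, 0), nrow = 3, byrow = TRUE)
Theta <- matrix(0, 2, 2)

loss_at_zero <- penalized_loss(X %*% Theta, Y, as.vector(Theta), lambda = 0.01)
cat("Loss at Theta = 0:", loss_at_zero, "\n")
```

With all parameters at zero every label receives probability one half, so the loss equals \(\log 2\); training lowers it from there while the penalty discourages large weights.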
Warning
A CNN trained by Equation 45.3 minimizes predictive loss, not the recovery of any causal or structural quantity. Its outputs are calibrated to the training distribution and labels, nothing more. Treating a predicted label or an embedding as if it were a ground-truth measurement—rather than an estimate with distribution-dependent error—is the original sin of images-as-data, and Section 45.5 shows what it costs.
45.3.3 Transfer Learning: Why Marketing Rarely Trains From Scratch
No marketing dataset is large enough to estimate the tens of millions of parameters in a modern CNN from scratch. The field instead relies on transfer learning: take a backbone network pre-trained on a massive general-purpose corpus (canonically ImageNet, with roughly a million labeled images across a thousand object categories), discard its task head, and reuse its learned representation. The canonical backbones trace the field’s progress: the deep CNN that launched it (Krizhevsky, Sutskever, and Hinton 2017), the deeper inception architecture (Szegedy et al. 2015), and the residual networks (He et al. 2016) that remain a default image encoder. Any of them can be downloaded pre-trained and reused. Two modes are common. In feature extraction, the backbone is frozen and the penultimate-layer activations \(\mathbf{z} = h_{\boldsymbol{\theta}}(\mathbf{X}) \in \mathbb{R}^{d}\), the embedding, are treated as a fixed, off-the-shelf feature vector fed to a simple downstream model (logistic regression, gradient boosting) trained on the marketing labels. In fine-tuning, the backbone weights are unfrozen and updated, usually with a small learning rate, so the representation adapts to the target domain.
The justification is the hierarchy of Figure 45.1: early- and middle-layer features (edges, textures, parts) are nearly universal across natural images, so they transfer; only the late, task-specific layers must be relearned. The practical rule is to fine-tune more layers the larger and more domain-specific the target data, and freeze more the smaller and more generic. The same logic underlies the brand-image work of Liu, Dzyabura, and Mizik (2020), who train a multi-label convolutional network on consumer-created images to recover perceptual brand attributes (see Section 45.4.1). For most marketing studies—where the labeled sample numbers in the thousands, not millions—frozen-backbone feature extraction is both the safest and the most reproducible choice.
```r
set.seed(34)

# Simulate the *output* of a frozen CNN backbone: each image is represented by a
# d-dimensional embedding z. In practice z = h_theta(image) from a pretrained
# network; here we generate embeddings whose geometry encodes two latent
# "visual styles" so a downstream classifier can separate them.
d <- 16; n_per <- 150
style_A_mean <- rnorm(d, 0, 1)                    # e.g., "minimalist product shot"
style_B_mean <- style_A_mean + rnorm(d, 0, 1.2)   # e.g., "lifestyle scene"
Z_A <- matrix(rnorm(n_per * d), n_per, d) +
  matrix(style_A_mean, n_per, d, byrow = TRUE)
Z_B <- matrix(rnorm(n_per * d), n_per, d) +
  matrix(style_B_mean, n_per, d, byrow = TRUE)
Z <- rbind(Z_A, Z_B)
y <- factor(rep(c("minimalist", "lifestyle"), each = n_per))

# Downstream task: a simple logistic head on the frozen embedding (no CNN
# retraining). This is "feature extraction" transfer learning in miniature.
df <- data.frame(y = y, Z)
idx <- sample(nrow(df), 0.7 * nrow(df))
fit <- glm(y ~ ., data = df[idx, ], family = binomial)

# glm() models the probability of the second factor level, "minimalist".
pred <- ifelse(predict(fit, df[-idx, ], type = "response") > 0.5,
               "minimalist", "lifestyle")
acc <- mean(pred == df[-idx, "y"])
cat("Held-out accuracy of logistic head on frozen embedding:",
    round(acc, 3), "\n")
#> Held-out accuracy of logistic head on frozen embedding: 0.989
```
The example deliberately separates the expensive, transferable part (the embedding, treated as given) from the cheap, task-specific part (a logistic head the analyst actually estimates)—the division of labor that makes images-as-data feasible in marketing.
45.4 Three Families of Marketing Image Features
Marketing-relevant image features fall into three families—brand, aesthetic, and content—distinguished by the construct they measure and the machinery they require. Table 45.1 summarizes them; the subsections that follow develop each in turn.
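Table 45.1: Three families of marketing image features, the construct each targets, and how it is measured.

| Family | Construct | Method | Output |
|:-------|:----------|:-------|:-------|
| Brand | Presence/prominence of brand marks; visual brand identity | Logo detection; learned brand-attribute classifiers | Bounding boxes and confidence scores; share of visual voice; perceptual-attribute scores |
| Aesthetic | Beauty, professionalism, arousal potential | Hand-engineered color and complexity indices; CNN regression on human ratings | Continuous aesthetic score |
| Content | What is depicted: objects, scenes, faces | Object detection; scene recognition; face analysis | Object labels & counts, scene class, facial affect |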
45.4.1 Brand Features
Brand features answer two questions: is the brand here, and how prominently? The first is logo detection—an object-detection task that returns bounding boxes and confidence scores for known marks—and it underwrites a quantity of direct managerial interest: the share of visual voice, the fraction of image area (or of impressions) a brand’s marks occupy across a corpus of user- or sponsor-generated images. This matters because brand exposure on social media is increasingly incidental, embedded in content the firm does not control, so counting hashtags understates true exposure while logo detection captures it. The notion of prominence—the conspicuousness of a brand’s mark—carries status-signaling consequences developed in Chapter 11, and image data let it be measured at scale rather than coded by hand. The design of the mark itself is consequential: Luffarelli, Mukesh, and Mahmood (2019) show that descriptive logos (those that visually signal what the brand does) raise brand equity, and Mahmood, Luffarelli, and Mukesh (2019) find that complex visual logo cues shape equity-crowdfunding outcomes—logo features that computer vision can now extract at scale. How consumers depict brands in their own photos is itself a measurable typology: Hartmann et al. (2021a) use computer vision to distinguish “brand selfies” (the consumer in frame with the product) from “packshots” and show the two drive engagement differently.
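Given detector output, the share of visual voice reduces to a grouped sum. The sketch below assumes a hypothetical table of detections with one row per bounding box; the column names, the values, the equal-image-size simplification, and the 0.5 confidence threshold are all assumptions made for illustration.

```r
# Hypothetical logo-detector output: one row per detected box, with box area
# expressed as a fraction of its image's area.
detections <- data.frame(
  image_id  = c(1, 1, 2, 3, 3, 4),
  brand     = c("A", "B", "A", "A", "B", "B"),
  area_frac = c(0.10, 0.05, 0.20, 0.02, 0.08, 0.15),
  conf      = c(0.95, 0.40, 0.88, 0.91, 0.76, 0.97)
)
n_images <- length(unique(detections$image_id))

# Keep confident detections only. The 0.5 threshold is an assumption that
# should be tuned against human-labeled validation images.
kept <- detections[detections$conf >= 0.5, ]

# Fraction of total corpus image area each brand's marks occupy
# (assumes equally sized images).
area_by_brand <- tapply(kept$area_frac, kept$brand, sum)
coverage <- area_by_brand / n_images

# The same quantity normalized across brands, so that shares sum to one.
share_of_voice <- area_by_brand / sum(area_by_brand)
print(round(rbind(coverage, share_of_voice), 3))
```

Because the threshold determines which boxes count, reporting how the shares move as it varies is a cheap robustness check.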
The second, deeper question is whether an image conveys a brand’s perceptual identity. Liu, Dzyabura, and Mizik (2020) answer it directly: their BrandImageNet trains a multi-label CNN to detect abstract perceptual attributes (e.g., “glamorous,” “rugged,” “fun”) in ordinary consumer-created images, and the machine-derived portrait of a brand’s image tracks survey-based perceptions in near real time. Complementary approaches elicit brand perception by having consumers select images rather than rate scales (Dzyabura and Peres 2021) and mine brand attributes from the social-network structure around a brand (Culotta and Cutler 2016). This is a conceptual advance over logo counting, because it measures what a brand means visually, not merely whether it is shown, and it links image data to the brand-meaning construct of Batra (2019), who locates visual cues (advertising, packaging) among the primary sources from which brand associations are built. Crucially, such attribute classifiers are learned, so their outputs are estimates with brand- and context-dependent error; this is exactly the generated-regressor problem formalized below.
45.4.2 Aesthetic Features
Aesthetic features score an image on beauty, professionalism, or arousal potential. Two routes exist. The hand-engineered route assembles the color and complexity statistics of the previous section into an interpretable index grounded in the Berlyne (1960) inverted-U. The learned route trains a CNN to regress human aesthetic ratings onto images, yielding a single continuous “aesthetic score” per image. Each route has its place: learned scores predict human judgment far better but are opaque and may encode spurious correlates of the training raters’ tastes; hand-engineered indices are weaker predictors but transparent and manipulable, which matters when the goal is advice (“raise saturation”) rather than mere ranking.
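An inverted-U is usually tested by adding a squared term and locating the turning point. The sketch below does this on simulated data; the complexity index, the liking outcome, and the true peak at 0.6 are all assumptions built into the simulation, not estimates from any study.

```r
set.seed(4)
n <- 500
complexity <- runif(n, 0, 1)   # hypothetical hand-engineered complexity index

# Simulated inverted-U with its peak at 2.4 / (2 * 2) = 0.6 by construction.
liking <- 1 + 2.4 * complexity - 2 * complexity^2 + rnorm(n, 0, 0.2)

fit <- lm(liking ~ complexity + I(complexity^2))
b <- coef(fit)

# A concave fit (negative squared term) peaks at -b1 / (2 * b2).
peak <- -b[["complexity"]] / (2 * b[["I(complexity^2)"]])
cat("Estimated squared term:", round(b[["I(complexity^2)"]], 2), "\n")
cat("Estimated peak:", round(peak, 2), "\n")
```

A negative squared term is necessary but not sufficient for an inverted-U: the estimated peak should also fall inside the observed range of the index, with data on both sides of it.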
Aesthetics is not decoration. In advertising, the visual pleasure of an ad is a proximate driver of attention and attitude, mediated by the complexity mechanism of Pieters, Wedel, and Batra (2010) and by the gaze patterns of Pieters, Wedel, and Zhang (2007), and the rhetorical figuration of an image—visual metaphor, pun, and other tropes—shapes persuasion in ways parallel to verbal rhetoric (McQuarrie and Mick 1999). Specific perceptual choices carry meaning: color hue and saturation move brand personality and downstream response (Labrecque and Milne 2012; Labrecque, Patrick, and Milne 2013), and aesthetic styling trades off against perceived functionality rather than improving evaluation monotonically (Hagtvedt and Patrick 2014). In social commerce, image quality is a documented driver of demand: professionally composed product and listing photos lift engagement and conversion, and Zhang et al. (2022) show, using interpretable deep-learning image features on Airbnb listings, which properties of a photo (composition, brightness, depth) actually raise demand. The aesthetic score is therefore a feature with a theory behind it, not a black box to be regressed on outcomes for its own sake.
45.4.3 Content Features
Content features identify what is depicted. Object detection returns the set, count, and location of recognizable objects (a person, a beverage, a beach); scene recognition classifies the overall setting (kitchen, gym, nightclub); face analysis locates faces and, more controversially, infers attributes such as apparent emotion, age, or gaze direction. Content features connect images to substantive marketing theory: the presence of people versus products in a post, the warmth of depicted facial expressions, and the activities shown all map onto constructs—self-presentation, social proof, lifestyle congruence—that the social-media and influencer literatures (Chapter 17) treat as drivers of engagement.
Warning
Face- and demographic-inference features are the most fraught in this chapter. Apparent-emotion and inferred-demographic classifiers are trained on labeled data that encode the annotators’ cultural assumptions; their errors are systematically correlated with skin tone, age, and gender, raising both validity and fairness concerns. The privacy implications of inferring protected attributes from images are governed by the regimes in Chapter 24. Researchers should treat such features as low-confidence proxies, audit error rates by subgroup, and prefer coarse, well-validated labels (face present / absent) over fine-grained inferences (specific emotion, exact age) wherever the theory permits.
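A subgroup audit of the kind recommended above amounts to tabulating classifier error by group on a human-labeled validation sample. The sketch below is simulated: the two groups, the labels, and the error rates (5 percent versus 20 percent) are assumptions chosen so that the disparity is visible.

```r
# Hypothetical validation sample: subgroup, human label, classifier prediction.
set.seed(5)
n <- 1000
group <- sample(c("A", "B"), n, replace = TRUE)
truth <- rbinom(n, 1, 0.5)

# By construction the classifier errs more often on group "B".
err_rate <- ifelse(group == "A", 0.05, 0.20)
pred <- ifelse(runif(n) < err_rate, 1 - truth, truth)

val <- data.frame(group, truth, pred)
error_by_group <- tapply(val$pred != val$truth, val$group, mean)
n_by_group <- table(val$group)

print(round(error_by_group, 3))
print(n_by_group)
```

Reporting the group sizes alongside the error rates matters, because a small subgroup can show a large disparity by chance alone.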
45.5 The Econometrics of Generated Image Features
The defining methodological problem of images-as-data is that the features are generated regressors: they are not observed but predicted by a first-stage model (the CNN), and that prediction carries error. Suppose the structural object of interest is
\[
y_i = \alpha + \beta f_i^{\ast} + \mathbf{w}_i^{\top} \boldsymbol{\gamma} + \varepsilon_i,
\tag{45.4}\]
where \(y_i\) is an outcome (clicks, likes, sales), \(f_i^{\ast}\) is the true image feature (true aesthetic appeal, true logo prominence), and \(\mathbf{w}_i\) are controls. The analyst does not observe \(f_i^{\ast}\); the CNN supplies an estimate \(\hat f_i = f_i^{\ast} + u_i\). Regressing \(y_i\) on \(\hat f_i\) instead of \(f_i^{\ast}\) raises three distinct hazards.
Attenuation from classical measurement error. If \(u_i\) is mean-zero noise independent of \(f_i^{\ast}\) and \(\varepsilon_i\), the ordinary-least-squares estimand is biased toward zero by the reliability ratio,
\[
\operatorname{plim} \hat\beta = \beta \cdot \frac{\operatorname{Var}(f^{\ast})}{\operatorname{Var}(f^{\ast}) + \operatorname{Var}(u)} \equiv \beta \rho, \qquad 0 < \rho \le 1.
\tag{45.5}\]
The noisier the classifier (the smaller its reliability), the more the true effect is understated. A nonsignificant coefficient on an image feature can therefore mean “no effect” or “a noisy detector,” and the two are not distinguishable without an estimate of \(\operatorname{Var}(u)\).
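For readers who want the intermediate step, the attenuation factor follows from the probability limit of the OLS slope. The sketch below drops the controls \(\mathbf{w}_i\) and writes the intercept as \(\alpha\); both are simplifying assumptions made here only to keep the algebra short.

\[
\operatorname{plim}\hat\beta
= \frac{\operatorname{Cov}(y_i, \hat f_i)}{\operatorname{Var}(\hat f_i)}
= \frac{\operatorname{Cov}(\alpha + \beta f_i^{\ast} + \varepsilon_i,\; f_i^{\ast} + u_i)}{\operatorname{Var}(f_i^{\ast} + u_i)}
= \beta\,\frac{\operatorname{Var}(f^{\ast})}{\operatorname{Var}(f^{\ast}) + \operatorname{Var}(u)}.
\]

The numerator keeps only \(\beta \operatorname{Var}(f^{\ast})\) because \(u_i\) is independent of \(f_i^{\ast}\) and \(\varepsilon_i\), and \(\varepsilon_i\) is uncorrelated with \(f_i^{\ast}\); the denominator gains \(\operatorname{Var}(u)\). As a worked case, with \(\operatorname{Var}(f^{\ast}) = 1\), \(\operatorname{Var}(u) = 0.81\), and \(\beta = 1.5\), the ratio is \(1/1.81 \approx 0.55\) and the OLS slope converges to about \(0.83\).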
Non-classical, correlated error. Worse, CNN error is rarely classical. A logo detector may systematically miss small or occluded logos; an aesthetic regressor may rate professionally lit photos higher, and such photos may be posted by firms that also buy promotion. When \(\operatorname{Cov}(u_i, \varepsilon_i) \neq 0\), because the same unobserved factor (a firm’s production budget) drives both the classifier’s error and the outcome, \(\hat\beta\) is biased in an unknown direction and no reliability correction recovers \(\beta\). This is the image analogue of the endogeneity problems treated throughout Chapter 40, and it is the reason image features must be validated against human-labeled ground truth and, where possible, instrumented or held fixed within an experiment.
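A short simulation shows how a reliability correction can make matters worse when the error is correlated with the outcome shock. Everything below is an assumed data-generating process: a latent "budget" variable loads on both the classifier error and the outcome, with coefficients picked for illustration.

```r
set.seed(6)
n <- 4000
beta_true <- 1.5

f_star <- rnorm(n)                        # true (latent) image feature
budget <- rnorm(n)                        # unobserved confounder (hypothetical)
eps <- 0.8 * budget + rnorm(n, 0, 0.6)    # outcome shock loads on the confounder
u   <- 0.8 * budget + rnorm(n, 0, 0.4)    # so does the classifier's error

y     <- 2 + beta_true * f_star + eps
f_hat <- f_star + u

rho       <- var(f_star) / var(f_hat)     # reliability ratio
naive     <- coef(lm(y ~ f_hat))[["f_hat"]]
corrected <- naive / rho                  # valid only under classical error

cat("True slope:           ", beta_true, "\n")
cat("Naive OLS on f_hat:   ", round(naive, 3), "\n")
cat("Reliability-corrected:", round(corrected, 3), "\n")
```

In this parameterization the corrected slope converges to \(\beta + \operatorname{Cov}(u, \varepsilon)/\operatorname{Var}(f^{\ast}) = 1.5 + 0.64 = 2.14\), so the correction overshoots the truth instead of recovering it. With a negative covariance it would undershoot, which is why the direction of bias cannot be signed without knowing the error structure.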
Forbidden-regression and look-elsewhere risks. Because embeddings are high-dimensional, an analyst can search across hundreds of learned dimensions for one that correlates with the outcome and report it as a “discovered visual driver.” Without pre-registration or a held-out test, this is overfitting dressed as inference. The discipline that text-as-data imposes—split the sample, fix the representation before looking at outcomes, report out-of-sample fit—applies with full force here.
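The look-elsewhere hazard is easy to reproduce. In the sketch below the "embedding" is pure noise and the outcome is unrelated to every dimension, both by construction; the sample sizes and the 200 dimensions are arbitrary assumptions.

```r
set.seed(7)
n <- 400
d <- 200
Z <- matrix(rnorm(n * d), n, d)   # hypothetical embedding: noise by construction
y <- rnorm(n)                     # outcome unrelated to every dimension

train <- 1:200
test  <- 201:400

# Search the training half for the dimension most correlated with the outcome.
cors_train <- apply(Z[train, ], 2, function(z) cor(z, y[train]))
best <- which.max(abs(cors_train))   # the "discovered visual driver"

in_sample  <- abs(cors_train[best])
out_sample <- abs(cor(Z[test, best], y[test]))

cat("Best in-sample |correlation|:    ", round(in_sample, 3), "\n")
cat("Same dimension, held-out sample: ", round(out_sample, 3), "\n")
```

The winning dimension looks informative only in the half of the data that selected it. A pre-specified dimension, or a held-out evaluation like the second number, is what separates a finding from a search artifact.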
Three remedies recur. First, validate the feature against human labels on a held-out subsample and report the classifier’s accuracy or correlation with ground truth, so the reader can gauge \(\operatorname{Var}(u)\). Second, experimentally manipulate the feature rather than only observe it: the cleanest images-as-data designs randomize which image a user sees (an A/B test on ad creative) so that \(f_i\) is assigned, not estimated from confounded content, which severs \(\operatorname{Cov}(u_i, \varepsilon_i)\) by design. Third, correct or bound the bias: correct attenuation with an estimated reliability ratio when error is plausibly classical; when the error is classical but its variance cannot be estimated, report the OLS estimate as a conservative (attenuated) bound on a positive effect; and when the error may be non-classical, claim no bound at all, since the direction of bias is then unknown. The simulation below makes the attenuation in Equation 45.5 concrete and shows that a reliability correction recovers the true slope when error is classical.
```r
set.seed(34)
n <- 4000
beta_true <- 1.5

# True (latent) image feature f* and outcome y generated from it.
f_star <- rnorm(n, 0, 1)
y <- 2 + beta_true * f_star + rnorm(n, 0, 1)

# A noisy CNN measures f* with classical error: reliability rho = Var(f*)/Var(fhat).
sigma_u <- 0.9
f_hat <- f_star + rnorm(n, 0, sigma_u)
rho <- var(f_star) / var(f_hat)           # estimable from a validation set

naive <- coef(lm(y ~ f_hat))["f_hat"]     # attenuated OLS
corrected <- naive / rho                  # reliability-ratio correction

cat("True slope beta :", beta_true, "\n")
#> True slope beta : 1.5
cat("Reliability ratio (rho) :", round(rho, 3), "\n")
#> Reliability ratio (rho) : 0.556
cat("Naive OLS on f_hat :", round(naive, 3), " (attenuated toward 0)\n")
#> Naive OLS on f_hat : 0.828 (attenuated toward 0)
cat("Reliability-corrected :", round(corrected, 3), "\n")
#> Reliability-corrected : 1.49
```
The naive slope is pulled toward zero by approximately the reliability ratio; dividing by \(\rho\), which a validation set identifies, recovers the true slope up to sampling error. The correction works only under classical error; when the detector’s mistakes correlate with the outcome, no such fix exists and design-based identification is the only honest route.
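The simulation above computes the reliability ratio from the true feature for every observation, which a real study cannot do. A more realistic variant, sketched here, estimates it from a human-labeled validation subsample. The subsample size of 400 and the premise that human labels equal \(f^{\ast}\) without error are assumptions of this sketch.

```r
set.seed(8)
n <- 4000
beta_true <- 1.5
sigma_u <- 0.9

f_star <- rnorm(n)
y      <- 2 + beta_true * f_star + rnorm(n)
f_hat  <- f_star + rnorm(n, 0, sigma_u)

# Only a random validation subsample carries human labels of f*.
val <- sample(n, 400)
var_u_hat <- var(f_hat[val] - f_star[val])   # error variance from validation

# Under classical error Var(f_hat) = Var(f*) + Var(u), so
# rho = 1 - Var(u) / Var(f_hat), with Var(f_hat) taken from the full sample.
rho_hat <- 1 - var_u_hat / var(f_hat)

naive     <- coef(lm(y ~ f_hat))[["f_hat"]]
corrected <- naive / rho_hat

cat("Estimated reliability:", round(rho_hat, 3), "\n")
cat("Naive OLS slope:      ", round(naive, 3), "\n")
cat("Corrected slope:      ", round(corrected, 3), "\n")
```

Because `rho_hat` is itself estimated, the corrected slope is noisier than the naive one; standard errors that ignore this first step (for example, without bootstrapping both stages) will be too small.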
45.6 Applications in Advertising and Social Media
45.6.1 Advertising Creative
The richest application is the measurement and optimization of advertising creative. The visual content of an ad (its complexity, color, brand prominence, and pacing) drives whether viewers attend, remember, and respond, and image (and video-frame) features make these properties measurable at the scale of entire ad libraries. The construct lineage is clear: visual complexity and gaze (Pieters, Wedel, and Zhang 2007; Pieters, Wedel, and Batra 2010), the arousal-potential account of aesthetic preference (Berlyne 1960), and the visual sources of brand meaning (Batra 2019) all become features once a CNN is in the loop. Video advertising extends this from stills to sequences. The dynamics of attention within a video ad (when a brand appears, how scenes are cut, what holds the viewer) predict skipping and recall; T. S. Teixeira, Wedel, and Pieters (2010) show that inducing joy and surprise sustains attention to online video ads, and T. Teixeira, Picard, and el Kaliouby (2014) show that emotional and attentional dynamics jointly govern whether viewers watch through or zap. Per-frame image features are precisely how those dynamics are operationalized at scale, a theme picked up in the video-marketing literature (Rajaram and Manchanda 2020; Leyva and Sanchez 2021) and developed further in Chapter 13. The methodological caveat from Section 45.5 binds hardest here: observational creative comparisons confound the image with the campaign that bought it, so the credible designs hold the audience fixed and randomize the creative.
45.6.2 Social Media and User-Generated Images
On social platforms, the image is the post. Engagement, reach, and brand-relevant outcomes depend on visual content the firm often does not produce, which makes image features the only way to measure exposure and appeal across user-generated corpora. Three measurement problems recur. Incidental brand exposure is captured by logo detection and the share-of-visual-voice it yields, revealing brand presence that text-only listening (Chapter 12) misses entirely. Image-driven engagement is studied by relating content and aesthetic features—people versus products, warmth, professionalism, complexity—to likes, shares, and comments; Li and Xie (2019) show empirically which image-content features raise social-media engagement, and Hartmann et al. (2021b) and Hartmann et al. (2023) develop image-mining pipelines for brand-relevant social-media content and relate visual elements to engagement. Influencer aesthetics connect a creator’s visual style to follower growth and sponsorship value (Chapter 17), with content features quantifying the self-presentation strategies that drive parasocial response. A recurring, sobering finding across this work is that what is depicted (content) and how it looks (aesthetics) carry largely separable effects, so a study that conflates them—loading both onto a single “image quality” score—misattributes one to the other.
Note
The text-and-image fusion frontier. Marketing stimuli are rarely image-only: an ad, a post, or a listing pairs a picture with words. Multimodal models learn a joint embedding in which images and the text that co-occurs with them are mapped to the same space, so a post can be represented by a single vector capturing both. This lets the analyst ask whether the image and caption are congruent—and incongruence is itself a measurable, theory-relevant feature—and connects images-as-data to the text methods of Chapter 43 and the broader machine-learning toolkit of Chapter 65. Witte et al. (2026) show that transformer-based vision-language models can classify marketing images directly from natural-language prompts, often without task-specific training, and the same model class can now generate marketing visuals, raising the question of whether it reaches human-level creative quality (Hartmann, Exner, and Domdey 2025). The fusion of image with text and other modalities is the subject of Chapter 52.
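Once image and caption live in one embedding space, congruence is commonly scored by cosine similarity. The sketch below uses made-up three-dimensional vectors in place of real model output; the vectors, their labels, and the function name `cosine` are assumptions for illustration.

```r
# Cosine similarity between two embedding vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Hypothetical joint embeddings, as a vision-language model might return them.
img_beach     <- c(0.9, 0.1, 0.2)   # image of a beach scene
cap_beach     <- c(0.8, 0.2, 0.1)   # caption about the beach: congruent
cap_insurance <- c(0.1, 0.9, 0.3)   # caption about insurance: incongruent

congruent   <- cosine(img_beach, cap_beach)
incongruent <- cosine(img_beach, cap_insurance)

cat("Image vs. matching caption:   ", round(congruent, 3), "\n")
cat("Image vs. mismatched caption: ", round(incongruent, 3), "\n")
```

The resulting score is itself a generated feature, so the validation and measurement-error cautions of Section 45.5 apply to it as well.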
45.7 Pitfalls and Identification
Five recurring failures separate credible image-as-data work from decorative work.
The first is treating predictions as measurements, ignoring the generated-regressor problem of Section 45.5. Always report classifier validation and, where error may be non-classical, prefer experimental assignment of the image.
The second is dataset shift. A backbone pre-trained on ImageNet encodes the visual statistics of web photos circa its training era; applied to a domain it never saw—X-ray packaging, niche product categories, a new platform’s aesthetic—its features may be uninformative or biased. Validate on in-domain data, and fine-tune when the target distribution diverges.
The third is spurious cues and shortcut learning. CNNs latch onto whatever predicts the label in training, including artifacts—a watermark, a background, a camera type—that correlate with the outcome by accident. A classifier that appears to “detect luxury” may have learned to detect studio lighting. Probe what the model actually responds to before interpreting a feature substantively.
The fourth is conflating content and aesthetics, collapsing what is shown and how it looks into one score when the social-media evidence says their effects are separable. Keep the two families of features distinct and let the data assign their effects.
The fifth is ethics and privacy, especially for facial and demographic inference. Subgroup-correlated error and the inference of protected attributes raise fairness and legal exposure (Chapter 24); coarse, validated labels are preferable to fine-grained inferences, and audits by subgroup are not optional.
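A subgroup audit of the kind the fifth failure calls for can be as simple as tabulating error rates within each group on a hand-coded validation sample. The sketch below uses a hypothetical data frame `audit` with invented labels, and the helper `audit_by_group` is an illustrative name, not part of any package.

```r
# Hypothetical validation sample: human-coded truth, classifier prediction,
# and a coarse subgroup label recorded for auditing purposes only.
audit <- data.frame(
  subgroup  = rep(c("g1", "g2"), each = 10),
  truth     = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
                1, 1, 1, 1, 1, 0, 0, 0, 0, 0),
  predicted = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
                1, 1, 0, 0, 0, 0, 0, 0, 1, 1)
)

# Error summaries for one subgroup's rows.
audit_by_group <- function(d) {
  c(
    n                   = nrow(d),
    error_rate          = mean(d$predicted != d$truth),
    false_negative_rate = mean(d$predicted[d$truth == 1] == 0),
    false_positive_rate = mean(d$predicted[d$truth == 0] == 1)
  )
}

# One row per subgroup, one column per summary.
audit_table <- t(sapply(split(audit, audit$subgroup), audit_by_group))
print(round(audit_table, 2))

# Gap in overall error between the two groups.
error_gap <- unname(abs(diff(audit_table[, "error_rate"])))
```

In this toy sample the classifier is far less accurate for the second group. A gap of that size means the feature's measurement error is correlated with group membership, and any regression that includes the feature inherits the problem.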
Underlying all five is a single discipline borrowed from the rest of this book: fix the representation before looking at outcomes, validate the features against ground truth, prefer designs that randomize the image over those that merely observe it, and report the bias that remains. An image feature is a measurement, and like every measurement in Chapter 35 it must be shown to be reliable and valid before its coefficient can be believed. Images are one branch of the broader unstructured-data program (Balducci and Marinova 2018); for a survey focused on the visual channel and its methods, see Dzyabura, El Kihal, and Peres (2021).
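The cost of skipping validation can be seen in a short simulation. The sketch below assumes purely classical error (noise independent of the true feature and of the outcome) and a validation subsample in which the true feature was hand-coded. All names and parameter values are illustrative, and the correction shown does not apply when the error is correlated with the outcome.

```r
set.seed(1)
n    <- 20000
beta <- 2                               # true effect of the image feature

f_true <- rnorm(n)                      # true feature f* (not observed at scale)
u      <- rnorm(n)                      # classical classifier error, Var(u) = 1
f_hat  <- f_true + u                    # feature as reported by the classifier
y      <- 1 + beta * f_true + rnorm(n)  # outcome, e.g. standardized engagement

# Naive regression on the predicted feature: attenuated toward zero.
beta_naive <- unname(coef(lm(y ~ f_hat))["f_hat"])

# Assumed validation subsample in which humans coded the true feature.
val       <- sample(n, 2000)
var_u_hat <- var(f_hat[val] - f_true[val])

# Reliability ratio Var(f*) / (Var(f*) + Var(u)), estimated as 1 - Var(u) / Var(f_hat).
reliability_hat <- 1 - var_u_hat / var(f_hat)

# Rescaling by the reliability ratio; valid only under classical error.
beta_corrected <- beta_naive / reliability_hat

print(round(c(naive = beta_naive, reliability = reliability_hat,
              corrected = beta_corrected), 3))
```

With equal variances for the true feature and the noise, the population reliability is one half, so the naive coefficient lands near half of the true effect and the rescaled one near the truth. The rescaling ignores the sampling error in the reliability estimate, so standard errors for the corrected coefficient would need a bootstrap or a delta-method adjustment.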
Replication resources: image analytics
The hand-engineered color/complexity features and the frozen-embedding transfer-learning demonstration in this chapter run on open R/Python tooling (magick/imager in R; Pillow, OpenCV, and pretrained torchvision/timm backbones in Python). The canonical backbones have public implementations: ResNet at github.com/KaimingHe/deep-residual-networks (He et al. 2016), and ImageNet-pretrained AlexNet and Inception-family weights (Krizhevsky, Sutskever, and Hinton 2017; Szegedy et al. 2015) distributed with common deep-learning libraries such as torchvision. SIFT (Lowe 2004) is in OpenCV. The empirical marketing studies cited here (e.g., Liu, Dzyabura, and Mizik 2020; Hartmann et al. 2021a; Zhang et al. 2022) generally rely on proprietary image corpora; confirm any code/data release on the article page rather than assuming one.
45.8 Key Takeaways
An image is a high-dimensional tensor (Equation 45.1); useful image analysis replaces raw pixels with learned or hand-engineered features whose dimensions a marketing theory can interpret.
The convolutional network (Equation 45.2) builds locality and translation structure in by construction; in marketing it is almost always used via transfer learning, with a frozen pre-trained backbone supplying an embedding to a small downstream model.
Marketing image features divide into brand (logo and perceptual identity, as in Liu, Dzyabura, and Mizik (2020)), aesthetic (beauty and arousal potential, grounded in Berlyne (1960)), and content (objects, scenes, faces) families.
Image features are generated regressors: classical error attenuates effects by the reliability ratio (Equation 45.5), and non-classical, outcome-correlated error biases them unpredictably. Validate, and prefer designs that randomize the image.
The strongest applications—advertising creative and social-media engagement—pair image features with experimental assignment of the visual stimulus, severing the confound between what an image looks like and the campaign that produced it.
Balducci, Bitty, and Detelina Marinova. 2018. “Unstructured Data in Marketing.” Journal of the Academy of Marketing Science 46 (4): 557–90. https://doi.org/10.1007/s11747-018-0581-x.
Batra, Rajeev. 2019. “Creating Brand Meaning: A Review and Research Agenda.” Journal of Consumer Psychology 29 (3): 535–46. https://doi.org/10.1002/jcpy.1122.
Berlyne, D. E. 1960. “Toward a Theory of Exploratory Behavior: I. Arousal and Drive.” In Conflict, Arousal, and Curiosity, 163–92. McGraw-Hill Book Company. https://doi.org/10.1037/11164-007.
Culotta, Aron, and Jennifer Cutler. 2016. “Mining Brand Perceptions from Twitter Social Networks.” Marketing Science 35 (3): 343–62. https://doi.org/10.1287/mksc.2015.0968.
Dzyabura, Daria, Siham El Kihal, and Renana Peres. 2021. “Image Analytics in Marketing.” In Handbook of Market Research, 665–92. Springer. https://doi.org/10.1007/978-3-319-57413-4_38.
Hagtvedt, Henrik, and Vanessa M. Patrick. 2014. “Consumer Response to Overstyling: Balancing Aesthetics and Functionality in Product Design.” Psychology and Marketing 31 (7): 518–25. https://doi.org/10.1002/mar.20713.
Hartmann, Jochen, Yannick Exner, and Samuel Domdey. 2025. “The Power of Generative Marketing: Can Generative AI Create or Reach Human-Level Visual Marketing Content?” International Journal of Research in Marketing 42 (1): 13–31. https://doi.org/10.1016/j.ijresmar.2024.09.002.
Hartmann, Jochen, Mark Heitmann, Christina Schamp, and Oded Netzer. 2021b. “The Power of Brand Selfies.” Journal of Marketing Research 58 (6): 1159–77.
Hartmann, Jochen, Mark Heitmann, Christian Siebert, and Christina Schamp. 2023. “More Than a Feeling: Accuracy and Application of Sentiment Analysis.” International Journal of Research in Marketing 40 (1): 75–87.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Deep Residual Learning for Image Recognition.” In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–78. https://doi.org/10.1109/cvpr.2016.90.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2017. “ImageNet Classification with Deep Convolutional Neural Networks.” Communications of the ACM 60 (6): 84–90. https://doi.org/10.1145/3065386.
Labrecque, Lauren I., and George R. Milne. 2012. “Exciting Red and Competent Blue: The Importance of Color in Marketing.” Journal of the Academy of Marketing Science 40 (5): 711–27. https://doi.org/10.1007/s11747-010-0245-y.
Labrecque, Lauren I., Vanessa M. Patrick, and George R. Milne. 2013. “The Marketers’ Prismatic Palette: A Review of Color Research and Future Directions.” Psychology and Marketing 30 (2): 187–202. https://doi.org/10.1002/mar.20597.
Leyva, Roberto, and Victor Sanchez. 2021. “Video Memorability Prediction via Late Fusion of Deep Multi-Modal Features.” In 2021 IEEE International Conference on Image Processing (ICIP), 2488–92. IEEE.
Li, Yiyi, and Ying Xie. 2019. “Is a Picture Worth a Thousand Words? An Empirical Study of Image Content and Social Media Engagement.” Journal of Marketing Research 57 (1): 1–19. https://doi.org/10.1177/0022243719881113.
Liu, Liu, Daria Dzyabura, and Natalie Mizik. 2020. “Visual Listening In: Extracting Brand Image Portrayed on Social Media.” Marketing Science 39 (4): 669–86. https://doi.org/10.1287/mksc.2020.1226.
Luffarelli, Jonathan, Mudra Mukesh, and Ammara Mahmood. 2019. “Let the Logo Do the Talking: The Influence of Logo Descriptiveness on Brand Equity.” Journal of Marketing Research 56 (5): 862–78. https://doi.org/10.1177/0022243719845000.
Mahmood, Ammara, Jonathan Luffarelli, and Mudra Mukesh. 2019. “What’s in a Logo? The Impact of Complex Visual Cues in Equity Crowdfunding.” Journal of Business Venturing 34 (1): 41–62. https://doi.org/10.1016/j.jbusvent.2018.09.006.
McQuarrie, Edward F., and David Glen Mick. 1999. “Visual Rhetoric in Advertising: Text-Interpretive, Experimental, and Reader-Response Analyses.” Journal of Consumer Research 26 (1): 37–54. https://doi.org/10.1086/209549.
Pieters, Rik, Michel Wedel, and Rajeev Batra. 2010. “The Stopping Power of Advertising: Measures and Effects of Visual Complexity.” Journal of Marketing 74 (5): 48–60. https://doi.org/10.1509/jmkg.74.5.48.
Pieters, Rik, Michel Wedel, and Jie Zhang. 2007. “Optimal Feature Advertising Design Under Competitive Clutter.” Management Science 53 (11): 1815–28. https://doi.org/10.1287/mnsc.1070.0732.
Rajaram, Prashant, and Puneet Manchanda. 2020. “Video Influencers: Unboxing the Mystique.” arXiv Preprint arXiv:2012.12311.
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. “Going Deeper with Convolutions.” In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1–9. https://doi.org/10.1109/cvpr.2015.7298594.
Teixeira, Thales S., Michel Wedel, and Rik Pieters. 2010. “Moment-to-Moment Optimal Branding in TV Commercials: Preventing Avoidance by Pulsing.” Marketing Science 29 (5): 783–804. https://doi.org/10.1287/mksc.1100.0567.
Teixeira, Thales, Rosalind Picard, and Rana el Kaliouby. 2014. “Why, When, and How Much to Entertain Consumers in Advertisements? A Web-Based Facial Tracking Field Study.” Marketing Science 33 (6): 809–27. https://doi.org/10.1287/mksc.2014.0854.
Witte, Maximilian, Mark Heitmann, Jochen Hartmann, and Keno Tetzlaff. 2026. “Language of Images: Classifying Marketing Images with Transformers and Vision Language Models.” International Journal of Research in Marketing, January. https://doi.org/10.1016/j.ijresmar.2026.01.001.
Zeiler, Matthew D., and Rob Fergus. 2014. “Visualizing and Understanding Convolutional Networks.” In Computer Vision – ECCV 2014, 818–33. Lecture Notes in Computer Science. https://doi.org/10.1007/978-3-319-10590-1_53.
Zhang, Shunyuan, Dokyun Lee, Param Vir Singh, and Kannan Srinivasan. 2022. “What Makes a Good Image? Airbnb Demand Analytics Leveraging Interpretable Image Features.” Management Science 68 (8): 5644–66. https://doi.org/10.1287/mnsc.2021.4175.
# Images as Data {#sec-image-processing}Marketing has always been a visual discipline. A logo, a package, anadvertisement, an influencer's selfie, a product photograph on a retail site—eachis a deliberately constructed image meant to move attention, belief, and demand.Until recently those images were data only in a metaphorical sense: researchers*looked* at them, coded them by hand, and ran small experiments on a handful ofmanipulated stimuli. What has changed is that an image can now be turned into*numbers*—high-dimensional, machine-readable features—at the scale of millions ofphotographs, and those numbers can be entered into the same demand systems, choicemodels, and regressions that the rest of this book develops. This chapter is abouthow to do that responsibly: how to extract brand, aesthetic, and content featuresfrom images, what the deep-learning machinery underneath actually computes, andhow to deploy the resulting features in advertising and social-media researchwithout fooling oneself.The intellectual move mirrors the one made for text (@sec-text-as-data). There,unstructured language becomes a document–term matrix or a sequence of embeddings;here, an unstructured raster of pixels becomes a feature vector. In both cases therepresentation is *lossy* and *learned*, and in both cases the central empiricaldanger is the same: the features are generated by a model whose errors arecorrelated with the very outcomes the researcher wants to explain, so naiveregression confounds measurement with effect. The payoff for getting it right islarge. Image features let the analyst measure constructs that were previouslylocked inside qualitative judgment—a brand's visual identity, an ad's aestheticappeal, the "warmth" of a product photo—and relate them to clicks, engagement,sales, and firm value.The chapter proceeds from pixels to constructs to applications. 
It first fixeswhat an image *is* as a mathematical object and what we mean by an image feature.It then develops the workhorse of modern computer vision—the convolutional neuralnetwork—giving its estimator, its loss, and the assumptions under which a featureextracted from it is meaningful. With that in hand it turns to the three familiesof marketing-relevant features (brand, aesthetic, content) and to theeconometrics of using generated image features as regressors. It closes withapplications to advertising and social media, and with the identificationpitfalls that separate a credible image-as-data study from a decorative one.## What an Image IsFormally, a digital image is a function sampled on a grid. A color image of height$H$ and width $W$ is a third-order tensor$$\mathbf{X} \in [0,1]^{H \times W \times C}, \qquad C = 3,$$ {#eq-image-tensor}where the three channels $C$ hold red, green, and blue intensities and each entry$\mathbf{X}_{ijc}$ is a normalized pixel value. A modest $224 \times 224$ RGBimage—the canonical input size for many vision models—already lives in a space ofdimension $224 \times 224 \times 3 \approx 1.5 \times 10^{5}$. This is the**curse of dimensionality** in its rawest form: the number of pixels vastlyexceeds the number of labeled examples in any marketing dataset, and pixels areindividually almost meaningless. The pixel at position $(112, 60)$ carries nostable interpretation; what matters is the *arrangement* of pixels into edges,textures, objects, and scenes.Two properties of images dictate everything about how they are modeled. First,**locality**: meaningful structure (an edge, a corner) is built from nearbypixels, so a useful representation should aggregate local neighborhoods beforeglobal ones. Second, **translation structure**: a logo is the same logo whether itsits in the top-left or the center of the frame, so a useful representation shouldrespond similarly to a pattern regardless of where it appears. 
A model thatignored these properties—say, a fully connected network treating each pixel as anunrelated input—would have to relearn "what an edge looks like" separately at everylocation and would need astronomically more data to do so. The architectures thatdominate computer vision are precisely those that build locality and translationstructure in by construction.::: {.callout-note}An **image feature** is any function $f: [0,1]^{H \times W \times C} \to\mathbb{R}^{d}$ that maps a raw image to a lower-dimensional vector intended tocapture a construct of interest. Features range from hand-engineered andinterpretable (mean saturation, number of detected faces, fraction of the frameoccupied by a brand logo) to learned and opaque (the 2{,}048-dimensionalpenultimate-layer activations of a deep network). The art of images-as-data ischoosing features whose dimensions a marketing theory can speak about.:::## Classical Features: Color, Composition, and Hand-EngineeringBefore deep learning, vision in marketing relied on **hand-engineered features**:quantities a researcher computes with an explicit formula and can defend to areferee line by line. They remain valuable precisely because they are transparent,and several map directly onto long-standing aesthetic theory.Color is the most tractable. RGB is poor for human-meaningful description becauseits axes (red, green, blue intensity) do not correspond to how people talk aboutcolor, so analysts transform to the **HSV** space (hue, saturation, value), inwhich *hue* is the dominant wavelength, *saturation* the colorfulness, and *value*the brightness. From an image one can compute the mean and dispersion of eachchannel, the share of warm versus cool hues, and the **colorfulness** index, andrelate these to response. 
This connects to a classical account of aestheticpreference: @Berlyne_1960 argued that hedonic value is an inverted-U function of*arousal potential*—stimuli that are too simple bore and too complex overwhelm,with moderate complexity, novelty, and contrast preferred. Color statistics,visual entropy, and edge density are all operationalizations of arousal potential,and the inverted-U is a recurring empirical shape in this literature.Composition features quantify *where* content sits and *how much* of it there is.**Visual complexity**—the amount and variety of detail in an image—has a longpedigree in advertising research, where @pieters2010 distinguish *featurecomplexity* (irregular, dense visual elements) from *design complexity* (theelaborateness of the deliberate arrangement) and show the two have oppositeeffects on attention and attitude: feature complexity hurts brand attention whiledesign complexity helps it. Earlier, @pieters2007 established that the**eye-trackable** structure of an ad—how gaze is distributed across the brand,the pictorial, and the text elements—predicts memory. Visual complexity iscommonly proxied by file size after compression, by edge density from a Sobel orCanny filter, or by the entropy of the color histogram. The point of cataloguingthese is not nostalgia: hand-engineered features are still the right tool when theconstruct is simple, the sample is small, or interpretability is paramount, andthey make excellent *controls* alongside learned features.A compact illustration computes interpretable color and complexity features for asynthetic image and shows how they vary with content.```{r image-handcrafted-features, message=FALSE, warning=FALSE}set.seed(34)# Build three synthetic 64x64 RGB images with known properties:# (a) a calm, low-saturation gray scene; (b) a vivid warm scene;# (c) a high-complexity noisy scene. 
Each is an array [H, W, C] in [0,1].make_image <-function(base, noise_sd) { arr <-array(base, dim =c(64, 64, 3)) arr <- arr +array(rnorm(64*64*3, 0, noise_sd), dim =dim(arr))pmin(pmax(arr, 0), 1)}img_calm <-make_image(c(0.55, 0.55, 0.55), 0.02) # near-gray, low noiseimg_warm <-make_image(c(0.85, 0.35, 0.15), 0.04) # warm (red/orange)img_busy <-make_image(c(0.50, 0.50, 0.50), 0.25) # high-variance "busy"# RGB -> HSV saturation and a colorfulness proxy (Hasler-Susstrunk style).sat_value <-function(im) { mx <-pmax(im[, , 1], im[, , 2], im[, , 3]) mn <-pmin(im[, , 1], im[, , 2], im[, , 3])mean(ifelse(mx >0, (mx - mn) / mx, 0)) # mean saturation}colorfulness <-function(im) { rg <- im[, , 1] - im[, , 2] yb <-0.5* (im[, , 1] + im[, , 2]) - im[, , 3]sqrt(sd(rg)^2+sd(yb)^2) +0.3*sqrt(mean(rg)^2+mean(yb)^2)}# Edge density as a translation-invariant complexity proxy: mean absolute# horizontal+vertical gradient of the luminance channel.edge_density <-function(im) { lum <-0.299* im[, , 1] +0.587* im[, , 2] +0.114* im[, , 3] gx <-abs(lum[-1, ] - lum[-nrow(lum), ]) gy <-abs(lum[, -1] - lum[, -ncol(lum)])mean(gx) +mean(gy)}features <-data.frame(image =c("calm", "warm", "busy"),saturation =c(sat_value(img_calm), sat_value(img_warm), sat_value(img_busy)),colorfulness =c(colorfulness(img_calm), colorfulness(img_warm), colorfulness(img_busy)),edge_density =c(edge_density(img_calm), edge_density(img_busy), edge_density(img_busy)))features[, -1] <-round(features[, -1], 3)knitr::kable(features, caption ="Interpretable color and complexity features for three synthetic images.")```The warm image scores high on saturation and colorfulness; the busy image scoreshigh on edge density—the digital analogue of @Berlyne_1960's arousal potential.These features are cheap, reproducible, and interpretable, but they are blind to*meaning*: they cannot tell a logo from a face. 
For meaning, the field turns tolearned representations.## Deep Representations: The Convolutional Neural NetworkThe central tool of modern computer vision is the **convolutional neural network**(CNN). Its design is a direct response to the two properties of images namedabove. Rather than connect every pixel to every hidden unit, a CNN slides smalllearnable filters across the image, so each unit sees only a local neighborhood(locality) and the *same* filter is applied at every position (translationstructure, and an enormous reduction in parameters).### The Convolution OperationA **convolutional layer** applies a bank of small filters (kernels) to its input.Let the input be a feature map $\mathbf{Z} \in \mathbb{R}^{H \times W \timesC_{\text{in}}}$ and let a single filter be $\mathbf{K} \in \mathbb{R}^{k \times k\times C_{\text{in}}}$ with bias $b$. The output at spatial location $(i,j)$ is$$(\mathbf{Z} * \mathbf{K})_{ij}= \sigma\!\left(b + \sum_{u=1}^{k}\sum_{v=1}^{k}\sum_{c=1}^{C_{\text{in}}}\mathbf{K}_{uvc}\,\mathbf{Z}_{\,i+u,\,j+v,\,c}\right),$$ {#eq-convolution}where $\sigma(\cdot)$ is a nonlinearity, almost always the **rectified linearunit** $\sigma(z) = \max(0, z)$. A layer holds $C_{\text{out}}$ such filters,producing an output tensor with $C_{\text{out}}$ channels. Three properties make@eq-convolution the right primitive. **Parameter sharing**: one filter, with$k^2 C_{\text{in}} + 1$ parameters, is reused at every location, so a layer learns"detect this pattern anywhere" rather than memorizing locations. **Localconnectivity**: each output depends only on a $k \times k$ window, encodinglocality. 
**Translation equivariance**: shifting the input shifts the outputidentically, so a logo detector fires wherever the logo appears.Convolutional layers are interleaved with **pooling** layers that downsample—mostcommonly *max pooling*, which reports the maximum activation in each smallwindow—building tolerance to small shifts and shrinking the spatial resolutionwhile the channel dimension grows. Stacking these operations yields arepresentational hierarchy that has been verified empirically: early layersrespond to oriented edges and color blobs, middle layers to textures and motifs,and late layers to object parts and whole objects [@zeiler2014visualizing]. @fig-cnn-pipelinesketches the pipeline from raw pixels to a task head.```{mermaid}%%| label: fig-cnn-pipeline%%| fig-cap: "A convolutional network as a feature extractor with a task-specific head. Early layers encode generic low-level structure; deep layers encode semantic content. Marketing applications usually freeze the backbone and read off the penultimate-layer embedding."flowchart LR A["Raw image<br/>H × W × 3"] --> B["Conv + ReLU<br/>(edges, color)"] B --> C["Pool<br/>(downsample)"] C --> D["Conv + ReLU<br/>(textures, motifs)"] D --> E["Conv + ReLU<br/>(object parts)"] E --> F["Embedding z ∈ ℝ^d<br/>(penultimate layer)"] F --> G["Task head"] G --> H1["Classification<br/>(brand present?)"] G --> H2["Detection<br/>(where is the logo?)"] G --> H3["Regression<br/>(aesthetic score)"]```### The Estimator and Its LossA CNN with parameters $\boldsymbol{\theta}$ (all filter weights and biases) definesa map $g_{\boldsymbol{\theta}}: \mathbf{X} \mapsto \hat{\mathbf{y}}$. For aclassification task with $L$ labels (e.g., "contains a car," "contains a dog"), thefinal layer produces a probability vector via the **softmax**,$\hat{p}_\ell = \exp(s_\ell)/\sum_{m}\exp(s_m)$, where $s_\ell$ are the network'soutput scores (logits). 
Given labeled training data $\{(\mathbf{X}_n,\mathbf{y}_n)\}_{n=1}^{N}$, the parameters minimize the **cross-entropy** loss withweight penalty,$$\hat{\boldsymbol{\theta}}= \arg\min_{\boldsymbol{\theta}}\;-\frac{1}{N}\sum_{n=1}^{N}\sum_{\ell=1}^{L}y_{n\ell}\,\log \hat{p}_{n\ell}(\boldsymbol{\theta})\;+\; \lambda \lVert \boldsymbol{\theta} \rVert_2^2 ,$$ {#eq-crossentropy}solved by **stochastic gradient descent**: gradients are computed onmini-batches by backpropagation and the parameters stepped against them. Thepenalty $\lambda \lVert \boldsymbol{\theta}\rVert_2^2$ (weight decay) is one ofseveral regularizers—dropout, data augmentation, and early stopping areothers—that the over-parameterized regime makes essential. This is the sameempirical-risk-minimization template developed for prediction in @sec-ai-ml; whatis special here is only the architecture of $g_{\boldsymbol{\theta}}$.::: {.callout-warning}A CNN trained by @eq-crossentropy minimizes *predictive* loss, not the recovery ofany causal or structural quantity. Its outputs are calibrated to the trainingdistribution and labels, nothing more. Treating a predicted label or an embeddingas if it were a ground-truth measurement—rather than an estimate withdistribution-dependent error—is the original sin of images-as-data, and@sec-image-econometrics shows what it costs.:::### Transfer Learning: Why Marketing Rarely Trains From ScratchNo marketing dataset is large enough to estimate the tens of millions ofparameters in a modern CNN from scratch. The field instead relies on **transferlearning**: take a backbone network pre-trained on a massive general-purposecorpus (canonically ImageNet, with roughly a million labeled images across athousand object categories), discard its task head, and reuse its learnedrepresentation. 
The canonical backbones trace the field's progress—the deep CNN thatlaunched it [@krizhevsky2017imagenet], the deeper inception architecture[@szegedy2015googlenet], and the residual networks [@he2016resnet] that remain a defaultimage encoder—and any of them can be downloaded pre-trained and reused. Two modes are common. In **feature extraction**, the backbone is*frozen* and the penultimate-layer activations $\mathbf{z} =h_{\boldsymbol{\theta}}(\mathbf{X}) \in \mathbb{R}^{d}$—the **embedding**—aretreated as a fixed, off-the-shelf feature vector fed to a simple downstream model(logistic regression, gradient boosting) trained on the marketing labels. In**fine-tuning**, the backbone weights are *unfrozen* and updated, usually with asmall learning rate, so the representation adapts to the target domain.The justification is the hierarchy of @fig-cnn-pipeline: early- and middle-layerfeatures (edges, textures, parts) are nearly universal across natural images, sothey transfer; only the late, task-specific layers must be relearned. Thepractical rule is to *fine-tune more layers the larger and more domain-specificthe target data, and freeze more the smaller and more generic*. The same logicunderlies the brand-image work of @liu2020, who train a multi-labelconvolutional network on consumer-created images to recover perceptual brandattributes (see @sec-image-brand). For most marketing studies—where the labeledsample numbers in the thousands, not millions—frozen-backbone feature extractionis both the safest and the most reproducible choice.```{r transfer-embedding, message=FALSE, warning=FALSE}set.seed(34)# Simulate the *output* of a frozen CNN backbone: each image is represented by a# d-dimensional embedding z. 
In practice z = h_theta(image) from a pretrained# network; here we generate embeddings whose geometry encodes two latent# "visual styles" so a downstream classifier can separate them.d <-16; n_per <-150style_A_mean <-rnorm(d, 0, 1) # e.g., "minimalist product shot"style_B_mean <- style_A_mean +rnorm(d, 0, 1.2) # e.g., "lifestyle scene"Z_A <-matrix(rnorm(n_per * d), n_per, d) +matrix(style_A_mean, n_per, d, byrow =TRUE)Z_B <-matrix(rnorm(n_per * d), n_per, d) +matrix(style_B_mean, n_per, d, byrow =TRUE)Z <-rbind(Z_A, Z_B)y <-factor(rep(c("minimalist", "lifestyle"), each = n_per))# Downstream task: a simple logistic head on the frozen embedding (no CNN# retraining). This is "feature extraction" transfer learning in miniature.df <-data.frame(y = y, Z)idx <-sample(nrow(df), 0.7*nrow(df))fit <-glm(y ~ ., data = df[idx, ], family = binomial)pred <-ifelse(predict(fit, df[-idx, ], type ="response") >0.5,"minimalist", "lifestyle")acc <-mean(pred == df[-idx, "y"])cat("Held-out accuracy of logistic head on frozen embedding:",round(acc, 3), "\n")```The example deliberately separates the expensive, transferable part (theembedding, treated as given) from the cheap, task-specific part (a logistic headthe analyst actually estimates)—the division of labor that makes images-as-datafeasible in marketing.## Three Families of Marketing Image FeaturesMarketing-relevant image features fall into three families—**brand**,**aesthetic**, and **content**—distinguished by the construct they measure and themachinery they require. 
@tbl-feature-families summarizes them; the subsections thatfollow develop each in turn.```{r}#| label: tbl-feature-families#| tbl-cap: "Three families of marketing image features, the construct each targets, and how it is measured."#| message: false#| warning: false#| echo: falseff <-data.frame(Family =c("Brand", "Aesthetic", "Content"),Construct =c("Presence/prominence of brand marks; visual brand identity","Beauty, professionalism, arousal potential, style","Objects, scenes, faces, emotions, activities depicted"),Method =c("Logo detection; learned brand-attribute classifiers","Aesthetic regressors; hand-engineered color/complexity","Object detection; scene & face recognition"),Output =c("Logo box, area share, brand-attribute scores","Continuous aesthetic / arousal score","Object labels & counts, scene class, facial affect"))knitr::kable(ff)```### Brand Features {#sec-image-brand}Brand features answer two questions: *is the brand here, and how prominently?* Thefirst is **logo detection**—an object-detection task that returns bounding boxesand confidence scores for known marks—and it underwrites a quantity of directmanagerial interest: the **share of visual voice**, the fraction of image area (orof impressions) a brand's marks occupy across a corpus of user- orsponsor-generated images. This matters because brand exposure on social media isincreasingly *incidental*, embedded in content the firm does not control, socounting hashtags understates true exposure while logo detection captures it. Thenotion of **prominence**—the conspicuousness of a brand's mark—carriesstatus-signaling consequences developed in @sec-branding, and image data let it bemeasured at scale rather than coded by hand. 
The *design* of the mark itself isconsequential: @luffarelli2019logo show that descriptive logos (those that visuallysignal what the brand does) raise brand equity, and @mahmood2019logo find that complexvisual logo cues shape equity-crowdfunding outcomes---logo features that computer visioncan now extract at scale. How consumers depict brands in their own photos is itself ameasurable typology: @hartmann2021brandselfies use computer vision to distinguish"brand selfies" (the consumer in frame with the product) from "packshots" and show thetwo drive engagement differently.The second, deeper question is whether an image conveys a brand's *perceptualidentity*. @liu2020 answer it directly: their BrandImageNet trains a multi-labelCNN to detect abstract perceptual attributes (e.g., "glamorous," "rugged,""fun") in ordinary consumer-created images, and the machine-derived portrait of abrand's image tracks survey-based perceptions in near real time. Complementary approacheselicit brand perception by having consumers select images rather than rate scales[@dzyabura2021visual] and mine brand attributes from the social-network structure around abrand [@culotta2016]. This is aconceptual advance over logo counting—it measures what a brand *means* visually,not merely whether it is shown—and it links image data to the brand-meaningconstruct of @batra2019, who locate visual cues (advertising, packaging) among theprimary sources from which brand associations are built. Crucially, such attributeclassifiers are learned, so their outputs are estimates with brand- andcontext-dependent error; this is exactly the generated-regressor problemformalized below.### Aesthetic FeaturesAesthetic features score an image on beauty, professionalism, or arousal potential.Two routes exist. The **hand-engineered** route assembles the color and complexitystatistics of the previous section into an interpretable index grounded in the@Berlyne_1960 inverted-U. 
The **learned** route trains a CNN to regress humanaesthetic ratings onto images, yielding a single continuous "aesthetic score" perimage. Each route has its place: learned scores predict human judgment far betterbut are opaque and may encode spurious correlates of the training raters'tastes; hand-engineered indices are weaker predictors but transparent andmanipulable, which matters when the goal is *advice* ("raise saturation") ratherthan mere ranking.Aesthetics is not decoration. In advertising, the visual *pleasure* of an ad is aproximate driver of attention and attitude, mediated by the complexity mechanism of@pieters2010 and by the gaze patterns of @pieters2007, and the rhetorical *figuration*of an image—visual metaphor, pun, and other tropes—shapes persuasion in ways parallel toverbal rhetoric [@mcquarrie1999visualrhetoric]. Specific perceptual choices carry meaning:color hue and saturation move brand personality and downstream response [@labrecque2012red;@labrecque2013palette], and aesthetic styling trades off against perceived functionalityrather than improving evaluation monotonically [@hagtvedt2014overstyling]. In socialcommerce, image quality is a documented driver of demand: professionally composed productand listing photos lift engagement and conversion, and @zhang2022goodimage show, usinginterpretable deep-learning image features on Airbnb listings, *which* properties of aphoto (composition, brightness, depth) actually raise demand. Theaesthetic score is therefore a feature with a theory behind it, not a black box tobe regressed on outcomes for its own sake.### Content FeaturesContent features identify *what is depicted*. **Object detection** returns theset, count, and location of recognizable objects (a person, a beverage, a beach);**scene recognition** classifies the overall setting (kitchen, gym, nightclub);**face analysis** locates faces and, more controversially, infers attributes suchas apparent emotion, age, or gaze direction. 
Content features connect images to substantive marketing theory: the presence of *people* versus *products* in a post, the warmth of depicted facial expressions, and the activities shown all map onto constructs (self-presentation, social proof, lifestyle congruence) that the social-media and influencer literatures (@sec-influencer-marketing) treat as drivers of engagement.

::: {.callout-warning}
Face- and demographic-inference features are the most fraught in this chapter. Apparent-emotion and inferred-demographic classifiers are trained on labeled data that encode the annotators' cultural assumptions; their errors are systematically correlated with skin tone, age, and gender, raising both validity and fairness concerns. The privacy implications of inferring protected attributes from images are governed by the regimes in @sec-privacy. Researchers should treat such features as low-confidence proxies, audit error rates by subgroup, and prefer coarse, well-validated labels (face present / absent) over fine-grained inferences (specific emotion, exact age) wherever the theory permits.
:::

## The Econometrics of Generated Image Features {#sec-image-econometrics}

The defining methodological problem of images-as-data is that the features are **generated regressors**: they are not observed but *predicted* by a first-stage model (the CNN), and that prediction carries error. Suppose the structural object of interest is

$$
y_i = \alpha + \beta\, f_i^{\ast} + \mathbf{w}_i^{\top}\boldsymbol{\gamma} + \varepsilon_i,
$$ {#eq-structural-image}

where $y_i$ is an outcome (clicks, likes, sales), $f_i^{\ast}$ is the *true* image feature (true aesthetic appeal, true logo prominence), and $\mathbf{w}_i$ are controls. The analyst does not observe $f_i^{\ast}$; the CNN supplies an estimate $\hat f_i = f_i^{\ast} + u_i$.
Regressing $y_i$ on $\hat f_i$ instead of $f_i^{\ast}$ raises three distinct hazards.

**Attenuation from classical measurement error.** If $u_i$ is mean-zero noise independent of $f_i^{\ast}$ and $\varepsilon_i$, the ordinary-least-squares estimator is attenuated toward zero by the reliability ratio,

$$
\operatorname{plim}\hat\beta_{\text{OLS}}
= \beta \cdot \frac{\operatorname{Var}(f^{\ast})}{\operatorname{Var}(f^{\ast}) + \operatorname{Var}(u)} \;<\; \beta .
$$ {#eq-34-attenuation}

The inequality is written for $\beta > 0$; in general the slope shrinks toward zero in absolute value. The noisier the classifier (the smaller its reliability), the more the true effect is understated. A nonsignificant coefficient on an image feature can therefore mean "no effect" *or* "a noisy detector," and the two are not distinguishable without an estimate of $\operatorname{Var}(u)$.

**Non-classical, correlated error.** Worse, CNN error is rarely classical. A logo detector may systematically miss small or occluded logos; an aesthetic regressor may rate professionally lit photos higher *and* such photos may be posted by firms that also buy promotion. When $\operatorname{Cov}(u_i, \varepsilon_i) \neq 0$ (because the same unobserved factor, such as a firm's production budget, drives both the classifier's error and the outcome), $\hat\beta$ is biased in an *unknown* direction and no reliability correction recovers $\beta$. This is the image analogue of the endogeneity problems treated throughout @sec-causal-inference, and it is the reason image features must be validated against human labels and, where possible, instrumented or held fixed within an experiment.

**Forbidden-regression and look-elsewhere risks.** Because embeddings are high-dimensional, an analyst can search across hundreds of learned dimensions for one that correlates with the outcome and report it as a "discovered visual driver." Without pre-registration or a held-out test, this is overfitting dressed as inference.
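
A small simulation can make the look-elsewhere risk concrete. The sketch below is illustrative only and is not drawn from any study cited here: it assumes a hypothetical 300-dimensional embedding (`emb`) made of pure noise and an outcome `y` generated independently of it, searches one half of the sample for the dimension most correlated with the outcome, and then re-tests that same dimension on the held-out half. All names and sizes (`n`, `k`, `emb`, `train`) are assumptions chosen for the example.

```r
set.seed(1)
n <- 2000   # number of images (hypothetical)
k <- 300    # number of embedding dimensions (hypothetical)

# Stand-in embedding: pure noise, unrelated to the outcome by construction.
emb <- matrix(rnorm(n * k), nrow = n, ncol = k)
y   <- rnorm(n)

# First half is used for the search, second half is held out.
train <- seq_len(n) <= n / 2

# Specification search: correlation of every dimension with y in the search half.
r_search <- as.vector(cor(emb[train, ], y[train]))
best     <- which.max(abs(r_search))

# Naive p-value for the selected dimension (ignores that k dimensions were searched).
p_search <- cor.test(emb[train, best], y[train])$p.value

# Honest check: the same dimension, evaluated once on the held-out half.
p_holdout <- cor.test(emb[!train, best], y[!train])$p.value

cat("Selected dimension        :", best, "\n")
cat("p-value in search half    :", signif(p_search, 3), "\n")
cat("p-value in held-out half  :", signif(p_holdout, 3), "\n")
```

Because the embedding is noise by construction, any "driver" the search finds is spurious. With hundreds of candidates, the selected dimension will almost always look significant in the search half, while the held-out test has no reason to agree. Fixing the dimension before touching the outcome, or reporting only the held-out result, removes the problem.
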
The discipline that text-as-data imposes (split the sample, fix the representation before looking at outcomes, report out-of-sample fit) applies with full force here.

Three remedies recur. First, **validate the feature** against human labels on a held-out subsample and report the classifier's accuracy or correlation with ground truth, so the reader can gauge $\operatorname{Var}(u)$. Second, **experimentally manipulate** the feature rather than only observe it: the cleanest images-as-data designs randomize which image a user sees (an A/B test on ad creative) so that $f_i$ is assigned, not estimated from confounded content, which severs $\operatorname{Cov}(u_i, \varepsilon_i)$ by design. Third, **correct or bound** the bias: correct attenuation with an estimated reliability ratio when error is plausibly classical and a reliability estimate is available; when no such estimate is available, report the OLS estimate as a conservative (attenuated) lower bound on a positive effect, a reading that itself presumes classical error. The simulation below makes the attenuation in @eq-34-attenuation concrete and shows that a reliability correction recovers the true slope when error is classical.

```{r generated-regressor-attenuation, message=FALSE, warning=FALSE}
set.seed(34)
n <- 4000
beta_true <- 1.5

# True (latent) image feature f* and outcome y generated from it.
f_star <- rnorm(n, 0, 1)
y <- 2 + beta_true * f_star + rnorm(n, 0, 1)

# A noisy CNN measures f* with classical error: reliability rho = Var(f*)/Var(fhat).
sigma_u <- 0.9
f_hat <- f_star + rnorm(n, 0, sigma_u)

rho       <- var(f_star) / var(f_hat)        # estimable from a validation set
naive     <- coef(lm(y ~ f_hat))["f_hat"]    # attenuated OLS
corrected <- naive / rho                     # reliability-ratio correction

cat("True slope beta         :", beta_true, "\n")
cat("Reliability ratio (rho) :", round(rho, 3), "\n")
cat("Naive OLS on f_hat      :", round(naive, 3), " (attenuated toward 0)\n")
cat("Reliability-corrected   :", round(corrected, 3), "\n")
```

The naive slope is pulled toward zero by the reliability ratio, up to sampling noise; dividing by $\rho$, which a validation set identifies, restores the truth.
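
As a hedged companion to the simulation above, the sketch below shows what happens when the classical-error assumption fails. It assumes a hypothetical unobserved factor, `budget`, that raises both the detector's error and the outcome, so that $\operatorname{Cov}(u_i, \varepsilon_i) > 0$. The loadings 0.8 and the noise scale 0.5 are arbitrary illustrative choices, not estimates from any data.

```r
set.seed(35)
n <- 4000
beta_true <- 1.5

# Latent image feature and a hypothetical unobserved confounder.
f_star <- rnorm(n)
budget <- rnorm(n)

# The confounder raises the outcome directly (it enters the structural error) ...
eps <- 0.8 * budget + rnorm(n)
y   <- 2 + beta_true * f_star + eps

# ... and also inflates the detector's score, so Cov(u, eps) > 0.
u     <- 0.8 * budget + rnorm(n, 0, 0.5)
f_hat <- f_star + u

rho       <- var(f_star) / var(f_hat)                 # same reliability ratio as before
naive     <- unname(coef(lm(y ~ f_hat))["f_hat"])     # OLS on the noisy feature
corrected <- naive / rho                              # reliability "correction"

cat("True slope beta       :", beta_true, "\n")
cat("Naive OLS on f_hat    :", round(naive, 3), "\n")
cat("Reliability-corrected :", round(corrected, 3), "\n")
```

With $u$ uncorrelated with $f^{\ast}$, the OLS limit is $[\beta \operatorname{Var}(f^{\ast}) + \operatorname{Cov}(u, \varepsilon)] / [\operatorname{Var}(f^{\ast}) + \operatorname{Var}(u)]$, so dividing by $\rho$ leaves the term $\operatorname{Cov}(u, \varepsilon) / \operatorname{Var}(f^{\ast})$ in place. In this design that term is positive and the "corrected" slope overshoots the truth; a negative covariance would push it the other way.
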
The correction works *only* under classical error; when the detector's mistakes correlate with the outcome, no such fix exists and design-based identification is the only honest route.

## Applications in Advertising and Social Media

### Advertising Creative

The richest application is the measurement and optimization of advertising **creative**. The visual content of an ad (its complexity, color, brand prominence, and pacing) drives whether viewers attend, remember, and respond, and image (and video-frame) features make these properties measurable at the scale of entire ad libraries. The construct lineage is clear: visual complexity and gaze (@pieters2007; @pieters2010), the arousal-potential account of aesthetic preference (@Berlyne_1960), and the visual sources of brand meaning (@batra2019) all become *features* once a CNN is in the loop. Video advertising extends this from stills to sequences. The dynamics of attention within a video ad (when a brand appears, how scenes are cut, what holds the viewer) predict skipping and recall; @teixeira2010 show that inducing *joy* and *surprise* sustains attention to online video ads, and @teixeira2014 that emotional and attentional dynamics jointly govern whether viewers watch through or zap. Per-frame image features are precisely how those dynamics are operationalized at scale, a theme picked up in the video-marketing literature (@rajaram2020video; @leyva2021video) and developed further in @sec-advertising. The methodological caveat from @sec-image-econometrics binds hardest here: observational creative comparisons confound the image with the campaign that bought it, so the credible designs hold the audience fixed and randomize the creative.

### Social Media and User-Generated Images

On social platforms, the image *is* the post. Engagement, reach, and brand-relevant outcomes depend on visual content the firm often does not produce, which makes image features the only way to measure exposure and appeal across user-generated corpora.
Three measurement problems recur. **Incidental brand exposure** is captured by logo detection and the share-of-visual-voice it yields, revealing brand presence that text-only listening (@sec-online-environments) misses entirely. **Image-driven engagement** is studied by relating content and aesthetic features (people versus products, warmth, professionalism, complexity) to likes, shares, and comments; @Li_2019 show empirically which image-content features raise social-media engagement, and @hartmann2021power and @hartmann2023more develop image-mining pipelines for brand-relevant social-media content and relate visual elements to engagement. **Influencer aesthetics** connect a creator's visual style to follower growth and sponsorship value (@sec-influencer-marketing), with content features quantifying the self-presentation strategies that drive parasocial response. A recurring, sobering finding across this work is that *what is depicted* (content) and *how it looks* (aesthetics) carry largely separable effects, so a study that conflates them (loading both onto a single "image quality" score) misattributes one to the other.

::: {.callout-note}
The text-and-image fusion frontier. Marketing stimuli are rarely image-only: an ad, a post, or a listing pairs a picture with words. **Multimodal** models learn a joint embedding in which images and the text that co-occurs with them are mapped to the same space, so a post can be represented by a single vector capturing both. This lets the analyst ask whether the image and caption are *congruent* (incongruence is itself a measurable, theory-relevant feature) and connects images-as-data to the text methods of @sec-text-as-data and the broader machine-learning toolkit of @sec-ai-ml.
@witte2026language show that transformer-based vision-language models can classify marketing images directly from natural-language prompts, often without task-specific training, and the same model class can now *generate* marketing visuals, raising the question of whether it reaches human-level creative quality [@hartmann2025generative]. The fusion of image with text and other modalities is the subject of @sec-multimodal-fusion.
:::

## Pitfalls and Identification

Five recurring failures separate credible image-as-data work from decorative work.

The first is **treating predictions as measurements**, ignoring the generated-regressor problem of @sec-image-econometrics. Always report classifier validation and, where error may be non-classical, prefer experimental assignment of the image.

The second is **dataset shift**. A backbone pre-trained on ImageNet encodes the visual statistics of web photos circa its training era; applied to a domain it never saw (X-ray packaging, niche product categories, a new platform's aesthetic), its features may be uninformative or biased. Validate on in-domain data, and fine-tune when the target distribution diverges.

The third is **spurious cues and shortcut learning**. CNNs latch onto whatever predicts the label in training, including artifacts (a watermark, a background, a camera type) that correlate with the outcome by accident. A classifier that appears to "detect luxury" may have learned to detect studio lighting. Probe what the model actually responds to before interpreting a feature substantively.

The fourth is **conflating content and aesthetics**, collapsing *what is shown* and *how it looks* into one score when the social-media evidence says their effects are separable. Keep the two families of features distinct and let the data assign their effects.

The fifth is **ethics and privacy**, especially for facial and demographic inference.
Subgroup-correlated error and the inference of protected attributes raise fairness and legal exposure (@sec-privacy); coarse, validated labels are preferable to fine-grained inferences, and audits by subgroup are not optional.

Underlying all five is a single discipline borrowed from the rest of this book: fix the representation before looking at outcomes, validate the features against ground truth, prefer designs that randomize the image over those that merely observe it, and report the bias that remains. An image feature is a *measurement*, and like every measurement in @sec-measurement-scales it must be shown to be reliable and valid before its coefficient can be believed. Images are one branch of the broader unstructured-data program [@balducci2018unstructured]; for a survey focused on the visual channel and its methods, see @dzyabura2021imageanalytics.

::: {.callout-tip}
## Replication resources: image analytics

The hand-engineered color/complexity features and the frozen-embedding transfer-learning demonstration in this chapter run on open R/Python tooling (`magick`/`imager` in R; Pillow, OpenCV, and pretrained `torchvision`/`timm` backbones in Python). The canonical backbones ship reference implementations (ResNet at `github.com/KaimingHe/deep-residual-networks` [@he2016resnet]; ImageNet-pretrained AlexNet/Inception weights [@krizhevsky2017imagenet; @szegedy2015googlenet] are bundled in every deep-learning framework), and SIFT [@lowe2004sift] is in OpenCV.
The empirical marketing studies cited here (e.g., @liu2020, @hartmann2021brandselfies, @zhang2022goodimage) generally rely on proprietary image corpora; confirm any code/data release on the article page rather than assuming one.
:::

## Key Takeaways

- An image is a high-dimensional tensor (@eq-image-tensor); useful image analysis replaces raw pixels with **learned or hand-engineered features** whose dimensions a marketing theory can interpret.
- The **convolutional network** (@eq-convolution) builds locality and translation structure in by construction; in marketing it is almost always used via **transfer learning**, with a frozen pre-trained backbone supplying an embedding to a small downstream model.
- Marketing image features divide into **brand** (logo and perceptual identity, as in @liu2020), **aesthetic** (beauty and arousal potential, grounded in @Berlyne_1960), and **content** (objects, scenes, faces) families.
- Image features are **generated regressors**: classical error attenuates effects by the reliability ratio (@eq-34-attenuation), and non-classical, outcome-correlated error biases them unpredictably. Validate, and prefer designs that randomize the image.
- The strongest applications (advertising creative and social-media engagement) pair image features with **experimental assignment** of the visual stimulus, severing the confound between what an image looks like and the campaign that produced it.