45 Images as Data

Marketing has always been a visual discipline. A logo, a package, an advertisement, an influencer’s selfie, a product photograph on a retail site—each is a deliberately constructed image meant to move attention, belief, and demand. Until recently those images were data only in a metaphorical sense: researchers looked at them, coded them by hand, and ran small experiments on a handful of manipulated stimuli. What has changed is that an image can now be turned into numbers—high-dimensional, machine-readable features—at the scale of millions of photographs, and those numbers can be entered into the same demand systems, choice models, and regressions that the rest of this book develops. This chapter is about how to do that responsibly: how to extract brand, aesthetic, and content features from images, what the deep-learning machinery underneath actually computes, and how to deploy the resulting features in advertising and social-media research without fooling oneself.

The intellectual move mirrors the one made for text (Chapter 43). There, unstructured language becomes a document–term matrix or a sequence of embeddings; here, an unstructured raster of pixels becomes a feature vector. In both cases the representation is lossy and learned, and in both cases the central empirical danger is the same: the features are generated by a model whose errors may be correlated with the very outcomes the researcher wants to explain, so naive regression can confound measurement with effect. The payoff for getting it right is large. Image features let the analyst measure constructs that were previously locked inside qualitative judgment (a brand’s visual identity, an ad’s aesthetic appeal, the “warmth” of a product photo) and relate them to clicks, engagement, sales, and firm value.

The chapter proceeds from pixels to constructs to applications. It first fixes what an image is as a mathematical object and what we mean by an image feature. It then develops the workhorse of modern computer vision—the convolutional neural network—giving its estimator, its loss, and the assumptions under which a feature extracted from it is meaningful. With that in hand it turns to the three families of marketing-relevant features (brand, aesthetic, content) and to the econometrics of using generated image features as regressors. It closes with applications to advertising and social media, and with the identification pitfalls that separate a credible image-as-data study from a decorative one.

45.1 What an Image Is

Formally, a digital image is a function sampled on a grid. A color image of height \(H\) and width \(W\) is a third-order tensor

\[ \mathbf{X} \in [0,1]^{H \times W \times C}, \qquad C = 3, \tag{45.1}\]

where the three channels \(C\) hold red, green, and blue intensities and each entry \(\mathbf{X}_{ijc}\) is a normalized pixel value. A modest \(224 \times 224\) RGB image—the canonical input size for many vision models—already lives in a space of dimension \(224 \times 224 \times 3 \approx 1.5 \times 10^{5}\). This is the curse of dimensionality in its rawest form: the number of pixels vastly exceeds the number of labeled examples in any marketing dataset, and pixels are individually almost meaningless. The pixel at position \((112, 60)\) carries no stable interpretation; what matters is the arrangement of pixels into edges, textures, objects, and scenes.
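
To make Equation 45.1 concrete, the following minimal sketch builds a tensor of the canonical size in base R. The pixel values are uniform noise standing in for a decoded photograph (an assumption; a real pipeline would read the file with an image library), and the names `H`, `W`, `C`, and `X` simply mirror the notation above.

```r
# A hypothetical 224 x 224 RGB image filled with uniform noise in [0, 1].
set.seed(1)
H <- 224; W <- 224; C <- 3
X <- array(runif(H * W * C), dim = c(H, W, C))

dim(X)          # 224 224 3
length(X)       # 150528 raw inputs, roughly 1.5e5
X[112, 60, ]    # the three channel intensities at one pixel
range(X)        # every entry lies inside [0, 1]
```

Even this single small image has more raw inputs than most marketing datasets have labeled observations, which is the dimensionality problem described above.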

Two properties of images dictate everything about how they are modeled. First, locality: meaningful structure (an edge, a corner) is built from nearby pixels, so a useful representation should aggregate local neighborhoods before global ones. Second, translation structure: a logo is the same logo whether it sits in the top-left or the center of the frame, so a useful representation should respond similarly to a pattern regardless of where it appears. A model that ignored these properties—say, a fully connected network treating each pixel as an unrelated input—would have to relearn “what an edge looks like” separately at every location and would need astronomically more data to do so. The architectures that dominate computer vision are precisely those that build locality and translation structure in by construction.

Note

An image feature is any function \(f: [0,1]^{H \times W \times C} \to \mathbb{R}^{d}\) that maps a raw image to a lower-dimensional vector intended to capture a construct of interest. Features range from hand-engineered and interpretable (mean saturation, number of detected faces, fraction of the frame occupied by a brand logo) to learned and opaque (the 2,048-dimensional penultimate-layer activations of a deep network). The art of images-as-data is choosing features whose dimensions a marketing theory can speak about.

45.2 Classical Features: Color, Composition, and Hand-Engineering

Before deep learning, vision in marketing relied on hand-engineered features: quantities a researcher computes with an explicit formula and can defend to a referee line by line. They remain valuable precisely because they are transparent, and several map directly onto long-standing aesthetic theory.

Color is the most tractable. RGB is poor for human-meaningful description because its axes (red, green, blue intensity) do not correspond to how people talk about color, so analysts transform to the HSV space (hue, saturation, value), in which hue is the dominant wavelength, saturation the colorfulness, and value the brightness. From an image one can compute the mean and dispersion of each channel, the share of warm versus cool hues, and the colorfulness index, and relate these to response. This connects to a classical account of aesthetic preference: Berlyne (1960) argued that hedonic value is an inverted-U function of arousal potential—stimuli that are too simple bore and too complex overwhelm, with moderate complexity, novelty, and contrast preferred. Color statistics, visual entropy, and edge density are all operationalizations of arousal potential, and the inverted-U is a recurring empirical shape in this literature.

Composition features quantify where content sits and how much of it there is. Visual complexity—the amount and variety of detail in an image—has a long pedigree in advertising research, where Pieters, Wedel, and Batra (2010) distinguish feature complexity (irregular, dense visual elements) from design complexity (the elaborateness of the deliberate arrangement) and show the two have opposite effects on attention and attitude: feature complexity hurts brand attention while design complexity helps it. Earlier, Pieters, Wedel, and Zhang (2007) established that the eye-trackable structure of an ad—how gaze is distributed across the brand, the pictorial, and the text elements—predicts memory. Visual complexity is commonly proxied by file size after compression, by edge density from a Sobel or Canny filter, or by the entropy of the color histogram. The point of cataloguing these is not nostalgia: hand-engineered features are still the right tool when the construct is simple, the sample is small, or interpretability is paramount, and they make excellent controls alongside learned features.
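
Two of the proxies named above, the share of warm hues and the entropy of the color histogram, can be sketched in base R. This is an illustration under stated assumptions rather than a standard definition: the warm band (hue below 1/6 or above 11/12 on the unit hue circle), the saturation floor of 0.1, and the choice of four bins per channel are all arbitrary, and the helper names `to_hsv`, `warm_share`, and `hist_entropy` are hypothetical.

```r
# Per-pixel HSV values for an [H, W, 3] array with entries in [0, 1].
to_hsv <- function(im) {
  grDevices::rgb2hsv(as.vector(im[, , 1]), as.vector(im[, , 2]),
                     as.vector(im[, , 3]), maxColorValue = 1)
}

# Share of clearly colored pixels whose hue falls in a red-to-yellow band.
# The band and the saturation floor are illustrative assumptions.
warm_share <- function(im, min_sat = 0.1) {
  hsv <- to_hsv(im)
  colored <- hsv["s", ] > min_sat
  if (!any(colored)) return(NA_real_)
  h <- hsv["h", colored]
  mean(h < 1 / 6 | h > 11 / 12)
}

# Shannon entropy (in bits) of a coarse joint color histogram.
hist_entropy <- function(im, bins = 4) {
  q <- pmin(floor(im * bins), bins - 1)       # bin index 0..bins-1 per channel
  code <- q[, , 1] * bins^2 + q[, , 2] * bins + q[, , 3]
  p <- as.vector(table(code)) / length(code)
  -sum(p * log2(p))
}

set.seed(2)
flat_orange <- array(rep(c(0.85, 0.35, 0.15), each = 64 * 64), dim = c(64, 64, 3))
noise       <- array(runif(64 * 64 * 3), dim = c(64, 64, 3))

c(warm = warm_share(flat_orange), entropy = hist_entropy(flat_orange))
c(warm = warm_share(noise),       entropy = hist_entropy(noise))
```

A single flat color occupies one histogram bin and so has zero entropy, while uniform noise spreads over nearly all 64 bins and approaches the 6-bit maximum. Both ends of that range are poor stimuli under the inverted-U account.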

A compact illustration computes interpretable color and complexity features for a synthetic image and shows how they vary with content.

Code

set.seed(34)

# Build three synthetic 64x64 RGB images with known properties:
# (a) a calm, low-saturation gray scene; (b) a vivid warm scene;
# (c) a high-complexity noisy scene. Each is an array [H, W, C] in [0,1].
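make_image <- function(base, noise_sd) {
  # rep(each = ...) gives every channel its own base value; passing `base`
  # straight to array() would recycle it down the rows, not across channels.
  arr <- array(rep(base, each = 64 * 64), dim = c(64, 64, 3))
  arr <- arr + array(rnorm(64 * 64 * 3, 0, noise_sd), dim = dim(arr))
  pmin(pmax(arr, 0), 1)
}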
img_calm  <- make_image(c(0.55, 0.55, 0.55), 0.02)  # near-gray, low noise
img_warm  <- make_image(c(0.85, 0.35, 0.15), 0.04)  # warm (red/orange)
img_busy  <- make_image(c(0.50, 0.50, 0.50), 0.25)  # high-variance "busy"

# RGB -> HSV saturation and a colorfulness proxy (Hasler-Susstrunk style).
sat_value <- function(im) {
  mx <- pmax(im[, , 1], im[, , 2], im[, , 3])
  mn <- pmin(im[, , 1], im[, , 2], im[, , 3])
  mean(ifelse(mx > 0, (mx - mn) / mx, 0))            # mean saturation
}
colorfulness <- function(im) {
  rg <- im[, , 1] - im[, , 2]
  yb <- 0.5 * (im[, , 1] + im[, , 2]) - im[, , 3]
  sqrt(sd(rg)^2 + sd(yb)^2) + 0.3 * sqrt(mean(rg)^2 + mean(yb)^2)
}
# Edge density as a translation-invariant complexity proxy: mean absolute
# horizontal+vertical gradient of the luminance channel.
edge_density <- function(im) {
  lum <- 0.299 * im[, , 1] + 0.587 * im[, , 2] + 0.114 * im[, , 3]
  gx <- abs(lum[-1, ] - lum[-nrow(lum), ])
  gy <- abs(lum[, -1] - lum[, -ncol(lum)])
  mean(gx) + mean(gy)
}

features <- data.frame(
  image        = c("calm", "warm", "busy"),
  saturation   = c(sat_value(img_calm),  sat_value(img_warm),  sat_value(img_busy)),
  colorfulness = c(colorfulness(img_calm), colorfulness(img_warm), colorfulness(img_busy)),
  edge_density = c(edge_density(img_calm), edge_density(img_warm), edge_density(img_busy))
)
features[, -1] <- round(features[, -1], 3)
knitr::kable(features, caption = "Interpretable color and complexity features for three synthetic images.")

Interpretable color and complexity features for three synthetic images.
image	saturation	colorfulness	edge_density
calm	0.059	0.037	0.030
warm	0.823	0.679	0.361
busy	0.577	0.452	0.361

The warm image scores high on saturation and colorfulness; the busy image scores high on edge density, the digital analogue of Berlyne’s (1960) arousal potential. These features are cheap, reproducible, and interpretable, but they are blind to meaning: they cannot tell a logo from a face. For meaning, the field turns to learned representations.

45.3 Deep Representations: The Convolutional Neural Network

The central tool of modern computer vision is the convolutional neural network (CNN). Its design is a direct response to the two properties of images named above. Rather than connect every pixel to every hidden unit, a CNN slides small learnable filters across the image, so each unit sees only a local neighborhood (locality) and the same filter is applied at every position (translation structure, and an enormous reduction in parameters).

45.3.1 The Convolution Operation

A convolutional layer applies a bank of small filters (kernels) to its input. Let the input be a feature map \(\mathbf{Z} \in \mathbb{R}^{H \times W \times C_{\text{in}}}\) and let a single filter be \(\mathbf{K} \in \mathbb{R}^{k \times k \times C_{\text{in}}}\) with bias \(b\). The output at spatial location \((i,j)\) is

\[ (\mathbf{Z} * \mathbf{K})_{ij} = \sigma\!\left( b + \sum_{u=1}^{k}\sum_{v=1}^{k}\sum_{c=1}^{C_{\text{in}}} \mathbf{K}_{uvc}\,\mathbf{Z}_{\,i+u,\,j+v,\,c} \right), \tag{45.2}\]

where \(\sigma(\cdot)\) is a nonlinearity, almost always the rectified linear unit \(\sigma(z) = \max(0, z)\). A layer holds \(C_{\text{out}}\) such filters, producing an output tensor with \(C_{\text{out}}\) channels. Three properties make Equation 45.2 the right primitive. Parameter sharing: one filter, with \(k^2 C_{\text{in}} + 1\) parameters, is reused at every location, so a layer learns “detect this pattern anywhere” rather than memorizing locations. Local connectivity: each output depends only on a \(k \times k\) window, encoding locality. Translation equivariance: shifting the input shifts the output identically, so a logo detector fires wherever the logo appears.
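
Equation 45.2 and the equivariance property can be checked with a deliberately slow, loop-based sketch in base R. It assumes a "valid" convolution (no padding, stride one) and indexes each window from its top-left corner, so its output position is offset by one from the \(i+u,\ j+v\) convention in the equation. The function `conv_relu`, the helper `place_motif`, and the "X"-shaped motif are hypothetical names and data chosen for illustration.

```r
# Valid (no-padding) convolution of an [H, W, C_in] input with a single
# [k, k, C_in] filter, followed by a ReLU, in the spirit of Equation 45.2.
conv_relu <- function(Z, K, b = 0) {
  k <- dim(K)[1]
  H_out <- dim(Z)[1] - k + 1
  W_out <- dim(Z)[2] - k + 1
  out <- matrix(0, H_out, W_out)
  for (i in seq_len(H_out)) {
    for (j in seq_len(W_out)) {
      patch <- Z[i:(i + k - 1), j:(j + k - 1), , drop = FALSE]
      out[i, j] <- max(0, b + sum(K * patch))   # ReLU of the filter response
    }
  }
  out
}

# A 3 x 3 "X" motif serves both as the pattern in the image and as the filter.
motif <- matrix(c(1, 0, 1,
                  0, 1, 0,
                  1, 0, 1), 3, 3, byrow = TRUE)
K <- array(motif, dim = c(3, 3, 1))

place_motif <- function(row, col) {
  Z <- array(0, dim = c(10, 10, 1))
  Z[row:(row + 2), col:(col + 2), 1] <- motif
  Z
}
out_a <- conv_relu(place_motif(2, 2), K, b = -2)
out_b <- conv_relu(place_motif(5, 6), K, b = -2)   # same motif, shifted by (3, 4)

which(out_a == max(out_a), arr.ind = TRUE)    # peak at row 2, col 2
which(out_b == max(out_b), arr.ind = TRUE)    # peak at row 5, col 6
all.equal(out_a[1:5, 1:4], out_b[4:8, 5:8])   # TRUE: the response map shifts with the input
length(K) + 1                                 # 10 parameters: k^2 * C_in + 1
```

The same ten parameters detect the motif at either location, which is parameter sharing and translation equivariance in miniature. The negative bias combined with the ReLU silences partial matches.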

Convolutional layers are interleaved with pooling layers that downsample—most commonly max pooling, which reports the maximum activation in each small window—building tolerance to small shifts and shrinking the spatial resolution while the channel dimension grows. Stacking these operations yields a representational hierarchy that has been verified empirically: early layers respond to oriented edges and color blobs, middle layers to textures and motifs, and late layers to object parts and whole objects (Zeiler and Fergus 2014). Figure 45.1 sketches the pipeline from raw pixels to a task head.

flowchart LR
  A["Raw image<br/>H × W × 3"] --> B["Conv + ReLU<br/>(edges, color)"]
  B --> C["Pool<br/>(downsample)"]
  C --> D["Conv + ReLU<br/>(textures, motifs)"]
  D --> E["Conv + ReLU<br/>(object parts)"]
  E --> F["Embedding z ∈ ℝ^d<br/>(penultimate layer)"]
  F --> G["Task head"]
  G --> H1["Classification<br/>(brand present?)"]
  G --> H2["Detection<br/>(where is the logo?)"]
  G --> H3["Regression<br/>(aesthetic score)"]

Figure 45.1: A convolutional network as a feature extractor with a task-specific head. Early layers encode generic low-level structure; deep layers encode semantic content. Marketing applications usually freeze the backbone and read off the penultimate-layer embedding.
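
The pooling step in the pipeline can be illustrated the same way. The sketch below assumes non-overlapping 2 × 2 windows (a common choice, but an assumption here), and `max_pool` is a hypothetical helper rather than any library's function.

```r
# Non-overlapping max pooling over size x size windows of a 2-D activation map.
max_pool <- function(A, size = 2) {
  H_out <- nrow(A) %/% size
  W_out <- ncol(A) %/% size
  out <- matrix(0, H_out, W_out)
  for (i in seq_len(H_out)) {
    for (j in seq_len(W_out)) {
      rows <- ((i - 1) * size + 1):(i * size)
      cols <- ((j - 1) * size + 1):(j * size)
      out[i, j] <- max(A[rows, cols])
    }
  }
  out
}

A           <- matrix(0, 8, 8); A[3, 3] <- 1            # one active unit
A_small     <- matrix(0, 8, 8); A_small[4, 4] <- 1      # moved within its window
A_large     <- matrix(0, 8, 8); A_large[5, 5] <- 1      # moved to the next window

dim(max_pool(A))                               # 4 4: resolution halves
identical(max_pool(A), max_pool(A_small))      # TRUE: small shift is absorbed
identical(max_pool(A), max_pool(A_large))      # FALSE: larger shift is not
```

Pooling therefore buys tolerance only to shifts smaller than the window. Invariance to larger displacements comes from stacking several convolution and pooling stages.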

45.3.2 The Estimator and Its Loss

A CNN with parameters \(\boldsymbol{\theta}\) (all filter weights and biases) defines a map \(g_{\boldsymbol{\theta}}: \mathbf{X} \mapsto \hat{\mathbf{y}}\). For a classification task with \(L\) labels (e.g., “contains a car,” “contains a dog”), the final layer produces a probability vector via the softmax, \(\hat{p}_\ell = \exp(s_\ell)/\sum_{m}\exp(s_m)\), where \(s_\ell\) are the network’s output scores (logits). Given labeled training data \(\{(\mathbf{X}_n, \mathbf{y}_n)\}_{n=1}^{N}\), the parameters minimize the cross-entropy loss with weight penalty,

\[ \hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}}\; -\frac{1}{N}\sum_{n=1}^{N}\sum_{\ell=1}^{L} y_{n\ell}\,\log \hat{p}_{n\ell}(\boldsymbol{\theta}) \;+\; \lambda \lVert \boldsymbol{\theta} \rVert_2^2 , \tag{45.3}\]

solved by stochastic gradient descent: gradients are computed on mini-batches by backpropagation and the parameters stepped against them. The penalty \(\lambda \lVert \boldsymbol{\theta}\rVert_2^2\) (weight decay) is one of several regularizers—dropout, data augmentation, and early stopping are others—that the over-parameterized regime makes essential. This is the same empirical-risk-minimization template developed for prediction in Chapter 65; what is special here is only the architecture of \(g_{\boldsymbol{\theta}}\).
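
The objective in Equation 45.3 and its mini-batch solution can be sketched without a deep-learning library by replacing the network \(g_{\boldsymbol{\theta}}\) with a linear map from a feature vector to the logits. That substitution is an assumption made for brevity: the loss, the penalty, and the update rule are the same, but there is no convolutional architecture here. The data, the step size, the batch size of 32, and all function names are hypothetical.

```r
set.seed(5)
softmax <- function(S) {
  S <- S - apply(S, 1, max)              # subtract each row's max for stability
  E <- exp(S)
  E / rowSums(E)
}
# Cross-entropy plus weight decay, as in Equation 45.3.
penalized_loss <- function(Theta, Z, Y, lambda) {
  P <- softmax(Z %*% Theta)
  -mean(rowSums(Y * log(P))) + lambda * sum(Theta^2)
}
# Gradient of that objective for a linear-softmax model.
penalized_grad <- function(Theta, Z, Y, lambda) {
  P <- softmax(Z %*% Theta)
  crossprod(Z, P - Y) / nrow(Z) + 2 * lambda * Theta
}

# Hypothetical data: N inputs with p features and L mutually exclusive labels.
N <- 600; p <- 5; L <- 3
Z <- matrix(rnorm(N * p), N, p)
labels <- max.col(Z %*% matrix(rnorm(p * L), p, L))
Y <- diag(L)[labels, ]                   # one-hot label matrix

Theta <- matrix(0, p, L); lambda <- 1e-3; step <- 0.5
loss_start <- penalized_loss(Theta, Z, Y, lambda)
for (iter in 1:300) {
  batch <- sample(N, 32)                 # draw a mini-batch
  Theta <- Theta - step * penalized_grad(Theta, Z[batch, ], Y[batch, ], lambda)
}
loss_end <- penalized_loss(Theta, Z, Y, lambda)
c(start = loss_start, end = loss_end)
```

With all parameters at zero the softmax is uniform, so the starting loss equals \(\log 3\). Each noisy step then moves the parameters against the mini-batch gradient, which is all that "stochastic gradient descent" means in this setting.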

Warning

A CNN trained by Equation 45.3 minimizes predictive loss, not the recovery of any causal or structural quantity. Its outputs are calibrated to the training distribution and labels, nothing more. Treating a predicted label or an embedding as if it were a ground-truth measurement—rather than an estimate with distribution-dependent error—is the original sin of images-as-data, and Section 45.5 shows what it costs.

45.3.3 Transfer Learning: Why Marketing Rarely Trains From Scratch

No marketing dataset is large enough to estimate the tens of millions of parameters in a modern CNN from scratch. The field instead relies on transfer learning: take a backbone network pre-trained on a massive general-purpose corpus (canonically ImageNet, with roughly a million labeled images across a thousand object categories), discard its task head, and reuse its learned representation. The canonical backbones trace the field’s progress—the deep CNN that launched it (Krizhevsky, Sutskever, and Hinton 2017), the deeper inception architecture (Szegedy et al. 2015), and the residual networks (He et al. 2016) that remain a default image encoder—and any of them can be downloaded pre-trained and reused. Two modes are common. In feature extraction, the backbone is frozen and the penultimate-layer activations \(\mathbf{z} = h_{\boldsymbol{\theta}}(\mathbf{X}) \in \mathbb{R}^{d}\)—the embedding—are treated as a fixed, off-the-shelf feature vector fed to a simple downstream model (logistic regression, gradient boosting) trained on the marketing labels. In fine-tuning, the backbone weights are unfrozen and updated, usually with a small learning rate, so the representation adapts to the target domain.

The justification is the hierarchy of Figure 45.1: early- and middle-layer features (edges, textures, parts) are nearly universal across natural images, so they transfer; only the late, task-specific layers must be relearned. The practical rule is to fine-tune more layers the larger and more domain-specific the target data, and freeze more the smaller and more generic. The same logic underlies the brand-image work of Liu, Dzyabura, and Mizik (2020), who train a multi-label convolutional network on consumer-created images to recover perceptual brand attributes (see Section 45.4.1). For most marketing studies—where the labeled sample numbers in the thousands, not millions—frozen-backbone feature extraction is both the safest and the most reproducible choice.

Code

set.seed(34)

# Simulate the *output* of a frozen CNN backbone: each image is represented by a
# d-dimensional embedding z. In practice z = h_theta(image) from a pretrained
# network; here we generate embeddings whose geometry encodes two latent
# "visual styles" so a downstream classifier can separate them.
d <- 16; n_per <- 150
style_A_mean <- rnorm(d, 0, 1)            # e.g., "minimalist product shot"
style_B_mean <- style_A_mean + rnorm(d, 0, 1.2)  # e.g., "lifestyle scene"

Z_A <- matrix(rnorm(n_per * d), n_per, d) + matrix(style_A_mean, n_per, d, byrow = TRUE)
Z_B <- matrix(rnorm(n_per * d), n_per, d) + matrix(style_B_mean, n_per, d, byrow = TRUE)
Z   <- rbind(Z_A, Z_B)
y   <- factor(rep(c("minimalist", "lifestyle"), each = n_per))

# Downstream task: a simple logistic head on the frozen embedding (no CNN
# retraining). This is "feature extraction" transfer learning in miniature.
df  <- data.frame(y = y, Z)
idx <- sample(nrow(df), 0.7 * nrow(df))
fit <- glm(y ~ ., data = df[idx, ], family = binomial)
pred <- ifelse(predict(fit, df[-idx, ], type = "response") > 0.5,
               "minimalist", "lifestyle")
acc <- mean(pred == df[-idx, "y"])
cat("Held-out accuracy of logistic head on frozen embedding:",
    round(acc, 3), "\n")
#> Held-out accuracy of logistic head on frozen embedding: 0.989

The example deliberately separates the expensive, transferable part (the embedding, treated as given) from the cheap, task-specific part (a logistic head the analyst actually estimates)—the division of labor that makes images-as-data feasible in marketing.

45.4 Three Families of Marketing Image Features

Marketing-relevant image features fall into three families—brand, aesthetic, and content—distinguished by the construct they measure and the machinery they require. Table 45.1 summarizes them; the subsections that follow develop each in turn.

Table 45.1: Three families of marketing image features, the construct each targets, and how it is measured.

Family	Construct	Method	Output
Brand	Presence/prominence of brand marks; visual brand identity	Logo detection; learned brand-attribute classifiers	Logo box, area share, brand-attribute scores
Aesthetic	Beauty, professionalism, arousal potential, style	Aesthetic regressors; hand-engineered color/complexity	Continuous aesthetic / arousal score
Content	Objects, scenes, faces, emotions, activities depicted	Object detection; scene & face recognition	Object labels & counts, scene class, facial affect

45.4.1 Brand Features

Brand features answer two questions: is the brand here, and how prominently? The first is logo detection—an object-detection task that returns bounding boxes and confidence scores for known marks—and it underwrites a quantity of direct managerial interest: the share of visual voice, the fraction of image area (or of impressions) a brand’s marks occupy across a corpus of user- or sponsor-generated images. This matters because brand exposure on social media is increasingly incidental, embedded in content the firm does not control, so counting hashtags understates true exposure while logo detection captures it. The notion of prominence—the conspicuousness of a brand’s mark—carries status-signaling consequences developed in Chapter 11, and image data let it be measured at scale rather than coded by hand. The design of the mark itself is consequential: Luffarelli, Mukesh, and Mahmood (2019) show that descriptive logos (those that visually signal what the brand does) raise brand equity, and Mahmood, Luffarelli, and Mukesh (2019) find that complex visual logo cues shape equity-crowdfunding outcomes—logo features that computer vision can now extract at scale. How consumers depict brands in their own photos is itself a measurable typology: Hartmann et al. (2021a) use computer vision to distinguish “brand selfies” (the consumer in frame with the product) from “packshots” and show the two drive engagement differently.
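
Once a logo detector has produced bounding boxes, the share of visual voice reduces to simple arithmetic. The sketch below assumes a hypothetical detector output (one row per detected box, with pixel coordinates and a confidence score), a fixed 224 × 224 image size, and a confidence cutoff of 0.5. None of these come from a specific tool, and overlapping boxes are not de-duplicated.

```r
# Hypothetical logo-detector output for a corpus of four images.
detections <- data.frame(
  image_id   = c(1, 1, 2, 3, 3),
  brand      = c("A", "B", "A", "B", "B"),
  x_min      = c(10, 120,   0,  50, 150),
  y_min      = c(10,  40,   0,  60,  20),
  x_max      = c(60, 200, 100,  90, 190),
  y_max      = c(60,  80,  50, 100,  40),
  confidence = c(0.95, 0.80, 0.40, 0.90, 0.70)
)
img_w <- 224; img_h <- 224; n_images <- 4   # image 4 contains no detected logo

# Keep confident detections and compute each box's area in pixels.
keep <- detections[detections$confidence >= 0.5, ]
keep$area <- (keep$x_max - keep$x_min) * (keep$y_max - keep$y_min)

area_by_brand <- tapply(keep$area, keep$brand, sum)
area_share    <- area_by_brand / (n_images * img_w * img_h)  # share of all pixels
visual_voice  <- area_by_brand / sum(area_by_brand)          # share among brands
incidence     <- tapply(keep$image_id, keep$brand,
                        function(id) length(unique(id))) / n_images

round(rbind(area_share, visual_voice, incidence), 3)
```

The three rows answer different questions: how much of the corpus a brand covers, how it compares with rivals, and how often it appears at all. The confidence cutoff is itself a measurement choice, since lowering it trades missed logos for false detections.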

The second, deeper question is whether an image conveys a brand’s perceptual identity. Liu, Dzyabura, and Mizik (2020) answer it directly: their BrandImageNet trains a multi-label CNN to detect abstract perceptual attributes (e.g., “glamorous,” “rugged,” “fun”) in ordinary consumer-created images, and the machine-derived portrait of a brand’s image tracks survey-based perceptions in near real time. Complementary approaches elicit brand perception by having consumers select images rather than rate scales (Dzyabura and Peres 2021) and mine brand attributes from the social-network structure around a brand (Culotta and Cutler 2016). This is a conceptual advance over logo counting, because it measures what a brand means visually, not merely whether it is shown. It also links image data to the brand-meaning construct of Batra (2019), who locates visual cues (advertising, packaging) among the primary sources from which brand associations are built. Crucially, such attribute classifiers are learned, so their outputs are estimates with brand- and context-dependent error; this is exactly the generated-regressor problem formalized below.

45.4.2 Aesthetic Features

Aesthetic features score an image on beauty, professionalism, or arousal potential. Two routes exist. The hand-engineered route assembles the color and complexity statistics of Section 45.2 into an interpretable index grounded in the Berlyne (1960) inverted-U. The learned route trains a CNN to predict human aesthetic ratings from images, yielding a single continuous “aesthetic score” per image. Each route has its place: learned scores typically predict human judgment better but are opaque and may encode spurious correlates of the training raters’ tastes; hand-engineered indices are weaker predictors but transparent and manipulable, which matters when the goal is advice (“raise saturation”) rather than mere ranking.

Aesthetics is not decoration. In advertising, the visual pleasure of an ad is a proximate driver of attention and attitude, mediated by the complexity mechanism of Pieters, Wedel, and Batra (2010) and by the gaze patterns of Pieters, Wedel, and Zhang (2007), and the rhetorical figuration of an image—visual metaphor, pun, and other tropes—shapes persuasion in ways parallel to verbal rhetoric (McQuarrie and Mick 1999). Specific perceptual choices carry meaning: color hue and saturation move brand personality and downstream response (Labrecque and Milne 2012; Labrecque, Patrick, and Milne 2013), and aesthetic styling trades off against perceived functionality rather than improving evaluation monotonically (Hagtvedt and Patrick 2014). In social commerce, image quality is a documented driver of demand: professionally composed product and listing photos lift engagement and conversion, and Zhang et al. (2022) show, using interpretable deep-learning image features on Airbnb listings, which properties of a photo (composition, brightness, depth) actually raise demand. The aesthetic score is therefore a feature with a theory behind it, not a black box to be regressed on outcomes for its own sake.

45.4.3 Content Features

Content features identify what is depicted. Object detection returns the set, count, and location of recognizable objects (a person, a beverage, a beach); scene recognition classifies the overall setting (kitchen, gym, nightclub); face analysis locates faces and, more controversially, infers attributes such as apparent emotion, age, or gaze direction. Content features connect images to substantive marketing theory: the presence of people versus products in a post, the warmth of depicted facial expressions, and the activities shown all map onto constructs—self-presentation, social proof, lifestyle congruence—that the social-media and influencer literatures (Chapter 17) treat as drivers of engagement.

Warning

Face- and demographic-inference features are the most fraught in this chapter. Apparent-emotion and inferred-demographic classifiers are trained on labeled data that encode the annotators’ cultural assumptions; their errors are systematically correlated with skin tone, age, and gender, raising both validity and fairness concerns. The privacy implications of inferring protected attributes from images are governed by the regimes in Chapter 24. Researchers should treat such features as low-confidence proxies, audit error rates by subgroup, and prefer coarse, well-validated labels (face present / absent) over fine-grained inferences (specific emotion, exact age) wherever the theory permits.
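
A subgroup audit of the kind recommended above needs only a validation sample with human labels. The sketch below simulates one under assumed error rates of 5% and 20% for two hypothetical subgroups. The labels, group names, and rates are invented for illustration and describe no real classifier.

```r
set.seed(3)
n <- 2000
group <- rep(c("group_1", "group_2"), each = n / 2)    # hypothetical subgroups
truth <- rbinom(n, 1, 0.5)                              # human-coded label
error_rate_true <- ifelse(group == "group_1", 0.05, 0.20)
flipped <- rbinom(n, 1, error_rate_true) == 1
pred <- ifelse(flipped, 1 - truth, truth)               # simulated classifier output

overall  <- mean(pred != truth)                         # pooled error rate
by_group <- tapply(pred != truth, group, mean)          # error rate by subgroup
# False-positive rate by subgroup: predicted 1 among cases whose true label is 0.
fpr <- tapply(pred[truth == 0] == 1, group[truth == 0], mean)

round(c(overall = overall, by_group), 3)
round(fpr, 3)
```

The pooled error rate sits between the two subgroup rates and would hide the disparity if reported alone, which is why the audit has to be broken out by group.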

45.5 The Econometrics of Generated Image Features

The defining methodological problem of images-as-data is that the features are generated regressors: they are not observed but predicted by a first-stage model (the CNN), and that prediction carries error. Suppose the structural object of interest is

\[ y_i = \alpha + \beta\, f_i^{\ast} + \mathbf{w}_i^{\top}\boldsymbol{\gamma} + \varepsilon_i, \tag{45.4}\]

where \(y_i\) is an outcome (clicks, likes, sales), \(f_i^{\ast}\) is the true image feature (true aesthetic appeal, true logo prominence), and \(\mathbf{w}_i\) are controls. The analyst does not observe \(f_i^{\ast}\); the CNN supplies an estimate \(\hat f_i = f_i^{\ast} + u_i\). Regressing \(y_i\) on \(\hat f_i\) instead of \(f_i^{\ast}\) raises three distinct hazards.

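Attenuation from classical measurement error. If \(u_i\) is mean-zero noise independent of \(f_i^{\ast}\) and \(\varepsilon_i\), the ordinary-least-squares estimator is inconsistent: its probability limit is shrunk toward zero by the reliability ratio,

\[ \operatorname{plim}\hat\beta_{\text{OLS}} = \beta \cdot \frac{\operatorname{Var}(f^{\ast})} {\operatorname{Var}(f^{\ast}) + \operatorname{Var}(u)}, \qquad \bigl\lvert \operatorname{plim}\hat\beta_{\text{OLS}} \bigr\rvert \;<\; \lvert \beta \rvert \ \text{ when } \beta \neq 0 \text{ and } \operatorname{Var}(u) > 0 . \tag{45.5}\]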

The noisier the classifier (the smaller its reliability), the more the true effect is understated. A nonsignificant coefficient on an image feature can therefore mean “no effect” or “a noisy detector,” and the two are not distinguishable without an estimate of \(\operatorname{Var}(u)\).
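
The step from Equation 45.4 to Equation 45.5 can be written out in one line. The sketch sets the controls \(\mathbf{w}_i\) aside (an assumption made for clarity; with controls the same argument applies to the variation in \(f^{\ast}\) left after partialling them out) and uses only the classical-error conditions \(\operatorname{Cov}(u, f^{\ast}) = \operatorname{Cov}(u, \varepsilon) = \operatorname{Cov}(f^{\ast}, \varepsilon) = 0\):

\[ \operatorname{plim}\hat\beta_{\text{OLS}} = \frac{\operatorname{Cov}(y, \hat f)}{\operatorname{Var}(\hat f)} = \frac{\operatorname{Cov}(\alpha + \beta f^{\ast} + \varepsilon,\; f^{\ast} + u)}{\operatorname{Var}(f^{\ast} + u)} = \frac{\beta \operatorname{Var}(f^{\ast})}{\operatorname{Var}(f^{\ast}) + \operatorname{Var}(u)} . \]

As a numerical illustration, with \(\operatorname{Var}(f^{\ast}) = 1\) and \(\operatorname{Var}(u) = 0.81\) the reliability ratio is \(1/1.81 \approx 0.55\), so a true slope of \(1.5\) would be estimated at about \(0.83\) in large samples.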

Non-classical, correlated error. Worse, CNN error is rarely classical. A logo detector may systematically miss small or occluded logos; an aesthetic regressor may rate professionally lit photos higher, and such photos may be posted by firms that also buy promotion. When \(\operatorname{Cov}(u_i, \varepsilon_i) \neq 0\), because the same unobserved factor (a firm’s production budget) drives both the classifier’s error and the outcome, \(\hat\beta\) is biased in an unknown direction and no reliability correction recovers \(\beta\). This is the image analogue of the endogeneity problems treated throughout Chapter 40, and it is the reason image features must be validated against ground truth and, where possible, instrumented or randomly assigned within an experiment.

Forbidden-regression and look-elsewhere risks. Because embeddings are high-dimensional, an analyst can search across hundreds of learned dimensions for one that correlates with the outcome and report it as a “discovered visual driver.” Without pre-registration or a held-out test, this is overfitting dressed as inference. The discipline that text-as-data imposes—split the sample, fix the representation before looking at outcomes, report out-of-sample fit—applies with full force here.
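
The look-elsewhere risk is easy to reproduce. In the sketch below, every dimension of a hypothetical 200-dimensional embedding is pure noise and the outcome is unrelated to all of them. The analyst nonetheless picks the dimension with the largest in-sample correlation and then checks it on a held-out half. The sample size, dimension, and split are arbitrary assumptions.

```r
set.seed(7)
n <- 400; d <- 200
Z <- matrix(rnorm(n * d), n, d)     # hypothetical embedding: pure noise
y <- rnorm(n)                       # outcome unrelated to every dimension

train <- 1:(n / 2)
test  <- (n / 2 + 1):n

# Search the training half for the dimension most correlated with the outcome.
r_train <- abs(cor(Z[train, ], y[train]))[, 1]
best    <- which.max(r_train)

c(dimension = best,
  in_sample = r_train[best],
  held_out  = abs(cor(Z[test, best], y[test])))
```

Because the maximum is taken over 200 candidates, the in-sample correlation of the selected dimension looks meaningful even though no dimension has any true relation to the outcome. The held-out correlation is the honest estimate, and it is the number to report.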

Three remedies recur. First, validate the feature against human labels on a held-out subsample and report the classifier’s accuracy or correlation with ground truth, so the reader can gauge \(\operatorname{Var}(u)\). Second, experimentally manipulate the feature rather than only observe it: the cleanest images-as-data designs randomize which image a user sees (an A/B test on ad creative) so that \(f_i\) is assigned, not estimated from confounded content, which severs \(\operatorname{Cov}(u_i, \varepsilon_i)\) by design. Third, correct or bound the bias. When error is plausibly classical, correct the attenuation with an estimated reliability ratio or, if no such estimate is available, report the OLS estimate as a conservative (attenuated) bound on the magnitude of the effect. When error may be non-classical, neither step is justified and the second remedy is the fallback. The simulation below makes the attenuation in Equation 45.5 concrete and shows that a reliability correction recovers the true slope when error is classical.

Code

set.seed(34)
n <- 4000
beta_true <- 1.5

# True (latent) image feature f* and outcome y generated from it.
f_star <- rnorm(n, 0, 1)
y <- 2 + beta_true * f_star + rnorm(n, 0, 1)

# A noisy CNN measures f* with classical error: reliability rho = Var(f*)/Var(fhat).
sigma_u <- 0.9
f_hat <- f_star + rnorm(n, 0, sigma_u)
rho   <- var(f_star) / var(f_hat)                # estimable from a validation set

naive  <- coef(lm(y ~ f_hat))["f_hat"]           # attenuated OLS
corrected <- naive / rho                         # reliability-ratio correction

cat("True slope beta          :", beta_true, "\n")
#> True slope beta          : 1.5
cat("Reliability ratio (rho)  :", round(rho, 3), "\n")
#> Reliability ratio (rho)  : 0.556
cat("Naive OLS on f_hat       :", round(naive, 3),
    " (attenuated toward 0)\n")
#> Naive OLS on f_hat       : 0.828  (attenuated toward 0)
cat("Reliability-corrected    :", round(corrected, 3), "\n")
#> Reliability-corrected    : 1.49

The naive slope is pulled toward zero by approximately the reliability ratio (0.828 / 1.5 ≈ 0.552, against an estimated \(\rho\) of 0.556); dividing by \(\rho\), which a validation set identifies, restores the true slope up to sampling error. The correction works only under classical error; when the detector’s mistakes correlate with the outcome, no such fix exists and design-based identification is the only honest route.
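
The failure of the correction under non-classical error can be shown with a small change to the same design. The sketch below adds a hypothetical unobserved production budget that raises both the outcome and the detector's score, so that \(\operatorname{Cov}(u_i, \varepsilon_i) > 0\). The coefficients 0.8 and 0.7 and the noise scales are arbitrary assumptions.

```r
set.seed(11)
n <- 4000
beta_true <- 1.5

f_star <- rnorm(n)
budget <- rnorm(n)                            # hypothetical unobserved confounder
eps    <- 0.8 * budget + rnorm(n, 0, 0.6)     # budget raises the outcome directly
u      <- 0.7 * budget + rnorm(n, 0, 0.5)     # and inflates the detector's score

y     <- 2 + beta_true * f_star + eps
f_hat <- f_star + u

rho       <- var(f_star) / var(f_hat)         # reliability, as before
naive     <- coef(lm(y ~ f_hat))[["f_hat"]]
corrected <- naive / rho                      # no longer a valid correction

round(c(true = beta_true, naive = naive, corrected = corrected), 3)
```

In population terms the naive slope converges to \((1.5 + 0.56)/1.74 \approx 1.18\) and the "corrected" slope to about \(2.06\), overshooting the true value of 1.5 by \(\operatorname{Cov}(u, \varepsilon)/\operatorname{Var}(f^{\ast})\). The reliability ratio is estimated correctly, yet the correction misleads because its independence assumption fails.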

45.6 Applications in Advertising and Social Media

45.6.1 Advertising Creative

The richest application is the measurement and optimization of advertising creative. The visual content of an ad (its complexity, color, brand prominence, and pacing) drives whether viewers attend, remember, and respond, and image (and video-frame) features make these properties measurable at the scale of entire ad libraries. The construct lineage is clear: visual complexity and gaze (Pieters, Wedel, and Zhang 2007; Pieters, Wedel, and Batra 2010), the arousal-potential account of aesthetic preference (Berlyne 1960), and the visual sources of brand meaning (Batra 2019) all become features once a CNN is in the loop. Video advertising extends this from stills to sequences. The dynamics of attention within a video ad (when a brand appears, how scenes are cut, what holds the viewer) predict skipping and recall; Teixeira, Wedel, and Pieters (2010) show that inducing joy and surprise sustains attention to online video ads, and Teixeira, Picard, and el Kaliouby (2014) show that emotional and attentional dynamics jointly govern whether viewers watch through or zap. Per-frame image features are precisely how those dynamics are operationalized at scale, a theme picked up in the video-marketing literature (Rajaram and Manchanda 2020; Leyva and Sanchez 2021) and developed further in Chapter 13. The methodological caveat from Section 45.5 binds hardest here: observational creative comparisons confound the image with the campaign that bought it, so the credible designs hold the audience fixed and randomize the creative.

45.7 Pitfalls and Identification

Five recurring failures separate credible image-as-data work from decorative work.

The first is treating predictions as measurements, ignoring the generated-regressor problem of Section 45.5. Always report classifier validation and, where error may be non-classical, prefer experimental assignment of the image.

The second is dataset shift. A backbone pre-trained on ImageNet encodes the visual statistics of web photos circa its training era; applied to a domain it never saw—X-ray packaging, niche product categories, a new platform’s aesthetic—its features may be uninformative or biased. Validate on in-domain data, and fine-tune when the target distribution diverges.

The third is spurious cues and shortcut learning. CNNs latch onto whatever predicts the label in training, including artifacts—a watermark, a background, a camera type—that correlate with the outcome by accident. A classifier that appears to “detect luxury” may have learned to detect studio lighting. Probe what the model actually responds to before interpreting a feature substantively.
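
One simple probe, in the spirit of the occlusion experiments in Zeiler and Fergus (2014), is to mask one patch of the image at a time and record how much the model's score drops. The sketch below is a minimal version under stated assumptions: `occlusion_map` is a hypothetical helper, `score_fn` stands in for any fitted model that maps an image to a scalar, and `toy_score` is a deliberately flawed scorer that keys on a bright corner patch (a stand-in for a watermark).

```python
import numpy as np

def occlusion_map(image, score_fn, patch=4, fill=0.0):
    """Drop in score when each patch is masked.

    Large values mark regions the model relies on.
    """
    h, w = image.shape[:2]
    base = score_fn(image)
    heat = np.zeros((h // patch, w // patch))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            masked = image.copy()
            masked[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = fill
            heat[i, j] = base - score_fn(masked)
    return heat

def toy_score(img):
    """Hypothetical 'luxury' scorer that secretly reads only the top-left corner."""
    return float(img[:4, :4].mean())

# Synthetic 16x16 image with a bright top-left "watermark" (illustrative only).
img = np.full((16, 16), 0.2)
img[:4, :4] = 1.0

heat = occlusion_map(img, toy_score)
hot = tuple(int(k) for k in np.unravel_index(heat.argmax(), heat.shape))
```

Here all of the sensitivity concentrates in the corner patch, which would be a warning that the feature measures the artifact and not the product. With a real CNN the map is noisier, and the patch size and fill value are choices that should be varied.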

The fourth is conflating content and aesthetics, collapsing what is shown and how it looks into one score when the social-media evidence says their effects are separable. Keep the two families of features distinct and let the data assign their effects.

The fifth is ethics and privacy, especially for facial and demographic inference. Subgroup-correlated error and the inference of protected attributes raise fairness and legal exposure (Chapter 24); coarse, validated labels are preferable to fine-grained inferences, and audits by subgroup are not optional.
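
A subgroup audit can be as simple as tabulating the classifier's error rate within each group of a validation sample. The sketch below is illustrative only: `error_by_subgroup` is a hypothetical helper, and the labels, predictions, and group codes are made-up values.

```python
from collections import defaultdict

def error_by_subgroup(y_true, y_pred, groups):
    """Misclassification rate of a classifier within each subgroup."""
    wrong = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        wrong[g] += int(t != p)
    return {g: wrong[g] / total[g] for g in total}

# Made-up validation sample with two subgroups.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = error_by_subgroup(y_true, y_pred, groups)  # {"a": 0.25, "b": 0.5}
gap = max(rates.values()) - min(rates.values())    # 0.25
```

A gap of this kind means measurement error is correlated with group membership, so any downstream coefficient that also varies by group inherits the problem.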

Underlying all five is a single discipline borrowed from the rest of this book: fix the representation before looking at outcomes, validate the features against ground truth, prefer designs that randomize the image over those that merely observe it, and report the bias that remains. An image feature is a measurement, and like every measurement in Chapter 35 it must be shown to be reliable and valid before its coefficient can be believed. Images are one branch of the broader unstructured-data program (Balducci and Marinova 2018); for a survey focused on the visual channel and its methods, see Dzyabura, El Kihal, and Peres (2021).

Replication resources: image analytics

The hand-engineered color/complexity features and the frozen-embedding transfer-learning demonstration in this chapter run on open R/Python tooling (magick/imager in R; Pillow, OpenCV, and pretrained torchvision/timm backbones in Python). The canonical backbones have public reference implementations: ResNet at github.com/KaimingHe/deep-residual-networks (He et al. 2016), and ImageNet-pretrained AlexNet and Inception weights (Krizhevsky, Sutskever, and Hinton 2017; Szegedy et al. 2015) are distributed with the major deep-learning frameworks. SIFT (Lowe 2004) is available in OpenCV. The empirical marketing studies cited here (e.g., Liu, Dzyabura, and Mizik (2020), Hartmann et al. (2021a), Zhang et al. (2022)) generally rely on proprietary image corpora; confirm any code/data release on the article page rather than assuming one.

45.8 Key Takeaways

An image is a high-dimensional tensor (Equation 45.1); useful image analysis replaces raw pixels with learned or hand-engineered features whose dimensions a marketing theory can interpret.
The convolutional network (Equation 45.2) builds locality and translation structure in by construction; in marketing it is almost always used via transfer learning, with a frozen pre-trained backbone supplying an embedding to a small downstream model.
Marketing image features divide into brand (logo and perceptual identity, as in Liu, Dzyabura, and Mizik (2020)), aesthetic (beauty and arousal potential, grounded in Berlyne (1960)), and content (objects, scenes, faces) families.
Image features are generated regressors: classical error attenuates effects by the reliability ratio (Equation 45.5), and non-classical, outcome-correlated error biases them unpredictably. Validate, and prefer designs that randomize the image.
The strongest applications—advertising creative and social-media engagement—pair image features with experimental assignment of the visual stimulus, severing the confound between what an image looks like and the campaign that produced it.

Balducci, Bitty, and Detelina Marinova. 2018. “Unstructured Data in Marketing.” Journal of the Academy of Marketing Science 46 (4): 557–90. https://doi.org/10.1007/s11747-018-0581-x.

Batra, Rajeev. 2019. “Creating Brand Meaning: A Review and Research Agenda.” Journal of Consumer Psychology 29 (3): 535–46. https://doi.org/10.1002/jcpy.1122.

Berlyne, D. E. 1960. “Toward a Theory of Exploratory Behavior: I. Arousal and Drive.” In Conflict, Arousal, and Curiosity, 163–92. McGraw-Hill Book Company. https://doi.org/10.1037/11164-007.

Culotta, Aron, and Jennifer Cutler. 2016. “Mining Brand Perceptions from Twitter Social Networks.” Marketing Science 35 (3): 343–62. https://doi.org/10.1287/mksc.2015.0968.

Dzyabura, Daria, Siham El Kihal, and Renana Peres. 2021. “Image Analytics in Marketing.” In Handbook of Market Research, 665–92. Springer. https://doi.org/10.1007/978-3-319-57413-4_38.

Dzyabura, Daria, and Renana Peres. 2021. “Visual Elicitation of Brand Perception.” Journal of Marketing 85 (4): 44–66. https://doi.org/10.1177/0022242921996661.

Hagtvedt, Henrik, and Vanessa M. Patrick. 2014. “Consumer Response to Overstyling: Balancing Aesthetics and Functionality in Product Design.” Psychology and Marketing 31 (7): 518–25. https://doi.org/10.1002/mar.20713.

Hartmann, Jochen, Yannick Exner, and Samuel Domdey. 2025. “The Power of Generative Marketing: Can Generative AI Create or Reach Human-Level Visual Marketing Content?” International Journal of Research in Marketing 42 (1): 13–31. https://doi.org/10.1016/j.ijresmar.2024.09.002.

Hartmann, Jochen, Mark Heitmann, Christina Schamp, and Oded Netzer. 2021a. “The Power of Brand Selfies.” Journal of Marketing Research 58 (6): 1159–77. https://doi.org/10.1177/00222437211037258.

Hartmann, Jochen, Mark Heitmann, Christian Siebert, and Christina Schamp. 2023. “More Than a Feeling: Accuracy and Application of Sentiment Analysis.” International Journal of Research in Marketing 40 (1): 75–87.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Deep Residual Learning for Image Recognition.” In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–78. https://doi.org/10.1109/cvpr.2016.90.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2017. “ImageNet Classification with Deep Convolutional Neural Networks.” Communications of the ACM 60 (6): 84–90. https://doi.org/10.1145/3065386.

Labrecque, Lauren I., and George R. Milne. 2012. “Exciting Red and Competent Blue: The Importance of Color in Marketing.” Journal of the Academy of Marketing Science 40 (5): 711–27. https://doi.org/10.1007/s11747-010-0245-y.

Labrecque, Lauren I., Vanessa M. Patrick, and George R. Milne. 2013. “The Marketers’ Prismatic Palette: A Review of Color Research and Future Directions.” Psychology and Marketing 30 (2): 187–202. https://doi.org/10.1002/mar.20597.

Leyva, Roberto, and Victor Sanchez. 2021. “Video Memorability Prediction via Late Fusion of Deep Multi-Modal Features.” In 2021 IEEE International Conference on Image Processing (ICIP), 2488–92. IEEE.

Li, Yiyi, and Ying Xie. 2019. “Is a Picture Worth a Thousand Words? An Empirical Study of Image Content and Social Media Engagement.” Journal of Marketing Research 57 (1): 1–19. https://doi.org/10.1177/0022243719881113.

Liu, Liu, Daria Dzyabura, and Natalie Mizik. 2020. “Visual Listening In: Extracting Brand Image Portrayed on Social Media.” Marketing Science 39 (4): 669–86. https://doi.org/10.1287/mksc.2020.1226.

Lowe, David G. 2004. “Distinctive Image Features from Scale-Invariant Keypoints.” International Journal of Computer Vision 60 (2): 91–110. https://doi.org/10.1023/b:visi.0000029664.99615.94.

Luffarelli, Jonathan, Mudra Mukesh, and Ammara Mahmood. 2019. “Let the Logo Do the Talking: The Influence of Logo Descriptiveness on Brand Equity.” Journal of Marketing Research 56 (5): 862–78. https://doi.org/10.1177/0022243719845000.

Mahmood, Ammara, Jonathan Luffarelli, and Mudra Mukesh. 2019. “What’s in a Logo? The Impact of Complex Visual Cues in Equity Crowdfunding.” Journal of Business Venturing 34 (1): 41–62. https://doi.org/10.1016/j.jbusvent.2018.09.006.

McQuarrie, Edward F., and David Glen Mick. 1999. “Visual Rhetoric in Advertising: Text-Interpretive, Experimental, and Reader-Response Analyses.” Journal of Consumer Research 26 (1): 37–54. https://doi.org/10.1086/209549.

Pieters, Rik, Michel Wedel, and Rajeev Batra. 2010. “The Stopping Power of Advertising: Measures and Effects of Visual Complexity.” Journal of Marketing 74 (5): 48–60. https://doi.org/10.1509/jmkg.74.5.48.

Pieters, Rik, Michel Wedel, and Jie Zhang. 2007. “Optimal Feature Advertising Design Under Competitive Clutter.” Management Science 53 (11): 1815–28. https://doi.org/10.1287/mnsc.1070.0732.

Rajaram, Prashant, and Puneet Manchanda. 2020. “Video Influencers: Unboxing the Mystique.” arXiv Preprint arXiv:2012.12311.

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. “Going Deeper with Convolutions.” In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1–9. https://doi.org/10.1109/cvpr.2015.7298594.

Teixeira, Thales S., Michel Wedel, and Rik Pieters. 2010. “Moment-to-Moment Optimal Branding in TV Commercials: Preventing Avoidance by Pulsing.” Marketing Science 29 (5): 783–804. https://doi.org/10.1287/mksc.1100.0567.

Teixeira, Thales, Rosalind Picard, and Rana el Kaliouby. 2014. “Why, When, and How Much to Entertain Consumers in Advertisements? A Web-Based Facial Tracking Field Study.” Marketing Science 33 (6): 809–27. https://doi.org/10.1287/mksc.2014.0854.

Witte, Maximilian, Mark Heitmann, Jochen Hartmann, and Keno Tetzlaff. 2026. “Language of Images: Classifying Marketing Images with Transformers and Vision Language Models.” International Journal of Research in Marketing, January. https://doi.org/10.1016/j.ijresmar.2026.01.001.

Zeiler, Matthew D., and Rob Fergus. 2014. “Visualizing and Understanding Convolutional Networks.” In Computer Vision – ECCV 2014, 818–33. Lecture Notes in Computer Science. https://doi.org/10.1007/978-3-319-10590-1_53.

Zhang, Shunyuan, Dokyun Lee, Param Vir Singh, and Kannan Srinivasan. 2022. “What Makes a Good Image? Airbnb Demand Analytics Leveraging Interpretable Image Features.” Management Science 68 (8): 5644–66. https://doi.org/10.1287/mnsc.2021.4175.
