29 Data

Empirical marketing research is only as credible as the data behind it, and the data a researcher can obtain increasingly determine which questions are even askable. This chapter is a working guide to acquiring the raw material of marketing science—firm financials, advertising and engagement metrics, household expenditure, and the digital traces of consumer attention—and, just as importantly, to understanding what each source actually measures. Acquisition is never neutral: every dataset embeds a sampling frame, a measurement model, and a set of selection mechanisms that propagate into every downstream estimate. A panel that surveys 6,000 households quarterly, an index that rescales search volume to its own maximum, and an API that silently throttles after 1,400 calls each impose structure on what the analyst can learn.

We organize the material by source, but the through-line is measurement. For each source we state, as precisely as the source allows, the estimand—the population quantity of interest—and the observable the source actually delivers, because the gap between the two is where most empirical mistakes live. Search interest is not search volume; a relative index is not an absolute count; a name-based demographic prediction is a posterior probability, not a fact. Treating these distinctions casually is how plausible pipelines produce biased coefficients.

The chapter is also deliberately reproducible. Connection recipes, query patterns, and rescaling algorithms appear as runnable code so that a reader can move from “this dataset exists” to “I have it in a tidy frame” without reverse engineering undocumented APIs. Where a third-party platform’s own documentation is the authoritative reference—rate limits, OAuth scopes, schema—we point to the mechanism rather than reproduce ephemeral details. Throughout, the goal is the one stated in the style guide: intuition first, then the formal object, then code that runs.

Figure 29.1 situates the sources covered here along the two dimensions that matter most for research design: the unit of observation (firm, household, or individual consumer) and whether the data are behavioral (revealed in actions) or survey-elicited (reported by respondents).

flowchart TB
  subgraph FIRM["Firm-level"]
    WRDS["WRDS / Compustat<br/>financials, R&D, advertising"]
  end
  subgraph HH["Household-level"]
    CE["Consumer Expenditure Survey<br/>elicited expenditure (panel + diary)"]
  end
  subgraph IND["Individual / keyword-level"]
    GT["Google Trends<br/>relative search interest"]
    BD["Baidu Index<br/>search volume"]
    YT["YouTube<br/>views, comments, engagement"]
    NAME["Name-based inference<br/>gender / age / nationality"]
  end
  WRDS -->|"revealed (accounting)"| OUT["Research design:<br/>estimand vs. observable"]
  CE -->|"elicited (survey)"| OUT
  GT -->|"revealed (behavior)"| OUT
  BD -->|"revealed (behavior)"| OUT
  YT -->|"revealed (behavior)"| OUT
  NAME -->|"inferred (model)"| OUT

Figure 29.1: The data sources in this chapter, organized by unit of observation and by whether they capture revealed behavior or elicited reports. Behavioral traces are abundant and timely but noisy proxies for latent constructs; survey instruments measure constructs directly but are costly, lagged, and subject to reporting bias.

29.1 Firm Financial Data: WRDS

Most marketing-finance research (Chapter 23) and a large share of strategy work rest on standardized firm financials. Wharton Research Data Services (WRDS) is the dominant academic gateway to these: it federates Compustat (accounting fundamentals), CRSP (security prices and returns), I/B/E/S (analyst forecasts), and dozens of other libraries behind a single SQL interface. For the marketing researcher the attraction is concrete. Advertising expense (xad) and research-and-development expense (xrd) in Compustat’s fundamentals-annual file (funda) are the canonical proxies for a firm’s marketing and innovation intensity, and they join cleanly to market-based outcomes such as Tobin’s $q$ or abnormal returns used throughout Chapter 23.

The connection is a standard PostgreSQL session; WRDS exposes its warehouse over the wire, so the same DBI idioms used for any relational database apply. Credentials should never be hard-coded—read them from the environment so that the source file is safe to share.

Code

library(RPostgres)
library(tidyverse)

wrds <- dbConnect(
  Postgres(),
  host    = "wrds-pgdata.wharton.upenn.edu",
  port    = 9737,
  dbname  = "wrds",
  sslmode = "require",
  user    = Sys.getenv("wrds_user"),
  pass    = Sys.getenv("wrds_pass")
)

The warehouse is self-describing through the information_schema, which is the disciplined way to discover what is available rather than guessing table names. Three nested queries answer the three questions a new user has: which libraries (schemas) exist, which tables live in a chosen library, and which columns a chosen table holds.

Code

# Libraries (schemas) available to your subscription
dbGetQuery(wrds, "
  select distinct table_schema
  from information_schema.tables
  order by table_schema")

# Tables inside one library
dbGetQuery(wrds, "
  select distinct table_name
  from information_schema.columns
  where table_schema = 'comp_na_daily_all'
  order by table_name")

# Columns inside one table
dbGetQuery(wrds, "
  select column_name
  from information_schema.columns
  where table_schema = 'comp_na_daily_all'
    and table_name   = 'funda'
  order by column_name")

A representative extraction pulls advertising and R&D for a decade, then enriches each firm-year with its industry classification by joining on the firm identifier gvkey. Pushing the filters (fyear, non-null xad/xrd) into the WHERE clause rather than filtering in R keeps the transfer small; this matters when funda runs to millions of rows.

Code

fin <- dbGetQuery(wrds, "
  select gvkey, fyear, xad, xrd
  from comp_na_daily_all.funda
  where fyear between 2000 and 2010
    and xad is not null
    and xrd is not null") |>
  distinct()

industry <- dbGetQuery(wrds, "
  select gvkey, gind, gsubind, naics, sic
  from comp_na_daily_all.names")

panel <- fin |>
  left_join(industry, by = "gvkey")

head(panel)

Two cautions recur in practice and shape identification downstream. First, xad is reported advertising; under accounting rules many firms expense marketing without breaking it out, so missingness is non-random—it correlates with firm size, industry, and disclosure regime. Treating is null as “no advertising” rather than “not separately disclosed” silently selects the sample on a variable correlated with the outcome, the textbook recipe for selection bias. Second, funda ships in multiple formats (consolidated vs. parent-only, standardized vs. restated) keyed by the indfmt, datafmt, popsrc, and consol flags; the conventional screen for North American industrial firms is indfmt='INDL' and datafmt='STD' and popsrc='D' and consol='C', and omitting it double-counts firm-years. Neither caution is exotic, but both are invisible until an estimate looks wrong.

Note

Beyond WRDS, the Consumer Expenditure Survey (below) and commercial scanner panels (e.g., IRI, Nielsen) cover demand-side behavior that firm financials cannot. The general principle—match the unit of observation to the estimand—is developed in Chapter 35.

29.2 Local Document Stores: MongoDB

Not all marketing data are relational. Social-media posts, scraped reviews, and API responses arrive as nested, schema-flexible JSON, and a document database such as MongoDB stores them without forcing a rectangular schema up front. When ingesting at scale the first operational question is simply where the data physically live, which the running server reports through an administrative command issued in the mongo shell:

db.adminCommand("getCmdLineOpts")

The returned document includes the storage.dbPath, the on-disk location of the data files—useful for capacity planning, backups, and confirming that a large ingest is writing where you expect. The broader point is architectural: document stores trade the relational guarantees of WRDS-style sources (a fixed schema, referential integrity, declarative joins) for flexibility on write. That trade is right for heterogeneous, evolving web data and wrong for the tidy panels demanded by most econometric estimators, which is why a common pipeline lands raw captures in MongoDB and projects a flattened, analysis-ready frame into R or a relational store before modeling.

29.3 Search Interest: Google Trends

Aggregate search behavior is among the most useful behavioral signals in marketing because it is timely, granular, and a leading indicator of attention. The motivating regularity is that public interest and its institutional reflections move together but on different clocks: online search volume and news reporting are strongly correlated, whereas academic publishing lags public interest because of the delay inherent in peer review and publication (Nghiem et al. 2016). Read structurally, news media act as a conductor between the research community and the public, and online search is the most immediate barometer of the public’s side of that exchange.

29.3.1 What Google Trends Measures

Google Trends does not report search volume. It reports a relative search interest index, and the distinction is the single most important thing to understand about the source. For a keyword $k$, a geography $g$, and a chosen time window, let $v_{k,t}$ be the underlying volume of queries for $k$ in period $t$ (itself a sampled, privacy-filtered quantity). Trends returns the rescaled series

\[ I_{k,t} \;=\; 100 \times \frac{v_{k,t}}{\displaystyle\max_{s \in W} v_{k,s}}, \tag{29.1}\]

where $W$ is the requested window. Three consequences follow directly from Equation 29.1 and trip up the unwary.

First, the index is window-dependent. The denominator is the maximum within the requested window, so the same keyword queried over two different windows generally returns different index values for the overlapping dates—the series is not anchored to any fixed scale. Second, the index is co-normalized across the keywords in a single request: when several keywords are submitted together they share one denominator (the global maximum across all of them), which is exactly what makes cross-keyword comparison valid within a request and invalid across separate requests. Third, the data are sampled, so repeated identical queries return slightly different indices; for stable estimates one averages several draws.

The window also dictates the granularity Google will return, which is fixed by the length of the requested span:

Table 29.1: Native temporal granularity returned by Google Trends as a function of the requested window length. Longer windows are silently coarsened, which is why reconstructing a long daily series requires the stitching procedure in Section 29.3.2.

Data granularity	Window length
Hourly	Last 7 days
Daily	Less than 9 months
Weekly	9 months to 5 years
Monthly	Longer than 5 years

A first pass pulls weekly relative interest for two brands over a multi-year window. The gtrendsR package wraps the public endpoint; interest_over_time holds the index series.

Code

library(gtrendsR)
library(tidyverse)

trends <- gtrends(
  keyword = c("7eleven", "3m"),
  geo     = "US",
  time    = "2010-01-01 2012-01-30",  # earliest supported start is 2004
  gprop   = "web"                       # web | news | images | froogle | youtube
)

time_trend <- trends$interest_over_time   # weekly index for this window

Because Equation 29.1 produces a relative, sampled series, plotting it raw shows both signal and seasonal noise. A smoother separates the trend component from within-year seasonality without committing to a parametric seasonal model.

Code

library(ggplot2)

ggplot(time_trend, aes(x = date, y = hits, group = keyword, colour = keyword)) +
  geom_line(alpha = 0.4) +
  geom_smooth(span = 0.5, se = FALSE) +
  labs(x = "Time", y = "Relative interest (weekly)",
       title = "Google search interest", colour = NULL) +
  theme_bw() +
  theme(legend.position = "bottom")

29.3.2 Reconstructing a Long Daily Series

The granularity rule in Table 29.1 creates a real problem: researchers often need daily resolution over multi-year spans, but a multi-year request returns only weekly (or monthly) data. The fix exploits the co-normalization property of Equation 29.1. Two reconstruction strategies exist, and they differ in how they reconcile the high-frequency and low-frequency series.

The scaling (overlapping) method is the more reliable of the two. It queries short windows at daily resolution, queries the full span at lower resolution to obtain period weights, and rescales the daily pieces so that each month’s daily indices are multiplied by that month’s relative weight. Formally, let $d_{k,t}$ be the daily index obtained from a short within-window query covering month $m(t)$, and let $w_{k,m}$ be the month-$m$ weight derived from the long-window query (the month’s aggregate index divided by the maximum monthly aggregate). The reconstructed daily series is the product

\[ \hat{v}_{k,t} \;=\; d_{k,t} \times w_{k,m(t)}, \tag{29.2}\]

which restores a common scale across months while preserving within-month daily shape. The concatenation (normalization) method instead stitches one-month daily queries end to end and normalizes them against the weekly series; empirical comparisons favor the scaling method, which is why Equation 29.2 is the recommended estimator.¹ This reconstruction underlies published applications that require daily search data over long horizons (Risteski and Davcev 2014), and related approaches to extracting daily volume are documented elsewhere (Y. Liu et al. 2019).

The function below implements Equation 29.2 end to end: it pulls a single aggregated query to derive monthly multipliers mult, loops over months to pull daily data, and joins the two to produce reconstructed daily estimates est_hits. Note the rate-limit budget of roughly 1,400 requests per 4 hours; a long reconstruction must respect it or the loop will start returning empty frames.

Code

library(gtrendsR)
library(tidyverse)
library(lubridate)

get_daily_gtrend <- function(keyword = c("7eleven", "3M"),
                             geo  = "US",
                             from = "2013-01-01",
                             to   = "2013-02-15") {
  # Disallow a 'to' date in the current (incomplete) month
  if (ymd(to) >= floor_date(Sys.Date(), "month")) {
    to <- floor_date(ymd(to), "month") - days(1)
    if (to < from) stop("`to` date in the current month is not allowed")
  }

  # Long query -> monthly multipliers w_{k,m}
  aggregated <- gtrends(keyword = keyword, geo = geo, time = paste(from, to))
  if (is.null(aggregated$interest_over_time)) {
    message("No data in Google Trends for this query.")
    return(invisible(NULL))
  }

  mult_m <- aggregated$interest_over_time |>
    mutate(hits = as.integer(ifelse(hits == "<1", "0", hits))) |>
    group_by(month = floor_date(date, "month"), keyword) |>
    summarise(hits = sum(hits), .groups = "drop") |>
    mutate(ym = format(month, "%Y-%m"), mult = hits / max(hits)) |>
    select(month, ym, keyword, mult)

  # Per-month daily queries -> d_{k,t}
  windows <- tibble(
    s = seq(ymd(from), ymd(to), by = "month"),
    e = seq(ymd(from), ymd(to), by = "month") + months(1) - days(1)
  )

  raw_daily <- map2_dfr(windows$s, windows$e, function(s, e) {
    curr <- gtrends(keyword, geo = geo, time = paste(s, e))
    if (is.null(curr$interest_over_time)) return(tibble())
    curr$interest_over_time
  })

  trend_d <- raw_daily |>
    transmute(date, keyword,
              ym   = format(date, "%Y-%m"),
              hits = as.integer(ifelse(hits == "<1", "0", hits)))

  # Reconstruct: est_hits = d_{k,t} * w_{k,m(t)}   (eq-gtrends-scaling)
  trend_d |>
    left_join(mult_m, by = c("ym", "keyword")) |>
    mutate(est_hits = hits * mult, date = as.Date(date)) |>
    select(date, keyword, est_hits)
}

daily_trend <- get_daily_gtrend(
  keyword = c("7eleven", "3M"), geo = "US",
  from = "2013-01-01", to = "2013-02-01")
head(daily_trend)

29.3.3 Absolute Volume

When the application genuinely requires absolute counts rather than the relative index of Equation 29.1—for example, to size a market rather than track its trajectory—the relative-interest endpoint is the wrong source. Google publishes a separate top-terms dataset through BigQuery that exposes absolute search figures, and that warehouse, not the Trends index, is the appropriate input for level estimates.

29.4 Search Volume in China: Baidu Index

Outside the markets Google serves, the analogous behavioral signal comes from the dominant local search engine. In China that is Baidu, and the Baidu Index plays the role Google Trends plays elsewhere. Because Baidu exposes no comparable public API for bulk extraction, researchers have built crawlers to recover its search-volume series; P. J. Liu et al. (2019) document a Java-based spider for exactly this purpose. The same measurement caveats apply—an index is a rescaled, sampled proxy for latent interest, not a count—so the estimand-versus-observable discipline of Section 29.3 carries over directly.

29.5 Video Engagement: YouTube

Video is now a primary advertising and engagement channel, and YouTube exposes rich behavioral data—views, likes, comments, and watch metrics—at the video, playlist, and channel level. Two access paths exist, and the choice between them is a governance decision rather than a convenience one.

The first path is the public Data API, which serves publicly visible statistics for any video or channel using a simple API key. The second is the Analytics and Reporting APIs, which serve owner-private metrics (impressions, audience retention, revenue) and therefore require OAuth authorization by the channel or content owner that owns the requested data. Table 29.2 contrasts the two; the operative rule is that anything a viewer cannot see requires the owner’s delegated consent.

Table 29.2: Two paths to YouTube data. The boundary is consent: public statistics need only a key, whereas owner-private analytics require OAuth authorization by the data owner.

Dimension	Data API (key)	Analytics / Reporting (OAuth)
Auth	API key	OAuth 2.0, owner-delegated
Scope	Public statistics	Owner-private metrics
Typical fields	views, likes, comments	impressions, retention, revenue
Who can call	anyone with a key	the channel/content owner
Use case	competitive/market scans	first-party performance analytics

29.5.1 OAuth Path via `tuber`

The tuber package wraps the authenticated endpoints. Authorization is a one-time OAuth handshake with credentials provisioned in the Google Cloud console; thereafter the session token authorizes calls for video statistics, details, captions, search, and comment threads.

Code

library(tuber)

# Provision app_id / app_secret in the Google Cloud console, then:
yt_oauth(app_id = "YOUR_APP_ID", app_secret = "YOUR_APP_SECRET")

get_stats(video_id = "N708P-A45D0")          # public statistics
get_video_details(video_id = "N708P-A45D0")  # snippet metadata
get_captions(video_id = "yJXTXN4xrI8")       # caption track
yt_search(term = "test")                      # search the corpus
get_comment_threads(c(video_id = "N708P-A45D0"))  # top-level comments
get_all_comments(video_id = "a-UQz7fqR3w")    # comments + replies

A common research task is to assemble a panel of every video in a channel. YouTube models a channel’s uploads as a hidden playlist, so the recipe is: read the channel’s relatedPlaylists$uploads id, page through that playlist’s items to collect video ids, then map the statistics call over the ids and bind the results into a frame.

Code

chan <- list_channel_resources(
  filter = c(channel_id = "UCT5Cx1l4IS3wHkJXNyuj4TA"),
  part   = "contentDetails")

uploads_id <- chan$items[[1]]$contentDetails$relatedPlaylists$uploads
vids       <- get_playlist_items(filter = c(playlist_id = uploads_id))
vid_ids    <- as.vector(vids$contentDetails.videoId)

stats_df <- do.call(rbind, lapply(vid_ids, function(id) data.frame(get_stats(id))))
head(stats_df)

29.5.2 Key Path via the Data API

When only public statistics are needed, the raw REST endpoints are lighter than an OAuth session. The pattern is a small set of helper functions that build the request URL, parse the JSON, and return a tidy frame—one helper per resource type (video, playlist, channel).

Code

library(jsonlite)
library(dplyr)

API_key <- Sys.getenv("YOUTUBE_API_KEY")   # never hard-code keys

video_stats <- function(video_id, key) {
  url <- paste0("https://www.googleapis.com/youtube/v3/videos",
                "?part=snippet,statistics&id=", video_id, "&key=", key)
  res <- fromJSON(url)
  data.frame(
    name  = res$items$snippet$channelTitle,
    res$items$statistics,
    title = res$items$snippet$title,
    date  = res$items$snippet$publishedAt
  )
}

channel_stats <- function(channel_id, key) {
  url <- paste0("https://www.googleapis.com/youtube/v3/channels",
                "?part=snippet,contentDetails,statistics&id=", channel_id,
                "&key=", key)
  res <- fromJSON(url)
  data.frame(
    name = res$items$snippet$title,
    res$items$statistics,
    uploads = res$items$contentDetails$relatedPlaylists$uploads
  )
}

These frames feed directly into comparative visualizations—total channel views, or the time path of comments and likes across competing channels—using the ggplot2 idioms already shown for Google Trends. Because the engagement counts (viewCount, commentCount) arrive as character strings, they must be coerced to numeric before plotting.

Warning

A requests-and-BeautifulSoup scraper that pages through search-result HTML can retrieve video ids without an API key, but scraping rendered pages is brittle, violates many terms of service, and breaks whenever the front-end markup changes. Prefer the official endpoints; treat HTML scraping as a last resort for prototyping, never production.

29.6 Household Expenditure: The Consumer Expenditure Survey

When the research question concerns what households buy—budget shares, category demand, the income elasticity of a product class—the canonical U.S. source is the Consumer Expenditure (CE) Survey administered by the Bureau of Labor Statistics. Its sampling design is worth understanding in detail, because the survey’s structure dictates which estimands it can support and at what frequency.

The CE’s unit of observation is the consumer unit, defined as a single person or a group of persons who live together and share responsibility for most major expenditures.

A consumer unit is a single person, or a group of persons who live together and who share responsibilities for most major expenditures.

Within each unit, the reference person is the first individual named when the respondent answers, “Who is responsible for owning or renting this home?”—a convention that fixes the household head deterministically and makes cross-survey linkage well defined.

The survey is not one instrument but two complementary ones, a design choice that follows directly from a measurement trade-off between recall accuracy and coverage. Large, infrequent purchases are remembered well but happen rarely; small, frequent purchases are forgotten quickly but dominate the budget in count. No single instrument measures both well, so the CE splits them:

The Interview Survey captures big-ticket and recurring expenditures, plus global estimates for some categories. It is a rotating-panel survey: each consumer unit is interviewed once every three months for four quarters, with roughly 6,000 units in the sample each quarter. The panel structure supports within-unit change over the year, but the four-quarter cap means it cannot follow a household across years.
The Diary Survey captures small-ticket, frequently purchased items that recall-based interviewing would miss. It is independent of the interview component and collects roughly 14,000 diaries from participating families each year.

Published tabulations come in several time formats: annual tables run January–December (available since 1984) and, in a fiscal variant, July–June (since 2013); two-year tables span January of year one through December of year two (since 1986). Beyond the standard tables, BLS produces experimental research tabulations—covering, for instance, income detail or generational groups such as millennials and Gen X—available on request. The practical implication for design is that the rotating four-quarter panel is the right frame for within-year, within-household questions, while cross-year dynamics require pooling repeated cross-sections rather than following the same units.

29.7 Inferring Demographics from Names

A recurring need in marketing research is to attach demographic attributes—gender, age, nationality—to records that carry only a name, as when a researcher has a roster of customers, reviewers, or social-media authors but no self-reported demographics. A family of services performs exactly this inference: gender from genderize.io, age from agify.io, and nationality from nationalize.io. Each returns, for a queried name, a predicted attribute together with a confidence.

It is essential to read these outputs as what they are: posterior probabilities from a name-frequency classifier, not ground-truth labels. The underlying logic is a simple Bayesian classifier. Let $n$ be an observed name and $g$ a candidate attribute value (say, gender). The services effectively report

\[ \Pr(g \mid n) \;=\; \frac{\Pr(n \mid g)\,\Pr(g)}{\displaystyle\sum_{g'} \Pr(n \mid g')\,\Pr(g')}, \tag{29.3}\]

estimated from large registries of names tagged with the attribute, where $\Pr(n \mid g)$ is the frequency of name $n$ among individuals with value $g$ and $\Pr(g)$ the base rate of $g$ in the reference population. Three properties of Equation 29.3 govern responsible use. The prediction inherits the reference population’s composition: a classifier trained predominantly on one region’s registries mispredicts systematically for names common elsewhere, so the base rate $\Pr(g)$ is not universal. The reported confidence is the posterior itself, so unisex or rare names yield diffuse, low-confidence posteriors that should not be hardened into deterministic labels. And because the inferred attribute is a generated regressor, using it as a covariate in a downstream model imports classification error into that model—the resulting attenuation and the need to propagate the first-stage uncertainty are treated in Chapter 35. Used with those caveats, name-based inference is a cheap way to recover coarse demographic structure; used naively, it manufactures measurement error correlated with name origin.

29.8 Pitfalls and Identification

The sources in this chapter differ in unit, frequency, and provenance, but the threats to valid inference share a small number of structural causes. Figure 29.2 maps each common pitfall to the source where it bites hardest and to its remedy.

flowchart LR
  A["Non-random missingness<br/>(unreported xad in Compustat)"] --> A2["Model disclosure;<br/>do not treat NULL as zero"]
  B["Relative, window-dependent index<br/>(Google Trends, Baidu)"] --> B2["Rescale via overlapping<br/>method (eq-gtrends-scaling)"]
  C["Sampling noise in indices<br/>(Trends draws)"] --> C2["Average repeated queries"]
  D["Generated-regressor error<br/>(name-based demographics)"] --> D2["Propagate first-stage<br/>uncertainty"]
  E["Recall vs. coverage trade-off<br/>(CE survey)"] --> E2["Match instrument to estimand<br/>(interview vs. diary)"]
  F["Rate limits / TOS<br/>(APIs, scrapers)"] --> F2["Budget requests; prefer<br/>official endpoints"]

Figure 29.2: Recurring threats to validity across the chapter’s data sources and their remedies. Each threat is a gap between the estimand and the observable that the source actually delivers.

The unifying discipline is the one stated at the outset: name the estimand, name the observable the source delivers, and treat the difference as a modeling problem rather than a nuisance. Reported advertising is not advertising; a relative index is not a volume; a name-based prediction is not a demographic. Each gap is tractable once it is made explicit, and ruinous when it is not.

29.9 Key Takeaways

Match the unit of observation to the estimand. WRDS delivers firm-level financials, the CE Survey household expenditure, and search/engagement APIs individual behavioral traces; no source substitutes for another (Figure 29.1).
Reported is not measured. Compustat’s xad records disclosed advertising; missingness is non-random and treating NULL as zero selects the sample on a variable correlated with outcomes (Section 29.1).
A relative index is not a volume. Google Trends rescales to its window’s maximum (Equation 29.1); long daily series must be reconstructed by the overlapping-scaling method (Equation 29.2), and absolute counts require a different source entirely.
Survey design encodes a recall–coverage trade-off. The CE splits big-ticket recall (interview panel) from frequent-purchase coverage (diary), so the right instrument depends on the category (Section 29.6).
Name-based demographics are posteriors, not labels. They follow a Bayesian name-frequency classifier (Equation 29.3) whose error is correlated with name origin and must be propagated downstream (Section 29.7).
Respect the platform contract. Owner-private metrics require OAuth, public ones a key (Table 29.2); honor rate limits and prefer official endpoints over brittle scrapers.

Liu, Peggy J., Kelly L. Haws, Karen Scherr, Joseph P. Redden, James R. Bettman, and Gavan J. Fitzsimons. 2019. “The Primacy of “What” over “How Much”: How Type and Quantity Shape Healthiness Perceptions of Food Portions.” Management Science 65 (7): 3353–81. https://doi.org/10.1287/mnsc.2018.3098.

Liu, Ying, Geng Peng, Lanyi Hu, Jichang Dong, and Qingqing Zhang. 2019. “Using Google Trends and Baidu Index to Analyze the Impacts of Disaster Events on Company Stock Prices.” Industrial Management & Data Systems 120 (2): 350–65. https://doi.org/10.1108/imds-03-2019-0190.

Nghiem, Le T. P., Sarah K. Papworth, Felix K. S. Lim, and Luis R. Carrasco. 2016. “Analysis of the Capacity of Google Trends to Measure Interest in Conservation Topics and the Role of Online News.” Edited by Zhong-Ke Gao. PLOS ONE 11 (3): e0152802. https://doi.org/10.1371/journal.pone.0152802.

Risteski, Dimche, and Danco Davcev. 2014. “Can We Use Daily Internet Search Query Data to Improve Predicting Power of EGARCH Models for Financial Time Series Volatility?” International Conference on Computer Science and Information Systems (ICSIS’2014) Oct 17-18, 2014 Dubai (UAE), October. https://doi.org/10.15242/iie.e1014066.

The overlapping-scaling procedure follows the approach popularized by Alex Dyachenko; the concatenation variant mirrors the dailydata routine in the pytrends library. Both predate—and are wrapped by—the function below.↩︎

# Data {#sec-data} Empirical marketing research is only as credible as the data behind it, and the data a researcher can obtain increasingly determine which questions are even askable. This chapter is a working guide to *acquiring* the raw material of marketing science—firm financials, advertising and engagement metrics, household expenditure, and the digital traces of consumer attention—and, just as importantly, to understanding what each source actually measures. Acquisition is never neutral: every dataset embeds a sampling frame, a measurement model, and a set of selection mechanisms that propagate into every downstream estimate. A panel that surveys 6,000 households quarterly, an index that rescales search volume to its own maximum, and an API that silently throttles after 1,400 calls each impose structure on what the analyst can learn. We organize the material by source, but the through-line is measurement. For each source we state, as precisely as the source allows, the *estimand*—the population quantity of interest—and the *observable* the source actually delivers, because the gap between the two is where most empirical mistakes live. Search interest is not search volume; a relative index is not an absolute count; a name-based demographic prediction is a posterior probability, not a fact. Treating these distinctions casually is how plausible pipelines produce biased coefficients. The chapter is also deliberately reproducible. Connection recipes, query patterns, and rescaling algorithms appear as runnable code so that a reader can move from "this dataset exists" to "I have it in a tidy frame" without reverse engineering undocumented APIs. Where a third-party platform's own documentation is the authoritative reference—rate limits, OAuth scopes, schema—we point to the mechanism rather than reproduce ephemeral details. Throughout, the goal is the one stated in the style guide: intuition first, then the formal object, then code that runs. @fig-data-landscape situates the sources covered here along the two dimensions that matter most for research design: the unit of observation (firm, household, or individual consumer) and whether the data are *behavioral* (revealed in actions) or *survey-elicited* (reported by respondents). ```{mermaid} %%| label: fig-data-landscape %%| fig-cap: "The data sources in this chapter, organized by unit of observation and by whether they capture revealed behavior or elicited reports. Behavioral traces are abundant and timely but noisy proxies for latent constructs; survey instruments measure constructs directly but are costly, lagged, and subject to reporting bias." flowchart TB subgraph FIRM["Firm-level"] WRDS["WRDS / Compustat financials, R&D, advertising"] end subgraph HH["Household-level"] CE["Consumer Expenditure Survey elicited expenditure (panel + diary)"] end subgraph IND["Individual / keyword-level"] GT["Google Trends relative search interest"] BD["Baidu Index search volume"] YT["YouTube views, comments, engagement"] NAME["Name-based inference gender / age / nationality"] end WRDS -->|"revealed (accounting)"| OUT["Research design: estimand vs. observable"] CE -->|"elicited (survey)"| OUT GT -->|"revealed (behavior)"| OUT BD -->|"revealed (behavior)"| OUT YT -->|"revealed (behavior)"| OUT NAME -->|"inferred (model)"| OUT ``` ## Firm Financial Data: WRDS {#sec-data-wrds} Most marketing-finance research (@sec-marketing-finance) and a large share of strategy work rest on standardized firm financials. Wharton Research Data Services (WRDS) is the dominant academic gateway to these: it federates Compustat (accounting fundamentals), CRSP (security prices and returns), I/B/E/S (analyst forecasts), and dozens of other libraries behind a single SQL interface. For the marketing researcher the attraction is concrete. Advertising expense (`xad`) and research-and-development expense (`xrd`) in Compustat's fundamentals-annual file (`funda`) are the canonical proxies for a firm's marketing and innovation intensity, and they join cleanly to market-based outcomes such as Tobin's $q$ or abnormal returns used throughout @sec-marketing-finance. The connection is a standard PostgreSQL session; WRDS exposes its warehouse over the wire, so the same `DBI` idioms used for any relational database apply. Credentials should never be hard-coded—read them from the environment so that the source file is safe to share. ```{r wrds-connect, message=FALSE, eval=FALSE} library(RPostgres) library(tidyverse) wrds <- dbConnect( Postgres(), host = "wrds-pgdata.wharton.upenn.edu", port = 9737, dbname = "wrds", sslmode = "require", user = Sys.getenv("wrds_user"), pass = Sys.getenv("wrds_pass") ) ``` The warehouse is self-describing through the `information_schema`, which is the disciplined way to discover what is available rather than guessing table names. Three nested queries answer the three questions a new user has: which *libraries* (schemas) exist, which *tables* live in a chosen library, and which *columns* a chosen table holds. ```{r wrds-explore, eval=FALSE} # Libraries (schemas) available to your subscription dbGetQuery(wrds, " select distinct table_schema from information_schema.tables order by table_schema") # Tables inside one library dbGetQuery(wrds, " select distinct table_name from information_schema.columns where table_schema = 'comp_na_daily_all' order by table_name") # Columns inside one table dbGetQuery(wrds, " select column_name from information_schema.columns where table_schema = 'comp_na_daily_all' and table_name = 'funda' order by column_name") ``` A representative extraction pulls advertising and R&D for a decade, then enriches each firm-year with its industry classification by joining on the firm identifier `gvkey`. Pushing the filters (`fyear`, non-null `xad`/`xrd`) into the `WHERE` clause rather than filtering in R keeps the transfer small; this matters when `funda` runs to millions of rows. ```{r wrds-extract, eval=FALSE} fin <- dbGetQuery(wrds, " select gvkey, fyear, xad, xrd from comp_na_daily_all.funda where fyear between 2000 and 2010 and xad is not null and xrd is not null") |> distinct() industry <- dbGetQuery(wrds, " select gvkey, gind, gsubind, naics, sic from comp_na_daily_all.names") panel <- fin |> left_join(industry, by = "gvkey") head(panel) ``` Two cautions recur in practice and shape identification downstream. First, `xad` is *reported* advertising; under accounting rules many firms expense marketing without breaking it out, so missingness is non-random—it correlates with firm size, industry, and disclosure regime. Treating `is null` as "no advertising" rather than "not separately disclosed" silently selects the sample on a variable correlated with the outcome, the textbook recipe for selection bias. Second, `funda` ships in multiple *formats* (consolidated vs. parent-only, standardized vs. restated) keyed by the `indfmt`, `datafmt`, `popsrc`, and `consol` flags; the conventional screen for North American industrial firms is `indfmt='INDL' and datafmt='STD' and popsrc='D' and consol='C'`, and omitting it double-counts firm-years. Neither caution is exotic, but both are invisible until an estimate looks wrong. ::: {.callout-note} Beyond WRDS, the **Consumer Expenditure Survey** (below) and commercial scanner panels (e.g., IRI, Nielsen) cover demand-side behavior that firm financials cannot. The general principle—match the unit of observation to the estimand—is developed in @sec-measurement-scales. ::: ## Local Document Stores: MongoDB {#sec-data-mongo} Not all marketing data are relational. Social-media posts, scraped reviews, and API responses arrive as nested, schema-flexible JSON, and a document database such as MongoDB stores them without forcing a rectangular schema up front. When ingesting at scale the first operational question is simply *where the data physically live*, which the running server reports through an administrative command issued in the `mongo` shell: ``` db.adminCommand("getCmdLineOpts") ``` The returned document includes the `storage.dbPath`, the on-disk location of the data files—useful for capacity planning, backups, and confirming that a large ingest is writing where you expect. The broader point is architectural: document stores trade the relational guarantees of WRDS-style sources (a fixed schema, referential integrity, declarative joins) for flexibility on write. That trade is right for heterogeneous, evolving web data and wrong for the tidy panels demanded by most econometric estimators, which is why a common pipeline *lands* raw captures in MongoDB and *projects* a flattened, analysis-ready frame into R or a relational store before modeling. ## Search Interest: Google Trends {#sec-data-gtrends} Aggregate search behavior is among the most useful behavioral signals in marketing because it is timely, granular, and a leading indicator of attention. The motivating regularity is that public interest and its institutional reflections move together but on different clocks: online search volume and news reporting are strongly correlated, whereas academic publishing lags public interest because of the delay inherent in peer review and publication [@nghiem2016]. Read structurally, news media act as a conductor between the research community and the public, and online search is the most immediate barometer of the public's side of that exchange. ### What Google Trends Measures Google Trends does **not** report search volume. It reports a *relative search interest index*, and the distinction is the single most important thing to understand about the source. For a keyword $k$, a geography $g$, and a chosen time window, let $v_{k,t}$ be the underlying volume of queries for $k$ in period $t$ (itself a sampled, privacy-filtered quantity). Trends returns the rescaled series $$ I_{k,t} \;=\; 100 \times \frac{v_{k,t}}{\displaystyle\max_{s \in W} v_{k,s}}, $$ {#eq-gtrends-index} where $W$ is the requested window. Three consequences follow directly from @eq-gtrends-index and trip up the unwary. First, the index is **window-dependent**. The denominator is the maximum *within the requested window*, so the same keyword queried over two different windows generally returns different index values for the overlapping dates—the series is not anchored to any fixed scale. Second, the index is **co-normalized across the keywords in a single request**: when several keywords are submitted together they share one denominator (the global maximum across all of them), which is exactly what makes cross-keyword comparison valid *within* a request and invalid *across* separate requests. Third, the data are **sampled**, so repeated identical queries return slightly different indices; for stable estimates one averages several draws. The window also dictates the *granularity* Google will return, which is fixed by the length of the requested span: | Data granularity | Window length | |------------------|------------------------------| | Hourly | Last 7 days | | Daily | Less than 9 months | | Weekly | 9 months to 5 years | | Monthly | Longer than 5 years | : Native temporal granularity returned by Google Trends as a function of the requested window length. Longer windows are silently coarsened, which is why reconstructing a long daily series requires the stitching procedure in @sec-data-gtrends-daily. {#tbl-gtrends-granularity} A first pass pulls weekly relative interest for two brands over a multi-year window. The `gtrendsR` package wraps the public endpoint; `interest_over_time` holds the index series. ```{r gtrends-basic, eval=FALSE} library(gtrendsR) library(tidyverse) trends <- gtrends( keyword = c("7eleven", "3m"), geo = "US", time = "2010-01-01 2012-01-30", # earliest supported start is 2004 gprop = "web" # web | news | images | froogle | youtube ) time_trend <- trends$interest_over_time # weekly index for this window ``` Because @eq-gtrends-index produces a relative, sampled series, plotting it raw shows both signal and seasonal noise. A smoother separates the trend component from within-year seasonality without committing to a parametric seasonal model. ```{r gtrends-plot, eval=FALSE} library(ggplot2) ggplot(time_trend, aes(x = date, y = hits, group = keyword, colour = keyword)) + geom_line(alpha = 0.4) + geom_smooth(span = 0.5, se = FALSE) + labs(x = "Time", y = "Relative interest (weekly)", title = "Google search interest", colour = NULL) + theme_bw() + theme(legend.position = "bottom") ``` ### Reconstructing a Long Daily Series {#sec-data-gtrends-daily} The granularity rule in @tbl-gtrends-granularity creates a real problem: researchers often need *daily* resolution over *multi-year* spans, but a multi-year request returns only weekly (or monthly) data. The fix exploits the co-normalization property of @eq-gtrends-index. Two reconstruction strategies exist, and they differ in how they reconcile the high-frequency and low-frequency series. The **scaling (overlapping) method** is the more reliable of the two. It queries short windows at daily resolution, queries the full span at lower resolution to obtain period weights, and rescales the daily pieces so that each month's daily indices are multiplied by that month's relative weight. Formally, let $d_{k,t}$ be the daily index obtained from a short within-window query covering month $m(t)$, and let $w_{k,m}$ be the month-$m$ weight derived from the long-window query (the month's aggregate index divided by the maximum monthly aggregate). The reconstructed daily series is the product $$ \hat{v}_{k,t} \;=\; d_{k,t} \times w_{k,m(t)}, $$ {#eq-gtrends-scaling} which restores a common scale across months while preserving within-month daily shape. The **concatenation (normalization) method** instead stitches one-month daily queries end to end and normalizes them against the weekly series; empirical comparisons favor the scaling method, which is why @eq-gtrends-scaling is the recommended estimator.[^gtrends-attrib] This reconstruction underlies published applications that require daily search data over long horizons [@risteski2014], and related approaches to extracting daily volume are documented elsewhere [@liugoogle2019]. [^gtrends-attrib]: The overlapping-scaling procedure follows the approach popularized by Alex Dyachenko; the concatenation variant mirrors the `dailydata` routine in the `pytrends` library. Both predate—and are wrapped by—the function below. The function below implements @eq-gtrends-scaling end to end: it pulls a single aggregated query to derive monthly multipliers `mult`, loops over months to pull daily data, and joins the two to produce reconstructed daily estimates `est_hits`. Note the rate-limit budget of roughly **1,400 requests per 4 hours**; a long reconstruction must respect it or the loop will start returning empty frames. ```{r gtrends-daily, eval=FALSE} library(gtrendsR) library(tidyverse) library(lubridate) get_daily_gtrend <- function(keyword = c("7eleven", "3M"), geo = "US", from = "2013-01-01", to = "2013-02-15") { # Disallow a 'to' date in the current (incomplete) month if (ymd(to) >= floor_date(Sys.Date(), "month")) { to <- floor_date(ymd(to), "month") - days(1) if (to < from) stop("`to` date in the current month is not allowed") } # Long query -> monthly multipliers w_{k,m} aggregated <- gtrends(keyword = keyword, geo = geo, time = paste(from, to)) if (is.null(aggregated$interest_over_time)) { message("No data in Google Trends for this query.") return(invisible(NULL)) } mult_m <- aggregated$interest_over_time |> mutate(hits = as.integer(ifelse(hits == "<1", "0", hits))) |> group_by(month = floor_date(date, "month"), keyword) |> summarise(hits = sum(hits), .groups = "drop") |> mutate(ym = format(month, "%Y-%m"), mult = hits / max(hits)) |> select(month, ym, keyword, mult) # Per-month daily queries -> d_{k,t} windows <- tibble( s = seq(ymd(from), ymd(to), by = "month"), e = seq(ymd(from), ymd(to), by = "month") + months(1) - days(1) ) raw_daily <- map2_dfr(windows$s, windows$e, function(s, e) { curr <- gtrends(keyword, geo = geo, time = paste(s, e)) if (is.null(curr$interest_over_time)) return(tibble()) curr$interest_over_time }) trend_d <- raw_daily |> transmute(date, keyword, ym = format(date, "%Y-%m"), hits = as.integer(ifelse(hits == "<1", "0", hits))) # Reconstruct: est_hits = d_{k,t} * w_{k,m(t)} (eq-gtrends-scaling) trend_d |> left_join(mult_m, by = c("ym", "keyword")) |> mutate(est_hits = hits * mult, date = as.Date(date)) |> select(date, keyword, est_hits) } daily_trend <- get_daily_gtrend( keyword = c("7eleven", "3M"), geo = "US", from = "2013-01-01", to = "2013-02-01") head(daily_trend) ``` ### Absolute Volume When the application genuinely requires *absolute* counts rather than the relative index of @eq-gtrends-index—for example, to size a market rather than track its trajectory—the relative-interest endpoint is the wrong source. Google publishes a separate top-terms dataset through BigQuery that exposes absolute search figures, and that warehouse, not the Trends index, is the appropriate input for level estimates. ## Search Volume in China: Baidu Index {#sec-data-baidu} Outside the markets Google serves, the analogous behavioral signal comes from the dominant local search engine. In China that is Baidu, and the **Baidu Index** plays the role Google Trends plays elsewhere. Because Baidu exposes no comparable public API for bulk extraction, researchers have built crawlers to recover its search-volume series; @liu2019 document a Java-based spider for exactly this purpose. The same measurement caveats apply—an index is a rescaled, sampled proxy for latent interest, not a count—so the estimand-versus-observable discipline of @sec-data-gtrends carries over directly. ## Video Engagement: YouTube {#sec-data-youtube} Video is now a primary advertising and engagement channel, and YouTube exposes rich behavioral data—views, likes, comments, and watch metrics—at the video, playlist, and channel level. Two access paths exist, and the choice between them is a governance decision rather than a convenience one. The first path is the **public Data API**, which serves *publicly visible* statistics for any video or channel using a simple API key. The second is the **Analytics and Reporting APIs**, which serve *owner-private* metrics (impressions, audience retention, revenue) and therefore require OAuth authorization by the channel or content owner that owns the requested data. @tbl-youtube-apis contrasts the two; the operative rule is that anything a viewer cannot see requires the owner's delegated consent. | Dimension | Data API (key) | Analytics / Reporting (OAuth) | |-----------|----------------|-------------------------------| | Auth | API key | OAuth 2.0, owner-delegated | | Scope | Public statistics | Owner-private metrics | | Typical fields | views, likes, comments | impressions, retention, revenue | | Who can call | anyone with a key | the channel/content owner | | Use case | competitive/market scans | first-party performance analytics | : Two paths to YouTube data. The boundary is consent: public statistics need only a key, whereas owner-private analytics require OAuth authorization by the data owner. {#tbl-youtube-apis} ### OAuth Path via `tuber` The `tuber` package wraps the authenticated endpoints. Authorization is a one-time OAuth handshake with credentials provisioned in the Google Cloud console; thereafter the session token authorizes calls for video statistics, details, captions, search, and comment threads. ```{r youtube-oauth, eval=FALSE} library(tuber) # Provision app_id / app_secret in the Google Cloud console, then: yt_oauth(app_id = "YOUR_APP_ID", app_secret = "YOUR_APP_SECRET") get_stats(video_id = "N708P-A45D0") # public statistics get_video_details(video_id = "N708P-A45D0") # snippet metadata get_captions(video_id = "yJXTXN4xrI8") # caption track yt_search(term = "test") # search the corpus get_comment_threads(c(video_id = "N708P-A45D0")) # top-level comments get_all_comments(video_id = "a-UQz7fqR3w") # comments + replies ``` A common research task is to assemble a panel of *every* video in a channel. YouTube models a channel's uploads as a hidden playlist, so the recipe is: read the channel's `relatedPlaylists$uploads` id, page through that playlist's items to collect video ids, then map the statistics call over the ids and bind the results into a frame. ```{r youtube-channel, eval=FALSE} chan <- list_channel_resources( filter = c(channel_id = "UCT5Cx1l4IS3wHkJXNyuj4TA"), part = "contentDetails") uploads_id <- chan$items[[1]]$contentDetails$relatedPlaylists$uploads vids <- get_playlist_items(filter = c(playlist_id = uploads_id)) vid_ids <- as.vector(vids$contentDetails.videoId) stats_df <- do.call(rbind, lapply(vid_ids, function(id) data.frame(get_stats(id)))) head(stats_df) ``` ### Key Path via the Data API When only public statistics are needed, the raw REST endpoints are lighter than an OAuth session. The pattern is a small set of helper functions that build the request URL, parse the JSON, and return a tidy frame—one helper per resource type (video, playlist, channel). ```{r youtube-api, eval=FALSE} library(jsonlite) library(dplyr) API_key <- Sys.getenv("YOUTUBE_API_KEY") # never hard-code keys video_stats <- function(video_id, key) { url <- paste0("https://www.googleapis.com/youtube/v3/videos", "?part=snippet,statistics&id=", video_id, "&key=", key) res <- fromJSON(url) data.frame( name = res$items$snippet$channelTitle, res$items$statistics, title = res$items$snippet$title, date = res$items$snippet$publishedAt ) } channel_stats <- function(channel_id, key) { url <- paste0("https://www.googleapis.com/youtube/v3/channels", "?part=snippet,contentDetails,statistics&id=", channel_id, "&key=", key) res <- fromJSON(url) data.frame( name = res$items$snippet$title, res$items$statistics, uploads = res$items$contentDetails$relatedPlaylists$uploads ) } ``` These frames feed directly into comparative visualizations—total channel views, or the time path of comments and likes across competing channels—using the `ggplot2` idioms already shown for Google Trends. Because the engagement counts (`viewCount`, `commentCount`) arrive as character strings, they must be coerced to numeric before plotting. ::: {.callout-warning} A `requests`-and-`BeautifulSoup` scraper that pages through search-result HTML can retrieve video ids without an API key, but scraping rendered pages is brittle, violates many terms of service, and breaks whenever the front-end markup changes. Prefer the official endpoints; treat HTML scraping as a last resort for prototyping, never production. ::: ## Household Expenditure: The Consumer Expenditure Survey {#sec-data-ce} When the research question concerns *what households buy*—budget shares, category demand, the income elasticity of a product class—the canonical U.S. source is the Consumer Expenditure (CE) Survey administered by the Bureau of Labor Statistics. Its sampling design is worth understanding in detail, because the survey's structure dictates which estimands it can support and at what frequency. The CE's unit of observation is the **consumer unit**, defined as a single person or a group of persons who live together and share responsibility for most major expenditures. > A consumer unit is a single person, or a group of persons who live together and > who share responsibilities for most major expenditures. Within each unit, the **reference person** is the first individual named when the respondent answers, "Who is responsible for owning or renting this home?"—a convention that fixes the household head deterministically and makes cross-survey linkage well defined. The survey is not one instrument but two complementary ones, a design choice that follows directly from a measurement trade-off between recall accuracy and coverage. Large, infrequent purchases are remembered well but happen rarely; small, frequent purchases are forgotten quickly but dominate the budget in count. No single instrument measures both well, so the CE splits them: - The **Interview Survey** captures big-ticket and recurring expenditures, plus global estimates for some categories. It is a rotating-panel survey: each consumer unit is interviewed once every three months for four quarters, with roughly 6,000 units in the sample each quarter. The panel structure supports within-unit change over the year, but the four-quarter cap means it cannot follow a household across years. - The **Diary Survey** captures small-ticket, frequently purchased items that recall-based interviewing would miss. It is independent of the interview component and collects roughly 14,000 diaries from participating families each year. Published tabulations come in several time formats: **annual** tables run January–December (available since 1984) and, in a fiscal variant, July–June (since 2013); **two-year** tables span January of year one through December of year two (since 1986). Beyond the standard tables, BLS produces **experimental** research tabulations—covering, for instance, income detail or generational groups such as millennials and Gen X—available on request. The practical implication for design is that the rotating four-quarter panel is the right frame for within-year, within-household questions, while cross-year dynamics require pooling repeated cross-sections rather than following the same units. ## Inferring Demographics from Names {#sec-data-names} A recurring need in marketing research is to attach demographic attributes—gender, age, nationality—to records that carry only a name, as when a researcher has a roster of customers, reviewers, or social-media authors but no self-reported demographics. A family of services performs exactly this inference: gender from genderize.io, age from agify.io, and nationality from nationalize.io. Each returns, for a queried name, a predicted attribute together with a confidence. It is essential to read these outputs as what they are: **posterior probabilities from a name-frequency classifier**, not ground-truth labels. The underlying logic is a simple Bayesian classifier. Let $n$ be an observed name and $g$ a candidate attribute value (say, gender). The services effectively report $$ \Pr(g \mid n) \;=\; \frac{\Pr(n \mid g)\,\Pr(g)}{\displaystyle\sum_{g'} \Pr(n \mid g')\,\Pr(g')}, $$ {#eq-name-bayes} estimated from large registries of names tagged with the attribute, where $\Pr(n \mid g)$ is the frequency of name $n$ among individuals with value $g$ and $\Pr(g)$ the base rate of $g$ in the reference population. Three properties of @eq-name-bayes govern responsible use. The prediction inherits the **reference population's composition**: a classifier trained predominantly on one region's registries mispredicts systematically for names common elsewhere, so the base rate $\Pr(g)$ is not universal. The reported confidence is the **posterior itself**, so unisex or rare names yield diffuse, low-confidence posteriors that should not be hardened into deterministic labels. And because the inferred attribute is a *generated regressor*, using it as a covariate in a downstream model imports classification error into that model—the resulting attenuation and the need to propagate the first-stage uncertainty are treated in @sec-measurement-scales. Used with those caveats, name-based inference is a cheap way to recover coarse demographic structure; used naively, it manufactures measurement error correlated with name origin. ## Pitfalls and Identification {#sec-data-pitfalls} The sources in this chapter differ in unit, frequency, and provenance, but the threats to valid inference share a small number of structural causes. @fig-data-pitfalls maps each common pitfall to the source where it bites hardest and to its remedy. ```{mermaid} %%| label: fig-data-pitfalls %%| fig-cap: "Recurring threats to validity across the chapter's data sources and their remedies. Each threat is a gap between the estimand and the observable that the source actually delivers." flowchart LR A["Non-random missingness (unreported xad in Compustat)"] --> A2["Model disclosure; do not treat NULL as zero"] B["Relative, window-dependent index (Google Trends, Baidu)"] --> B2["Rescale via overlapping method (eq-gtrends-scaling)"] C["Sampling noise in indices (Trends draws)"] --> C2["Average repeated queries"] D["Generated-regressor error (name-based demographics)"] --> D2["Propagate first-stage uncertainty"] E["Recall vs. coverage trade-off (CE survey)"] --> E2["Match instrument to estimand (interview vs. diary)"] F["Rate limits / TOS (APIs, scrapers)"] --> F2["Budget requests; prefer official endpoints"] ``` The unifying discipline is the one stated at the outset: name the estimand, name the observable the source delivers, and treat the difference as a modeling problem rather than a nuisance. Reported advertising is not advertising; a relative index is not a volume; a name-based prediction is not a demographic. Each gap is tractable once it is made explicit, and ruinous when it is not. ## Key Takeaways - **Match the unit of observation to the estimand.** WRDS delivers firm-level financials, the CE Survey household expenditure, and search/engagement APIs individual behavioral traces; no source substitutes for another (@fig-data-landscape). - **Reported is not measured.** Compustat's `xad` records *disclosed* advertising; missingness is non-random and treating `NULL` as zero selects the sample on a variable correlated with outcomes (@sec-data-wrds). - **A relative index is not a volume.** Google Trends rescales to its window's maximum (@eq-gtrends-index); long daily series must be reconstructed by the overlapping-scaling method (@eq-gtrends-scaling), and absolute counts require a different source entirely. - **Survey design encodes a recall–coverage trade-off.** The CE splits big-ticket recall (interview panel) from frequent-purchase coverage (diary), so the right instrument depends on the category (@sec-data-ce). - **Name-based demographics are posteriors, not labels.** They follow a Bayesian name-frequency classifier (@eq-name-bayes) whose error is correlated with name origin and must be propagated downstream (@sec-data-names). - **Respect the platform contract.** Owner-private metrics require OAuth, public ones a key (@tbl-youtube-apis); honor rate limits and prefer official endpoints over brittle scrapers.

29.1 Firm Financial Data: WRDS

29.2 Local Document Stores: MongoDB

29.3 Search Interest: Google Trends

29.3.1 What Google Trends Measures

29.3.2 Reconstructing a Long Daily Series

29.3.3 Absolute Volume

29.4 Search Volume in China: Baidu Index

29.5 Video Engagement: YouTube

29.5.1 OAuth Path via tuber

29.5.2 Key Path via the Data API

29.6 Household Expenditure: The Consumer Expenditure Survey

29.7 Inferring Demographics from Names

29.8 Pitfalls and Identification

29.9 Key Takeaways

29.5.1 OAuth Path via `tuber`