28 Metrics
Marketing research lives and dies by measurement. A construct that cannot be operationalized into a reproducible number cannot enter a regression, cannot be priced by equity markets, and cannot be defended to a sceptical reviewer or a sceptical chief financial officer. This chapter is a working catalogue of the metrics that quantitative marketing scholars actually compute—the financial indicators that connect marketing actions to firm value, the firm-level controls drawn from accounting data, and the increasingly text- and image-based marketing constructs that machine learning has made measurable at scale. It is also a methods chapter: for the harder constructs—marketing capability above all—we give the estimator, the identifying assumptions, and runnable code that reconstructs the measure from raw Compustat and patent data.
The organizing distinction is between outcomes and inputs to outcomes. A metric such as return on marketing investment is an outcome the firm wants to maximize; metrics such as firm size, leverage, and profitability are covariates that confound or moderate the marketing–performance relationship and must be controlled. A third class—capability metrics—are neither directly observed nor simple ratios of accounting items; they are latent efficiencies recovered by econometric frontier estimation, and they demand the most care. We treat the three classes in turn, leading each with the intuition for what the number is supposed to capture and following immediately with its formal definition, its data source, and the assumptions under which it is identified.
Throughout, the practical substrate is the Wharton Research Data Services (WRDS) ecosystem: the Center for Research in Security Prices (CRSP) for market data, Compustat for accounting fundamentals, Thomson Financial 13F filings for institutional ownership, and the WRDS U.S. patents database for innovation output. The reader who works through the code will end with a firm-year panel on which any of the marketing–finance models in Chapter 23 can be estimated.
28.1 Financial Metrics
Financial metrics translate a marketing intervention into the language that capital markets and corporate boards speak. They fall into three families: return measures that scale profit by the capital deployed to earn it, value-creation measures that ask whether the firm earned more than its cost of capital, and covariate measures that characterize the firm’s size, risk, and profitability for use as controls.
28.1.1 Return on Investment and Return on Marketing Investment
The most elementary return metric scales net profit by the capital that produced it. Return on investment (ROI) is
\[ \text{ROI} = \frac{\text{Net Profit}}{\text{Investment}}, \tag{28.1}\]
a unitless ratio interpretable as the profit earned per dollar committed. ROI is attractive precisely because it is dimensionless and therefore comparable across projects of different scale, but that virtue is also its central weakness: it discards the magnitude of the investment, so a tiny project with a spectacular percentage return can dominate a large project that creates far more absolute value. ROI is a ranking device, not a value-maximization objective.
For marketing specifically, the analogue isolates the incremental effect of the marketing dollar. Return on marketing investment (ROMI), sometimes written ROIM, is
\[ \text{ROMI} = \frac{\text{IRAM} \times \text{CM} - \text{MS}}{\text{MS}}, \tag{28.2}\]
where IRAM is the incremental revenue attributable to marketing, CM is the contribution margin (the fraction of each revenue dollar left after variable costs), and MS is marketing spending. A positive ROMI signals that the marketing program returned more contribution than it cost. The deceptively simple numerator hides the entire identification problem of the field: incremental revenue is a counterfactual quantity, the revenue that would not have occurred absent the marketing, and recovering it requires either an experiment or a credible model of the no-marketing baseline. Naively crediting marketing with all post-campaign revenue ignores the fact that firms tend to spend more when demand is already high, and it inflates ROMI accordingly. That is why the construct is better understood as a target for causal estimation than as an accounting ratio.
ROI and ROMI are summary statistics of a causal effect, not the effect itself. Their managerial appeal—a single comparable percentage—is exactly what makes them easy to game by manipulating the denominator (under-counting the true cost base) or by attributing organic demand to the campaign. Report them alongside the identification strategy that produced the incremental numerator.
28.1.2 Economic Value Added
ROI compares profit to capital but is silent on whether that profit cleared the cost of capital. Economic value added (EVA), also called economic profit, fills that gap. It measures the residual wealth a firm generates after deducting a charge for all the capital it employs, equity as well as debt, on an after-tax basis:
\[ \text{EVA} = \text{NOPAT} - (\text{Invested Capital} \times \text{WACC}). \tag{28.3}\]
Here net operating profit after taxes (NOPAT) equals operating profit times one minus the tax rate; invested capital is the sum of debt, capital leases, and shareholders’ equity (equivalently, equity plus long-term debt measured at the start of the period); and WACC is the weighted average cost of capital, the blended return the firm must pay its providers of capital. The product \(\text{WACC} \times \text{Invested Capital}\) is the finance charge: the opportunity cost of the funds tied up in the business. EVA is positive only when operating performance exceeds that charge, which is the precise sense in which positive EVA means the firm created value rather than merely earned a profit.
The cost of capital itself is a weighted average over the firm’s capital structure,
\[ \text{WACC} = \frac{K_e \, E}{E + D} + \frac{K_d \,(1 - t)\, D}{E + D}, \tag{28.4}\]
where \(E\) and \(D\) are the market values of equity and debt, \(K_e\) is the required return on equity, \(K_d(1-t)\) is the after-tax return on debt, and \(t\) is the marginal tax rate. The after-tax adjustment on debt reflects the tax deductibility of interest.
Because invested capital can equivalently be written as total assets net of current liabilities, a common balance-sheet implementation of Equation 28.3 is
\[ \text{EVA} = \text{NOPAT} - (\text{total assets} - \text{current liabilities}) \times \text{WACC}. \tag{28.5}\]
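As a quick numerical check of Equations 28.3 and 28.4, the sketch below computes WACC and EVA for a hypothetical firm. Every input (capital structure, required returns, tax rate, operating profit) is invented for illustration, market and book values of capital are assumed to coincide, and the function names `wacc()` and `eva()` are ours rather than part of any package.

```r
# Weighted average cost of capital (Equation 28.4).
wacc <- function(E, D, k_e, k_d, tax) {
  (k_e * E + k_d * (1 - tax) * D) / (E + D)
}

# Economic value added (Equation 28.3): NOPAT minus the finance charge.
eva <- function(operating_profit, tax, invested_capital, wacc) {
  nopat <- operating_profit * (1 - tax)
  nopat - invested_capital * wacc
}

# Hypothetical firm, figures in $ millions.
w <- wacc(E = 600, D = 400, k_e = 0.10, k_d = 0.06, tax = 0.25)
firm_eva <- eva(operating_profit = 200, tax = 0.25,
                invested_capital = 1000, wacc = w)

w         # 0.078
firm_eva  # 72
```

The firm earns a NOPAT of 150 against a finance charge of 78, so it clears its cost of capital by 72; with the same operating profit and a WACC above 15 percent, EVA would turn negative.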
EVA rests heavily on the book value of invested capital, which makes it most informative for asset-rich firms whose balance sheets capture the bulk of the resources they deploy. For firms whose value resides in intangibles—software, brands, data, organizational capital—book invested capital understates the true capital base, and EVA correspondingly overstates value creation. The intangibles problem is not a minor caveat; it is the reason marketing scholars increasingly favor market-based metrics over EVA for technology and consumer-brand firms.
28.1.3 Market Value Added
Where EVA is a flow concept measured each period, market value added (MVA) is the corresponding stock: the cumulative wealth the firm has created for its capital providers since inception. It is the gap between what the market says the firm’s claims are worth and what investors originally contributed:
\[ \text{MVA} = \text{market value of shares (or enterprise value)} - \text{book value of shareholders' equity}. \tag{28.6}\]
A firm with persistently positive EVA accumulates positive MVA; under the residual-income identity, MVA equals the present value of all expected future EVA. MVA thus sidesteps the period-by-period WACC estimation that EVA requires, at the cost of inheriting all the noise of market expectations: it tells you what the market believes the firm has created, not what it has demonstrably created.
28.1.4 Firm-Level Covariates from Accounting Data
Most marketing–finance studies are observational, so the credibility of any estimated marketing effect rests on controlling for the firm characteristics that jointly drive marketing decisions and financial outcomes. The literature has converged on a standard battery of accounting-based controls, each with a canonical operationalization. We collect them here with their definitions, their empirical pedigree, and the source items in Compustat.
Profitability is conventionally measured as operating income before depreciation scaled by total assets, a return-on-assets variant that strips out the financing and depreciation choices that contaminate net income (Grewal et al. 2008; McAlister et al. 2016):
\[ \text{profitability} = \frac{\text{operating income before depreciation}}{\text{total assets}}. \tag{28.7}\]
Firm size enters as the natural logarithm of total assets, the log transformation taming the extreme right skew of the firm-size distribution and giving the coefficient a proportional interpretation (an elasticity when the outcome is also logged, a semi-elasticity otherwise) (Grewal, Chandrashekaran, and Citrin 2010; McAlister et al. 2016; Nezami, Worm, and Palmatier 2018):
\[ \text{firm size} = \log(\text{total assets}). \tag{28.8}\]
Sales growth, the percentage change in gross sales, proxies for the firm’s demand-side momentum and growth opportunities (Grewal, Chandrashekaran, and Citrin 2010; Nezami, Worm, and Palmatier 2018; Rao, Agarwal, and Dahlhoff 2004). Cash flow, measured as the log of operating cash flow in millions, captures internal financing capacity and is a standard control in studies linking marketing to firm risk (Chakravarty and Grewal 2011; Malshe and Agarwal 2015).
Financial leverage scales long-term debt by the book value of assets, indexing the firm’s reliance on debt financing and its exposure to financial distress (Kashmiri and Mahajan 2017; Chakravarty and Grewal 2011):
\[ \text{financial leverage} = \frac{\text{long-term debt}}{\text{book value of assets}}. \tag{28.9}\]
Abnormal stock return is frequently dichotomized: a dummy equal to one when a firm’s stock return exceeds its industry-averaged return marks out-performers in a way robust to the heavy tails of raw returns (Markovitch, Steckel, and Yeung 2005; Chakravarty and Grewal 2011, 2016).
Beyond these workhorses, specialized studies build bespoke financial constructs. Unexpected size-adjusted advertising investment—the residual from a model of expected advertising given firm size—isolates the surprise component of marketing spend that markets have not already priced (Chakravarty and Grewal 2016; Kim and McAlister 2011; Liu, Shankar, and Yun 2017). Shareholder complaints, drawn from the RiskMetrics governance database, proxy for investor dissatisfaction and governance friction (Wies et al. 2019).
28.1.5 Book Equity
Book equity (BE) is a deceptively intricate construct because the “right” measure depends on data availability, and the field’s conventions trace to the Fama–French data library. The canonical definition is worth quoting in full:
“BE is the book value of stockholders’ equity, plus balance sheet deferred taxes and investment tax credit (if available), minus the book value of preferred stock. Depending on availability, we use the redemption, liquidation, or par value (in that order) to estimate the book value of preferred stock. Stockholders’ equity is the value reported by Moody’s or Compustat, if it is available. If not, we measure stockholders’ equity as the book value of common equity plus the par value of preferred stock, or the book value of assets minus total liabilities (in that order).” (Davis, Fama, and French 2000)
The hierarchy of fallbacks is the point: preferred stock is valued by redemption, then liquidation, then par; stockholders’ equity is taken from the cleanest available source and reconstructed only when necessary. The Compustat implementation coalesces the preferred-stock items in order of preference and adds deferred taxes.
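A minimal sketch of that fallback hierarchy follows, written in base R on a toy data frame. The column names are the usual Compustat mnemonics (`seq`, `ceq`, `pstk`, `at`, `lt`, `txditc`, `pstkrv`, `pstkl`); the helper names and the toy values are hypothetical, and setting preferred stock and deferred taxes to zero when they are missing is our assumption rather than something the quotation prescribes.

```r
# Element-wise "first non-missing value" across several vectors.
first_available <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

book_equity <- function(d) {
  # Stockholders' equity: reported value, else common equity plus par value
  # of preferred stock, else assets minus liabilities (in that order).
  se <- first_available(d$seq, d$ceq + d$pstk, d$at - d$lt)
  # Preferred stock: redemption, then liquidation, then par value.
  # Assumption: treated as zero when all three are missing.
  ps <- first_available(d$pstkrv, d$pstkl, d$pstk, 0)
  # Deferred taxes and investment tax credit: added only if available.
  dt <- ifelse(is.na(d$txditc), 0, d$txditc)
  se + dt - ps
}

# Toy firm-years exercising each branch of the hierarchy.
toy <- data.frame(
  seq    = c(100, NA, NA),
  ceq    = c(95, 80, NA),
  pstk   = c(5, 20, 12),
  at     = c(400, 300, 500),
  lt     = c(300, 200, 350),
  txditc = c(10, NA, 4),
  pstkrv = c(5, NA, NA),
  pstkl  = c(5, NA, 15)
)

book_equity(toy)  # 105 80 139
```

The second row falls back to common equity plus par preferred and values preferred stock at par; the third reconstructs equity from assets minus liabilities and values preferred stock at liquidation value.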
28.1.6 Net Contribution
The bridge from marketing spend to profit is the net contribution function, which underlies the advertising-budgeting and response-modeling literature. Net contribution is gross margin times sales revenue, less the cost of the marketing that produced those sales:
\[ \text{NC} = m \times S(a) - k a, \tag{28.10}\]
where \(m\) is gross margin, \(S(a)\) is the sales-response function mapping marketing effort \(a\) to sales, and \(k\) is the unit cost of effort. The first-order condition \(m \, S'(a^\*) = k\) characterizes the optimal effort \(a^\*\): spend until the marginal gross profit from an additional unit of effort equals its marginal cost. The shape of \(S(\cdot)\)—concave, S-shaped, or saturating—governs whether an interior optimum exists, which is the recurring substantive question of the advertising-response literature.
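To see the first-order condition at work, the sketch below assumes a saturating (modified-exponential) response, \(S(a) = S_{\max}(1 - e^{-ba})\). This is only one concave special case, not a form the text commits to, and all parameter values are hypothetical. Under that assumption the condition \(m\,S'(a^\*) = k\) has the closed form \(a^\* = \ln(m\,S_{\max}\,b / k)/b\), and a numerical maximization of Equation 28.10 should agree with it.

```r
m     <- 0.4    # gross margin (hypothetical)
k     <- 1      # unit cost of marketing effort (hypothetical)
s_max <- 1000   # saturation level of sales (hypothetical)
b     <- 0.05   # responsiveness of sales to effort (hypothetical)

# Assumed sales-response function and the resulting net contribution.
sales <- function(a) s_max * (1 - exp(-b * a))
net_contribution <- function(a) m * sales(a) - k * a

# Closed-form optimum from m * S'(a) = k under the assumed response.
a_star <- log(m * s_max * b / k) / b

# Numerical optimum of Equation 28.10 over a bounded search interval.
opt <- optimize(net_contribution, interval = c(0, 500),
                maximum = TRUE, tol = 1e-8)

c(closed_form = a_star, numerical = opt$maximum)
```

An interior optimum exists here only because \(m\,S_{\max}\,b > k\); if the first unit of effort does not pay for itself, the logarithm is negative and the best choice is to spend nothing.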
28.1.7 A Reference Map of Compustat Items
The metrics above all reduce to combinations of a relatively small set of WRDS data items. Because reproducing any one of them requires knowing the exact item mnemonic and its source file, Table 28.1 assembles the core mapping. All items without a CRSP annotation come from Compustat Fundamentals Annual; the mnemonics follow the WRDS data-items reference.1
| Metric | Data item | Source file |
|---|---|---|
| Market value of equity (calendar year-end) | PRCC_C x CSHO | CRSP/Compustat Merged |
| Capital intensity | CAPX / AT | Cash flow; balance sheet |
| Cash flow | (IBC + DP) / AT | Cash flow; income statement |
| Cash holdings | CHE / AT | Balance sheet |
| Cost of capital (proxy) | XINT / DLC | Income statement; balance sheet |
| Earnings per share | NI / CSHO | Income statement; misc. |
| Firm size | log(AT) | Balance sheet |
| Leverage | (DLTT + DLC) / SEQ | Balance sheet |
| Market-to-book ratio | PRCC_F / BKVLPS | Supplemental; balance sheet |
| Market value | MKVALT or CSHO x PRCC_F | Supplemental; misc. |
| Payout ratio | (DVP + DVC + PRSTKC) / IB | Income statement; cash flow |
| R&D intensity | XRD / AT | Income statement; balance sheet |
| Return on assets (ROA) | NI / AT | Income statement; balance sheet |
| Return on equity (ROE) | NI / CEQ | Income statement; balance sheet |
| Return on investment (ROI) | NI / ICAPT | Income statement; balance sheet |
| Tangibility | PPENT / AT | Balance sheet |
| Tobin’s Q | (AT + CSHO x PRCC_F - CEQ) / AT | Balance sheet; supplemental |
| Total equity | PSTK + CEQ | Balance sheet |
A second, finance-oriented battery—used in studies of corporate investment, financing, and payout policy—follows the conventions catalogued by Kahle and Stulz (2017). Table 28.2 reproduces the construction rules, which differ from Table 28.1 chiefly in their use of lagged assets in denominators (to avoid mechanical contemporaneous correlation) and in the explicit treatment of missing R&D as zero.
| Category | Metric | Construction |
|---|---|---|
| Valuation | Tobin’s Q | (AT + CSHO x PRCC_F - CEQ) / AT |
| Valuation | Market cap | prc x shrout (CRSP) |
| Valuation | Revenue Herfindahl | sum_i (revt_i / sum(revt))^2 within 3-digit NAICS x year |
| Investment | CapEx / assets | capx / lag(at) |
| Investment | R&D / assets | xrd / lag(at); missing R&D set to 0 |
| Investment | Fixed assets / assets | ppent / at |
| Investment | Cash / assets | che / at |
| Profitability | Operating cash flow / assets | (oibdp - xint - txt) / lag(at) |
| Profitability | ROA | ib / at |
| Financing | Book leverage | (dltt + dlc) / at |
| Financing | Market leverage | (dltt + dlc) / (at - ceq + csho x prcc_f) |
| Financing | Net leverage | (dltt + dlc - che) / at |
| Financing | Net equity issuance | (sstk - prstkc) / lag(at) |
| Ownership | Institutional ownership | % shares held by institutions (13F) |
| Ownership | Blockholder | institution holding >= 10% of shares (13F) |
| Payout | Dividends / assets | dvc / lag(at) |
| Payout | Repurchase / assets | (prstkc - pstk) / lag(at) |
| Payout | Total payout / assets | (dvc + prstkc) / lag(at) |
Institutional-ownership and blockholder variables come from Thomson Financial 13F filings; the macro denominator for the market-cap/GDP ratio is series GDPA from the U.S. Bureau of Economic Analysis.
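The lagged-asset denominators and the zero-fill for missing R&D in Table 28.2 are easy to get wrong in a panel, so a small base-R sketch may help. The data frame is a toy with hypothetical values, `lag_within()` is our own helper, and the lag assumes each firm's fiscal years are consecutive; with gaps in the panel the previous row is not the previous year and the lag should be built on an explicit year match instead.

```r
panel <- data.frame(
  gvkey = c("001", "001", "001", "002", "002"),
  fyear = c(2015, 2016, 2017, 2016, 2017),
  at    = c(100, 120, 150, 50, 60),
  capx  = c(10, 12, 18, 4, 6),
  xrd   = c(5, NA, 9, NA, NA),
  dltt  = c(30, 35, 40, 10, 12),
  dlc   = c(5, 5, 10, 2, 3)
)

# Sort so that the previous row within a firm is the previous fiscal year.
panel <- panel[order(panel$gvkey, panel$fyear), ]

# Within-firm lag: the first observation of each firm has no lag.
lag_within <- function(x, g) ave(x, g, FUN = function(v) c(NA, head(v, -1)))
panel$lag_at <- lag_within(panel$at, panel$gvkey)

panel$xrd0          <- ifelse(is.na(panel$xrd), 0, panel$xrd)  # missing R&D -> 0
panel$capx_at       <- panel$capx / panel$lag_at
panel$rd_at         <- panel$xrd0 / panel$lag_at
panel$book_leverage <- (panel$dltt + panel$dlc) / panel$at

panel[, c("gvkey", "fyear", "capx_at", "rd_at", "book_leverage")]
```

Each firm's first year drops out of the lagged ratios by construction, which is the price of avoiding the mechanical contemporaneous correlation between the numerator and total assets.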
28.1.8 Industry Concentration and Diversity
Several of the metrics above require an industry-concentration index, and the literature has settled on the Herfindahl form. For revenue shares \(s_i\) within an industry, the Herfindahl–Hirschman index is \(\sum_i s_i^2\); the revenue-Herfindahl in Table 28.2 computes this within each 3-digit NAICS industry and year. The same quantity, applied to sector market shares, is Simpson's concentration index familiar from ecology: the two are algebraically identical, both equal to the probability that two units drawn at random (with replacement) belong to the same category. Its complement, \(1 - \sum_i s_i^2\), is the Simpson diversity index. High Herfindahl (low Simpson diversity) signals a concentrated industry; the index thus does double duty as a competition control and as a measure of how diversified a firm’s or market’s activity is across sectors.
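The revenue-Herfindahl can be sketched in a few lines of base R. The toy firms, revenues, and the helper name `hhi()` are hypothetical; the grouping keys mirror the 3-digit NAICS by year construction described above.

```r
# Herfindahl-Hirschman index: sum of squared revenue shares.
hhi <- function(revenue) {
  shares <- revenue / sum(revenue)
  sum(shares^2)
}

firms <- data.frame(
  naics3 = c("311", "311", "311", "334", "334"),
  year   = 2015,
  revt   = c(50, 30, 20, 90, 10)
)

# One index value per industry-year.
hhi_by_industry <- aggregate(revt ~ naics3 + year, data = firms, FUN = hhi)
names(hhi_by_industry)[3] <- "hhi"
hhi_by_industry
```

In this toy panel the three-firm industry scores 0.38 and the two-firm industry, dominated by one seller, scores 0.82. With \(n\) equal-sized firms the index equals \(1/n\), which gives a useful floor when reading values.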
28.2 Marketing Metrics
The financial metrics above are computed from structured accounting data. The marketing constructs in this section are different in kind: trust, sentiment, willingness to pay, purchase intention, and brand reputation are psychological states that classically required surveys to measure but are now increasingly recovered from unstructured text and images by machine learning. The methodological frontier here is the use of pre-trained language and vision models to convert social-media content into validated marketing measures at a scale surveys cannot reach.
28.2.1 Trust
Trust is the willingness to rely on an exchange partner in whom one has confidence, and it has long been measured by multi-item attitudinal scales. The frontier substitutes behavioral and relational signals in social-media data for self-report: Roy et al. (2017) develop an algorithmic measure of brand trust from the structure and content of consumers’ social-media interactions, demonstrating that trust leaves a computational trace that can be extracted without surveying anyone.
28.2.2 Sentiment
Sentiment—the valence of expressed affect—is core to human communication and is the single most demanded text-analytic measure in marketing, applied to social media, news, customer feedback, and corporate communications. The central practical question is which method to use, because the menu ranges from simple lexicons that map words to polarity scores to transfer-learning language models that are far more accurate but far more demanding.
Hartmann et al. (2023) resolve this question empirically with a meta-analysis spanning 272 datasets and roughly twelve million sentiment-labeled documents. Their headline finding is that transfer-learning models (pre-trained transformers fine-tuned on sentiment) deliver the best performance, outperforming lexicons by more than twenty percentage points in accuracy on average. The advantage is not uniform: it widens with the number of sentiment classes and is moderated by text length, and the leaderboard-topping benchmark model is not always the best choice for a given research setting. Crucially for reproducibility, the authors supply a pre-trained model (SiEBERT) and open-source scripts, lowering the barrier to applying state-of-the-art sentiment analysis. The practical lesson is that method choice should be made deliberately against the research question, the data, and the available computational resources, not by reflexively reaching for a lexicon because it is convenient.
28.2.3 Willingness to Pay
Willingness to pay (WTP), the maximum price a consumer is willing to pay for a good, is the demand-side primitive that underlies pricing and welfare analysis. Recovering it from naturally occurring text rather than from elicitation experiments is an active frontier; He, Anderson, and Rucker (2023) represent recent work in this direction, extracting WTP signals from expressed consumer language.
28.2.4 Purchase Intention
Purchase intention—the self-reported likelihood of buying—is a leading indicator of behavior and a workhorse dependent variable. Hartmann et al. (2021) show that it can be inferred from images, not just text, and in doing so overturn a common assumption about social-media metrics. Smartphones have made it trivial for consumers to share branded imagery, and the authors classify that imagery into three types using convolutional neural networks: packshots (the product alone), consumer selfies (a consumer’s face shown with the brand), and brand selfies (the product held from the consumer’s own visual perspective, with no consumer face visible). Applying language models to social-media responses across more than 250,000 brand-image posts from 185 brands on Twitter and Instagram, they find a revealing dissociation: consumer selfies generate more likes and comments, but brand selfies induce higher purchase intentions. Engagement metrics and purchase intent diverge.
The dissociation has managerial bite. In a display-advertising field test, brand selfies earned higher click-through rates than consumer selfies, and a laboratory experiment traced the mechanism to self-reference: the first-person perspective of the brand selfie invites the viewer to imagine holding the product themselves. The broader methodological point is that machine learning can decode marketing-relevant constructs from multimedia content, and that traditional engagement counts (likes, comments) may mislead about the constructs managers actually care about. The purchase-intention classifier is released as a fine-tuned RoBERTa model.2
28.2.5 Brand Reputation
Brand reputation—the aggregate esteem in which a brand is held—has migrated from survey trackers to real-time social listening. Rust et al. (2021) measure brand reputation directly from Twitter data, demonstrating that a construct historically captured by periodic, expensive surveys can be estimated continuously from public conversation, with the attendant gains in timeliness and the attendant risks of platform-specific selection.
28.3 Marketing Capability
Capability is the most demanding metric in this chapter and the one that most rewards careful estimation, because it is not observed at all—it is a latent efficiency that must be inferred from the gap between what a firm achieves and what the best firms achieve with comparable inputs. The construct originates with Dutta, Narasimhan, and Rajiv (1999), who define a firm’s capability as “its ability to deploy the resources (inputs) available to it to achieve the desired objective(s) (output).” The higher a firm’s functional capability, the more efficiently it converts its inputs into the relevant functional output; equivalently, the lower its functional inefficiency, the higher its capability. This input–output framing is what makes stochastic frontier analysis the natural estimator.
28.3.1 The Substantive Argument
The motivating insight of Dutta, Narasimhan, and Rajiv (1999) is that in high-technology markets, raw technological prowess is not enough. A firm can possess formidable research and development (R&D) capability—generating a stream of high-quality innovations—yet fail commercially because it lacks the marketing capability to translate those innovations into products consumers value and buy. Marketing capability has its largest effect on quality-adjusted innovation output precisely for firms with a strong technological base: the firms that benefit most from great marketing capability are those that already have a strong R&D foundation, because only they have innovations worth commercializing. The interaction of marketing and R&D capabilities is therefore the single most important determinant of firm performance—high-technology firms must be able both to generate innovation continuously and to commercialize it.
This argument dictates the estimation strategy. Three capabilities are estimated jointly because they feed one another: marketing capability drives sales from a firm’s technological base, advertising stock, marketing stock, customer relationships, and installed base; R&D capability drives quality-adjusted technological output; and operations capability drives the cost of production. The functional relationships Dutta, Narasimhan, and Rajiv (1999) posit are, schematically,
\[ \text{Sales} = f(\text{technological base},\ \text{advertising stock},\ \text{marketing stock},\ \text{customer relationships},\ \text{installed base}), \]
\[ \text{Quality-adjusted output} = f(\text{technological base},\ \text{cumulative R\&D},\ \text{marketing capability}), \]
\[ \text{Cost of production} = f(\text{output},\ \text{cost of capital},\ \text{labor cost},\ \text{technological base},\ \text{marketing capability}). \]
The appearance of marketing capability inside the R&D and operations equations is the formal expression of the substantive claim that the capabilities are interdependent, not separable.
Figure 28.1 makes the recursive structure explicit.
```mermaid
flowchart TD
    A[Advertising stock] --> M[Marketing frontier:<br/>log sales]
    MK[Marketing stock] --> M
    TB[Technological base] --> M
    REC[Receivables / CRM] --> M
    IB[Installed base] --> M
    M --> ME[Marketing efficiency]
    ME --> R[R&D frontier:<br/>log tech output]
    RD[R&D stock] --> R
    TB --> R
    R --> RE[R&D efficiency]
    ME --> O[Operations frontier:<br/>log COGS]
    LC[Labor cost] --> O
    CC[Cost of capital] --> O
    TB --> O
    O --> OE[Operations efficiency]
```
28.3.2 Stock Variables and the Koyck Transformation
Marketing and innovation investments do not affect sales only in the year they are incurred; their effect decays geometrically over time. The standard device for turning a flow of expenditure into a stock is the Koyck geometric-lag transformation (Koyck 1954). For an expenditure flow \(x_t\) and a retention rate \(\lambda \in (0,1)\), the stock is
\[ \text{Stock}_t = \sum_{j=0}^{t-1} \lambda^{j} \, x_{t-j}, \tag{28.11}\]
so that each past dollar contributes \(\lambda^j\) of its original force after \(j\) years. The advertising-stock literature anchors \(\lambda\) empirically: weights of 0.4 (Peles 1971) and 0.5 (Z. Wang and Kim 2017) are standard for advertising stock, and Dutta, Narasimhan, and Rajiv (2005) use a weight of 0.5 for marketing expenditure and 0.4 for R&D expenditure (their p. 281). The choice of \(\lambda\) is consequential—too high a retention rate over-credits ancient spending—and should be defended against the estimated carryover in the relevant category rather than imposed by habit.
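Equation 28.11 is most conveniently computed by its recursive form, \(\text{Stock}_t = x_t + \lambda\,\text{Stock}_{t-1}\), which is algebraically the same sum. The sketch below is a minimal base-R version; the function name `koyck_stock()` and the spending series are hypothetical, the series is assumed to be a single firm's spending sorted by year with no missing values, and the stock is initialized at the first observed year (so spending before the sample window is ignored).

```r
koyck_stock <- function(x, lambda) {
  stopifnot(lambda > 0, lambda < 1)
  # Stock_t = x_t + lambda * Stock_{t-1}, starting from Stock_1 = x_1.
  Reduce(function(stock_prev, x_t) x_t + lambda * stock_prev,
         x, accumulate = TRUE)
}

adv <- c(100, 0, 0, 40)          # hypothetical advertising flow, one firm
koyck_stock(adv, lambda = 0.5)   # 100.0 50.0 25.0 52.5

# Same value for the last year, computed directly from Equation 28.11.
sum(0.5^(0:3) * rev(adv))        # 52.5
```

In a firm-year panel the function would be applied within each firm after sorting by year. Because the stock starts at the first observed year, early-sample stocks are understated, and the understatement is larger the higher the retention rate.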
28.3.3 Measuring R&D Output: Innovativeness and Width
A raw patent count is a poor measure of technological output because patents differ enormously in quality; Dutta, Narasimhan, and Rajiv (1999) deliberately avoid it in favor of two citation-based, quality-adjusted measures. The first, innovativeness, follows the citation-weighting tradition of Trajtenberg (1990b) and Trajtenberg (1990a): a patent that is cited often is more valuable, so patents are weighted by how far their citation count exceeds the industry norm. The second, width of applicability, follows Jaffe, Trajtenberg, and Henderson (1993): a patent cited by firms in other industries has broader applicability, so patents are weighted by the share of their citations that come from outside the focal industry.
Concretely, the innovativeness-adjusted output is built in three steps. First, compute the average number of citations received by all sample patents within an industry (defined at one-, two-, three-, and four-digit SIC granularity), whereas the original study used a single mean across all firms and years. Second, weight each firm’s patent by its citation count divided by that industry-sample average. Third, sum the citation-weighted patents within a firm-year. The width-adjusted output parallels this: for each patent compute the proportion of its citations originating outside the focal SIC code; weight the patent by that proportion divided by the industry-average proportion; and sum the weighted patents within a firm-year.
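The three steps can be sketched on a toy patent table in base R. Everything here is hypothetical: the column names (`cites`, `cites_outside`, `sic2`), the values, and the choice of two-digit SIC as the industry definition. Treating an uncited patent as having an outside-citation share of zero is also our assumption, made only to avoid dividing by zero.

```r
patents <- data.frame(
  patnum        = 1:7,
  gvkey         = c("A", "A", "B", "B", "C", "C", "D"),
  year          = 2015,
  sic2          = c("35", "35", "35", "35", "36", "36", "36"),
  cites         = c(12, 0, 3, 5, 2, 6, 4),   # citations received
  cites_outside = c(6, 0, 0, 5, 1, 3, 4)     # of which from other industries
)

# Step 1: industry-sample average citations per patent.
ind_mean_cites <- ave(patents$cites, patents$sic2, FUN = mean)
# Step 2: weight each patent by its citations relative to that average.
patents$w_innov <- patents$cites / ind_mean_cites

# Width: share of a patent's citations that come from outside its industry,
# scaled by the industry-average share (uncited patents get a share of 0).
share_out <- ifelse(patents$cites > 0,
                    patents$cites_outside / patents$cites, 0)
ind_mean_share <- ave(share_out, patents$sic2, FUN = mean)
patents$w_width <- share_out / ind_mean_share

# Step 3: sum the weighted patents within each firm-year.
rd_output <- aggregate(cbind(w_innov, w_width) ~ gvkey + year,
                       data = patents, FUN = sum)
rd_output
```

Firms A and B each hold two patents in the same industry, yet A scores higher on innovativeness (2.4 against 1.6) and B higher on width, which is the kind of separation a raw patent count cannot deliver.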
28.3.4 The Stochastic Frontier Estimator
The three capabilities are recovered by stochastic frontier analysis (SFA), which is the right tool precisely because capability is defined as efficiency relative to a best-practice frontier. For firm \(i\) in year \(t\) with output \(y_{it}\) and inputs \(\mathbf{x}_{it}\), the production frontier is
\[ \log y_{it} = \mathbf{x}_{it}' \boldsymbol{\beta} + v_{it} - u_{it}, \tag{28.12}\]
where the composed error decomposes into a symmetric noise term \(v_{it} \sim \mathcal{N}(0, \sigma_v^2)\) and a one-sided inefficiency term \(u_{it} \ge 0\). The frontier \(\mathbf{x}_{it}'\boldsymbol{\beta} + v_{it}\) is the maximum attainable output; the firm falls short of it by \(u_{it}\). Capability is then the technical efficiency
\[ \text{TE}_{it} = \exp(-u_{it}) \in (0, 1], \tag{28.13}\]
recovered as the conditional expectation \(\mathbb{E}[\exp(-u_{it}) \mid v_{it} - u_{it}]\) following Jondrow and others. For a production frontier, inefficiency decreases output, so the sign on \(u_{it}\) is negative (ineffDecrease = TRUE in the code below); for a cost frontier (used for the operations capability, where the output is the cost of goods sold to be minimized) inefficiency increases cost and the sign reverses (ineffDecrease = FALSE).
The identifying assumptions are exactly those that make SFA both powerful and fragile. First, the inefficiency term must be one-sided and distributionally specified (half-normal, truncated-normal, or exponential); the efficiency estimates are not robust to gross misspecification of this distribution. Second, the inputs \(\mathbf{x}_{it}\) must be exogenous to \(u_{it}\): if firms with high latent capability systematically invest more, the frontier is biased and the recovered efficiencies absorb the endogeneity. Third, the log specification requires strictly positive inputs and outputs, which is why the data pipeline below replaces zeros and missing stocks with small positive constants. None of these assumptions is innocuous, and the interpretation of the recovered efficiencies as “capability” is only as good as the specification of the frontier.
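To make the conditional expectation in Equation 28.13 concrete, the sketch below evaluates it for a production frontier under the half-normal assumption, using the standard closed form for the normal/half-normal composed error \(\varepsilon = v - u\). The function name and all numerical values are hypothetical; in an application \(\sigma_u\) and \(\sigma_v\) come from the fitted frontier rather than being supplied by hand, and a different inefficiency distribution would require a different formula.

```r
# E[exp(-u) | eps] for eps = v - u, with v ~ N(0, sigma_v^2) and
# u half-normal with scale sigma_u (production frontier).
te_half_normal <- function(eps, sigma_u, sigma_v) {
  sigma2     <- sigma_u^2 + sigma_v^2
  mu_star    <- -eps * sigma_u^2 / sigma2         # location of u | eps
  sigma_star <- sigma_u * sigma_v / sqrt(sigma2)  # scale of u | eps
  exp(-mu_star + 0.5 * sigma_star^2) *
    pnorm(mu_star / sigma_star - sigma_star) /
    pnorm(mu_star / sigma_star)
}

eps <- c(-0.6, -0.2, 0, 0.2)   # hypothetical composed residuals
te  <- te_half_normal(eps, sigma_u = 0.3, sigma_v = 0.2)
round(te, 3)
```

Efficiency rises with the residual but never reaches one, even for firms above the estimated frontier, because part of any positive residual is attributed to noise. As \(\sigma_u\) shrinks toward zero, every firm's estimated efficiency approaches one and the frontier collapses to ordinary least squares.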
28.3.5 Data Construction
We now build the firm-year panel from WRDS. The pipeline connects to the WRDS PostgreSQL server, pulls Total Q (a refined Tobin’s Q), the Compustat fundamentals needed for the input stocks, and the patent data needed for the R&D-output measures, and assembles them into a single panel. The replication is constrained by the WRDS U.S. patents coverage (2011–2019 for the citation files), so the post-2010 window is used for estimation; the original 1985–1994 study period would require building a name-matching algorithm against raw USPTO bulk data.
```r
library(RPostgres)
library(tidyverse)

# WRDS connection. Supply your own credentials via environment variables;
# never hard-code passwords into a reproducible script.
wrds <- dbConnect(
  Postgres(),
  host     = "wrds-pgdata.wharton.upenn.edu",
  port     = 9737,
  dbname   = "wrds",
  sslmode  = "require",
  user     = Sys.getenv("wrds_user"),
  password = Sys.getenv("wrds_pass")  # RPostgres names this argument `password`
)
```

Total Q provides the valuation outcome used in some replications below.
The fundamentals pull retrieves the income-statement and balance-sheet items needed to build the input stocks, joins industry classifications, and interpolates missing firm-years. Spline interpolation is used in preference to linear interpolation because the underlying series (advertising, R&D, sales) are smooth and trending; linear fill would introduce kinks that the frontier would misread as inefficiency.
Code
res <- dbSendQuery(
wrds,
"SELECT DISTINCT gvkey, fyear, conm, curcd, cogs, rect, revt, sale,
xad, xrd, xsga, ppegt, emp, act, xopr, xint, dlc,
xlr, uxintd, invfg
FROM comp_na_daily_all.funda
WHERE fyear >= 2000 AND gvkey IS NOT NULL AND
sale IS NOT NULL AND sale > 0 AND revt IS NOT NULL"
)
capability <- dbFetch(res, n = -1)
dbClearResult(res)
# Industry classifications.
res <- dbSendQuery(wrds,
"SELECT gvkey, gind, gsubind, naics, sic
FROM comp_na_daily_all.names")
ind <- dbFetch(res, n = -1)
dbClearResult(res)
capability <- capability |>
left_join(ind, by = join_by(gvkey)) |>
rename(year = fyear) |>
unique() |>
arrange(gvkey, year) |>
group_by(gvkey, year) |>
slice(1) |> # one record per firm-year
ungroup()
# Spline interpolation across the non-missing span of each series.
spline_interpolate <- function(x) {
nz <- which(!is.na(x))
if (length(nz) == 0) return(x)
first_non_na <- nz[1]
last_non_na <- nz[length(nz)]
x[first_non_na:last_non_na] <-
zoo::na.spline(x[first_non_na:last_non_na], na.rm = TRUE)
x
}
library(zoo)
capability <- capability |>
group_by(gvkey) |>
arrange(year, .by_group = TRUE) |>
fill(conm, curcd, gind, gsubind, naics, sic, .direction = "downup") |>
ungroup() |>
group_by(gvkey) |>
complete(year = min(year):max(year)) |> # fill gaps within firm
arrange(year, .by_group = TRUE) |>
fill(conm, curcd, gind, gsubind, naics, sic, .direction = "downup") |>
mutate(across(
c(xrd, xsga, xad, emp, ppegt, xint, act, invfg, xlr,
cogs, rect, revt, sale, xopr, dlc),
spline_interpolate
)) |>
ungroup() |>
arrange(gvkey, year)
capability |> write_rds(file.path("data", "capability", "capability.rds"))
The patent–firm linkage assigns patents to gvkeys, keeping for each patent the match with the highest WRDS confidence score and breaking ties deterministically. Annual patent counts per firm follow.
Code
res <- dbSendQuery(wrds,
"SELECT gvkey, link_bdate, patnum, wrds_score
FROM wrdsapps_patents.uspatents_gvkey_linking
WHERE gvkey IS NOT NULL")
wrdsapps_patents_link <- dbFetch(res, n = -1) |>
group_by(patnum) |> slice_max(order_by = wrds_score, n = 1) |> ungroup() |>
group_by(patnum) |> slice(1) |> ungroup()
dbClearResult(res)
res <- dbSendQuery(wrds,
"SELECT DISTINCT patnum, grantdate, cited_patnum, cited_pat_gdate, cite_type
FROM wrdsapps_patents.uspatents_citations")
wrdsapps_patents_citations <- dbFetch(res, n = -1)
dbClearResult(res)
patent_output <- wrdsapps_patents_link |>
mutate(year = year(link_bdate)) |>
group_by(gvkey, year) |>
summarise(pat_count = n(), .groups = "drop") |>
complete(gvkey, year = full_seq(year, 1)) |> # every year in the span, not just its endpoints
mutate(across(-c(gvkey, year), ~ replace_na(., 0)))
patent_output |> write_rds(file.path("data", "capability", "patent_output.rds"))
28.3.5.1 Innovativeness-adjusted output
Implementing the three-step innovativeness measure requires the industry citation averages at each SIC granularity. The following computes citations per patent per year, the industry averages at one- through four-digit SIC, the per-patent weights, and finally the firm-year sums.
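Before the full pipeline, the three steps can be traced by hand on a toy table. This is a minimal sketch under stated assumptions: the firms, patents, and citation counts are hypothetical, only the four-digit SIC level is shown, and base R replaces the tidyverse verbs.

```r
# Step 1 (given): citations received per patent in a year. Hypothetical data.
cites <- data.frame(
  gvkey      = c("A", "A", "B", "C"),
  patnum     = c("p1", "p2", "p3", "p4"),
  year       = 2015,
  sic        = c("2834", "2834", "2834", "3571"),
  cite_count = c(6, 2, 4, 5)
)

# Step 2: industry-year average citations per patent.
cites$ind_avg <- ave(cites$cite_count, cites$sic, cites$year, FUN = mean)

# Step 3: weight each patent by citations relative to its industry-year
# average, then sum the weights within firm-year.
cites$weight <- cites$cite_count / cites$ind_avg
tech_inv <- aggregate(weight ~ gvkey + year, data = cites, FUN = sum)
tech_inv
```

In SIC 2834 the average is 4 citations, so firm A's two patents carry weights 1.5 and 0.5 and sum to 2, while firm B's single average-quality patent scores 1. A patent count alone would have given the same 2-to-1 ranking here, but the two measures diverge as soon as citation quality differs from the industry norm.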
Code
# Citations received per patent per year, with firm and industry attached.
df_cite_patent_year <- wrdsapps_patents_citations |>
group_by(year = year(grantdate), patnum = cited_patnum) |>
summarise(cite_count = n(), .groups = "drop") |>
inner_join(wrdsapps_patents_link, by = join_by(patnum)) |>
filter(year(link_bdate) <= year) |> # patent predates the citing year
inner_join(ind |> select(gvkey, sic) |> na.omit() |> unique(),
by = "gvkey") |>
select(-c(link_bdate, wrds_score)) |>
unique()
# Industry-year average citations at each SIC granularity.
ind_avg_patent_sic <- df_cite_patent_year |>
group_by(year, sic) |>
summarize(ind_avg_year_patent_sic = mean(cite_count, na.rm = TRUE),
.groups = "drop")
ind_avg_patent_sic3 <- df_cite_patent_year |>
mutate(sic3 = substr(sic, 1, 3)) |>
group_by(year, sic3) |>
summarize(ind_avg_year_patent_sic3 = mean(cite_count, na.rm = TRUE),
.groups = "drop")
ind_avg_patent_sic2 <- df_cite_patent_year |>
mutate(sic2 = substr(sic, 1, 2)) |>
group_by(year, sic2) |>
summarize(ind_avg_year_patent_sic2 = mean(cite_count, na.rm = TRUE),
.groups = "drop")
ind_avg_patent_sic1 <- df_cite_patent_year |>
mutate(sic1 = substr(sic, 1, 1)) |>
group_by(year, sic1) |>
summarize(ind_avg_year_patent_sic1 = mean(cite_count, na.rm = TRUE),
.groups = "drop")
# Per-patent weights = citations / industry average; sum within firm-year.
tech_innv <- df_cite_patent_year |>
mutate(sic1 = substr(sic, 1, 1),
sic2 = substr(sic, 1, 2),
sic3 = substr(sic, 1, 3)) |>
inner_join(ind_avg_patent_sic, by = join_by(year, sic)) |>
inner_join(ind_avg_patent_sic3, by = join_by(year, sic3)) |>
inner_join(ind_avg_patent_sic2, by = join_by(year, sic2)) |>
inner_join(ind_avg_patent_sic1, by = join_by(year, sic1)) |>
mutate(
weight_year_patent_sic = cite_count / ind_avg_year_patent_sic,
weight_year_patent_sic3 = cite_count / ind_avg_year_patent_sic3,
weight_year_patent_sic2 = cite_count / ind_avg_year_patent_sic2,
weight_year_patent_sic1 = cite_count / ind_avg_year_patent_sic1
) |>
group_by(gvkey, year) |>
summarise(
tech_inv_patent = sum(weight_year_patent_sic, na.rm = TRUE),
tech_inv_patent_sic3 = sum(weight_year_patent_sic3, na.rm = TRUE),
tech_inv_patent_sic2 = sum(weight_year_patent_sic2, na.rm = TRUE),
tech_inv_patent_sic1 = sum(weight_year_patent_sic1, na.rm = TRUE),
.groups = "drop"
)
tech_innv |> write_rds(file.path("data", "capability", "tech_innv.rds"))
Code
tech_innv <- read_rds(file.path("data", "capability", "tech_innv.rds"))
# Firms with no patents in a year get a small positive value so that
# the log frontier is defined.
tech_innv <- tech_innv |>
complete(gvkey, year = min(tech_innv$year):max(tech_innv$year)) |>
mutate(across(-c(gvkey, year), ~ replace_na(., 0.1)))
28.3.5.2 Width-of-applicability output
The width measure counts, for each patent, the share of its citations that come from firms in a different SIC code, then weights by that share relative to the industry-average share.
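A small worked example may help fix the arithmetic. The sketch assumes hypothetical citation counts, takes the benchmark to be the industry-year mean share described above, and uses base R throughout.

```r
# Hypothetical cited patents: total citations and citations from other SIC codes.
pat <- data.frame(
  gvkey              = c("A", "A", "B"),
  sic                = "2834",
  year               = 2015,
  cite_count         = c(4, 10, 8),
  outside_cite_count = c(2, 2, 6)
)

# Share of each patent's citations that come from outside its own industry.
pat$prop <- pat$outside_cite_count / pat$cite_count

# Benchmark: mean outside share within the industry-year.
pat$mean_prop <- ave(pat$prop, pat$sic, pat$year, FUN = mean)

# Weight = share relative to benchmark; weighted output = weight * citations.
pat$weight   <- pat$prop / pat$mean_prop
pat$weighted <- pat$weight * pat$cite_count

tech_width <- aggregate(weighted ~ gvkey + year, data = pat, FUN = sum)
tech_width
```

Firm A collects more citations in total (14 against 8), yet firm B scores higher on width because three quarters of its citations come from outside its own industry.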
Code
tech_width_all <- wrdsapps_patents_citations |>
inner_join(
wrdsapps_patents_link |> select(-wrds_score) |> rename(gvkey_cite = gvkey),
by = join_by(patnum)) |>
select(-link_bdate) |>
inner_join(
wrdsapps_patents_link |> select(-wrds_score) |> rename(gvkey_cited = gvkey),
by = join_by(cited_patnum == patnum)) |>
filter(year(cited_pat_gdate) >= year(link_bdate)) |>
select(-link_bdate) |>
inner_join(ind |> select(gvkey, sic) |> rename(sic_cite = sic),
by = join_by(gvkey_cite == gvkey)) |>
inner_join(ind |> select(gvkey, sic) |> rename(sic_cited = sic),
by = join_by(gvkey_cited == gvkey)) |>
mutate(outside = if_else(sic_cite == sic_cited, 0, 1)) |>
unique()
df_cite_patent_year_outside <- tech_width_all |>
filter(outside == 1) |>
group_by(patnum = cited_patnum, year = year(grantdate),
gvkey = gvkey_cited, sic = sic_cited) |>
summarise(outside_cite_count = n(), .groups = "drop") |>
unique()
tech_width <- df_cite_patent_year_outside |>
inner_join(df_cite_patent_year, by = join_by(patnum, year, gvkey, sic)) |>
unique() |>
mutate(prop = outside_cite_count / cite_count) |>
group_by(year, sic) |> # benchmark is the industry-year average share
mutate(mean_prop = mean(prop)) |>
ungroup() |>
mutate(weight = prop / mean_prop, weighted_patent = weight * cite_count) |>
group_by(gvkey, year) |>
summarize(tech_width = sum(weighted_patent), .groups = "drop")
tech_width |> write_rds(file.path("data", "capability", "tech_width.rds"))
28.3.5.3 Historical patent output
To extend coverage backward, a historical patent file (1981–2012) drawn from the NBER/Searle dynamic-assignee data supplements the WRDS counts (see note 3). The two sources are reconciled into a single patent-count series, preferring the historical count where both exist.
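The reconciliation rule is simple enough to check on a toy example. The sketch below uses hypothetical counts and base R's `merge()`; the column names follow the pipeline (`pat_count` for the WRDS series, `patent_count` for the historical one).

```r
# Hypothetical firm-year counts from the two sources.
wrds_counts <- data.frame(
  gvkey = c("A", "A", "B"), year = c(2010, 2011, 2011), pat_count = c(3, 5, 2)
)
hist_counts <- data.frame(
  gvkey = c("A", "C"), year = c(2010, 2009), patent_count = c(4, 7)
)

# Full outer join keeps firm-years that appear in either source.
both <- merge(wrds_counts, hist_counts, by = c("gvkey", "year"), all = TRUE)

# Prefer the historical count, fall back to WRDS, and use 0 if neither exists.
both$pat_count <- ifelse(!is.na(both$patent_count), both$patent_count,
                  ifelse(!is.na(both$pat_count), both$pat_count, 0))
both$patent_count <- NULL
both
```

Firm A in 2010 appears in both sources and takes the historical value of 4; the remaining firm-years keep whichever count exists.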
Code
patent_output_hist <-
rio::import(file.path("data", "capability", "patents_conveyance.dta")) |>
select(h_assignee_code, grant_date, patentid) |>
na.omit() |>
inner_join(
rio::import(file.path("data", "capability", "dynass_nber_searle.dta")) |>
select(gvkey1, h_assignee_code) |> na.omit(),
by = join_by(h_assignee_code)) |>
mutate(year = year(grant_date)) |>
rename(gvkey = gvkey1) |>
group_by(gvkey, year) |>
summarise(patent_count = n(), .groups = "drop") |>
complete(gvkey, year = full_seq(year, 1)) |> # every year in the span, not just its endpoints
mutate(across(-c(gvkey, year), ~ replace_na(., 0)))
patent_output_hist |>
write_rds(file.path("data", "capability", "patent_output_hist.rds"))
Code
patent_output_hist <-
read_rds(file.path("data", "capability", "patent_output_hist.rds"))
patent_output_total <- patent_output |>
full_join(patent_output_hist, by = join_by(gvkey, year)) |>
mutate(pat_count = case_when(
!is.na(patent_count) ~ patent_count,
!is.na(pat_count) ~ pat_count,
TRUE ~ 0
)) |>
select(-patent_count) |>
arrange(gvkey, year) |>
filter(year >= 2000)
28.3.5.4 Building the input stocks
The final construction step applies the Koyck transformation of Equation 28.11 (with \(\lambda = 0.4\)) to each investment flow, producing the patent, technology, marketing, advertising, installed-base, and R&D stocks, and imputes the cost-of-capital and labor-cost inputs from industry–year means where firm-level values are missing. The cost of capital is proxied by interest expense over current debt (\(\text{xint}/\text{dlc}\)) and labor cost by per-employee staff expense (\(\text{xlr}/\text{emp}\)); both are filled hierarchically from four-digit down to one-digit SIC industry averages so that no firm-year is dropped for a single missing input.
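The geometric-lag stock can be written either as a discounted sum of all past flows or as a one-step recursion, in which the current stock equals the current flow plus \(\lambda\) times last year's stock. The sketch below checks that the two forms agree on a hypothetical advertising series; the function names are illustrative, and missing values are deliberately left out to keep the example short.

```r
# Recursive form: S_t = x_t + lambda * S_{t-1}, with S_1 = x_1.
koyck_recursive <- function(x, lambda) {
  unlist(Reduce(function(prev, cur) cur + lambda * prev, x, accumulate = TRUE))
}

# Direct form: S_t = sum over s <= t of lambda^(t - s) * x_s.
koyck_direct <- function(x, lambda) {
  sapply(seq_along(x), function(t) sum(x[1:t] * lambda^(t - seq_len(t))))
}

xad    <- c(100, 50, 0, 80)   # hypothetical advertising outlays
lambda <- 0.4

stock <- koyck_recursive(xad, lambda)
stock
# 100.0  90.0  36.0  94.4
```

The third year shows the carry-over at work: with no new spending, the stock falls to 40 percent of the previous year's 90. The recursive form needs one pass over the series, which matters for long panels.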
Code
lambda <- 0.4
df_cap <- capability |>
mutate(sic1 = substr(sic, 1, 1),
sic2 = substr(sic, 1, 2),
sic3 = substr(sic, 1, 3)) |>
filter(xad >= 0, xrd >= 0) |>
left_join(tech_innv, by = join_by(gvkey, year)) |>
left_join(tech_width, by = join_by(gvkey, year)) |>
left_join(patent_output_total, by = join_by(gvkey, year)) |>
mutate(
pat_count = if_else(is.na(pat_count), 0, pat_count),
dlc = if_else(dlc == 0, NA, dlc),
costofcapital = xint / dlc, # interest expense / current debt
emp = if_else(emp == 0, NA, emp),
xlr = if_else(xlr <= 0, NA, xlr),
laborcost = xlr / emp
)
# Hierarchical industry-year imputation of cost of capital and labor cost.
impute_grp <- function(df, ...) {
df |>
group_by(...) |>
mutate(
costofcapital = ifelse(is.na(costofcapital),
mean(costofcapital, na.rm = TRUE), costofcapital),
laborcost = ifelse(is.na(laborcost),
mean(laborcost, na.rm = TRUE), laborcost)
) |>
ungroup()
}
df_cap <- df_cap |>
impute_grp(sic, year) |> impute_grp(sic3, year) |>
impute_grp(sic2, year) |> impute_grp(sic1, year)
# Koyck geometric-lag stocks (Equation 28.11) for each investment flow.
koyck_stock <- function(x, lambda) {
map_dbl(seq_along(x), ~ {
if (all(is.na(x[1:.x]))) NA_real_
else sum(x[1:.x] * lambda ^ (.x - seq_along(x[1:.x])), na.rm = TRUE)
})
}
df_cap <- df_cap |>
group_by(gvkey) |>
arrange(year, .by_group = TRUE) |>
mutate(
pat_stock = koyck_stock(pat_count, lambda),
techbase_innv = koyck_stock(tech_inv_patent, lambda),
techbase_width= koyck_stock(tech_width, lambda),
marstock = koyck_stock(xsga, lambda),
adstock = koyck_stock(xad, lambda),
installedbase = koyck_stock(sale, lambda),
rdstock = koyck_stock(xrd, lambda)
) |>
ungroup()
df_cap |> write_rds(file.path("data", "capability", "df_cap.rds"))
28.3.6 Estimating the Frontiers: Dutta, Narasimhan, and Rajiv (1999) Replication
With the panel built, the three frontiers of Equation 28.12 are estimated with the frontier package. Each capability is estimated twice—once with the innovativeness-adjusted technology measure and once with the width-adjusted measure—so that the robustness of the recovered efficiencies to the output-quality definition can be assessed. The marketing frontier is a production frontier (maximize sales), the operations frontier is a cost frontier (minimize cost of goods sold), and the recovered marketing efficiency is fed forward into the R&D and operations frontiers, operationalizing the interdependence of capabilities.
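To make concrete what the estimator recovers, the sketch below fits a production frontier by maximum likelihood on simulated data using only base R. It is a simplified cross-sectional analogue, not the specification estimated in this section: it assumes a normal noise term and a half-normal inefficiency term, has a single input, and ignores the panel structure and time effects. All variable names and parameter values are hypothetical. Efficiency is scored as the conditional expectation of exp(-u) given the composed residual.

```r
set.seed(1)
n <- 2000
log_ad   <- rnorm(n, mean = 3, sd = 1)     # hypothetical log input
v        <- rnorm(n, sd = 0.2)             # symmetric noise
u        <- abs(rnorm(n, sd = 0.4))        # one-sided inefficiency
log_sale <- 1 + 0.6 * log_ad + v - u       # production frontier: u lowers output

# Negative log-likelihood of the normal/half-normal composed error eps = v - u.
negll <- function(par) {
  sigma_v <- exp(par[3]); sigma_u <- exp(par[4])   # log scale keeps both positive
  sigma   <- sqrt(sigma_v^2 + sigma_u^2)
  eps     <- log_sale - par[1] - par[2] * log_ad
  -sum(log(2) - log(sigma) + dnorm(eps / sigma, log = TRUE) +
         pnorm(-eps * (sigma_u / sigma_v) / sigma, log.p = TRUE))
}

# OLS coefficients serve as starting values for the frontier parameters.
start <- c(coef(lm(log_sale ~ log_ad)), log(0.3), log(0.3))
fit   <- optim(start, negll, method = "BFGS")

beta_hat <- fit$par[1:2]
sv <- exp(fit$par[3]); su <- exp(fit$par[4])

# Conditional distribution of u given eps is truncated normal (mu_star, sig_star).
eps      <- log_sale - beta_hat[1] - beta_hat[2] * log_ad
mu_star  <- -eps * su^2 / (sv^2 + su^2)
sig_star <- su * sv / sqrt(sv^2 + su^2)

# Technical efficiency: E[exp(-u) | eps], a number in (0, 1).
eff <- exp(-mu_star + sig_star^2 / 2) *
  pnorm(mu_star / sig_star - sig_star) / pnorm(mu_star / sig_star)

round(c(slope = unname(beta_hat[2]), sigma_v = sv, sigma_u = su), 3)
summary(eff)
```

For a cost frontier the inefficiency term raises the outcome, so the composed error is v + u; in this sketch that amounts to replacing `eps` by `-eps` in both the likelihood and the efficiency formula.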
Code
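# Expensive: six stochastic-frontier MLE fits (tens of minutes). Like every other
# frontier block in this chapter, the result is pre-computed and saved to
# df_cap_panel_dutta.rds below, then read back by the analysis chunks.
# Set eval: true to re-estimate from scratch.
library(frontier)
library(plm) # provides pdata.frame()
# Panel declaration, assumed to mirror the alternative specifications in Section 28.3.7.
df_cap_panel_dutta <- df_cap |> pdata.frame(c("gvkey", "year"))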
# Marketing capability: production frontier for log sales.
cap_mar_innv <- frontier::sfa(
log(sale) ~ log(adstock) + log(marstock) + log(techbase_innv) +
log(rect) + log(installedbase) | sic1 + factor(year),
ineffDecrease = TRUE, # maximize: inefficiency lowers output
timeEffect = TRUE,
data = df_cap_panel_dutta
)
df_cap_panel_dutta$mar_eff_innv <-
frontier::efficiencies(cap_mar_innv, asInData = TRUE)
cap_mar_width <- frontier::sfa(
log(sale) ~ log(adstock) + log(marstock) + log(techbase_width) +
log(rect) + log(installedbase) | sic1 + factor(year),
ineffDecrease = TRUE, timeEffect = TRUE, data = df_cap_panel_dutta
)
df_cap_panel_dutta$mar_eff_width <-
frontier::efficiencies(cap_mar_width, asInData = TRUE)
# R&D capability: marketing efficiency enters as an input (interdependence).
cap_rd_innv <- sfa(
log(tech_inv_patent) ~ log(techbase_innv) + log(rdstock) +
log(mar_eff_innv) + log(mar_eff_innv) * log(rdstock) |
sic1 + factor(year),
timeEffect = TRUE, ineffDecrease = TRUE, data = df_cap_panel_dutta
)
df_cap_panel_dutta$rd_eff_innv <-
frontier::efficiencies(cap_rd_innv, asInData = TRUE)
cap_rd_width <- sfa(
log(tech_width) ~ log(techbase_width) + log(rdstock) +
log(mar_eff_width) + log(mar_eff_width) * log(rdstock) |
sic1 + factor(year),
timeEffect = TRUE, ineffDecrease = TRUE, data = df_cap_panel_dutta
)
df_cap_panel_dutta$rd_eff_width <-
frontier::efficiencies(cap_rd_width, asInData = TRUE)
# Operations capability: cost frontier for log COGS (minimize).
cap_op_innv <- sfa(
log(cogs) ~ log(invfg) + log(laborcost) + log(costofcapital) +
log(techbase_innv) + log(mar_eff_innv) | sic1 + factor(year),
timeEffect = TRUE, ineffDecrease = FALSE, # minimize: inefficiency raises cost
data = df_cap_panel_dutta
)
df_cap_panel_dutta$op_eff_innv <-
frontier::efficiencies(cap_op_innv, asInData = TRUE)
cap_op_width <- sfa(
log(cogs) ~ log(invfg) + log(laborcost) + log(costofcapital) +
log(techbase_width) + log(mar_eff_width) | sic1 + factor(year),
timeEffect = TRUE, ineffDecrease = FALSE, data = df_cap_panel_dutta
)
df_cap_panel_dutta$op_eff_width <-
frontier::efficiencies(cap_op_width, asInData = TRUE)
df_cap_panel_dutta |>
write_rds(file.path("data", "capability", "df_cap_panel_dutta.rds"))
The recovered efficiencies can then be correlated with each other and with sales to check that they behave as capability measures should: positively associated with performance and only moderately correlated with one another, since a firm strong in marketing need not be strong in R&D.
An alternative implementation uses the sfa package in place of frontier; the two agree up to numerical tolerance.
28.3.7 Alternative Capability Specifications
The Dutta frontier is one of several closely related specifications, and the panel built above supports the others with minor changes to the input set. We give three published variants, each differing in how output is measured and which stocks enter the frontier; the substantive payoff is that the recovered capability efficiencies are reassuringly correlated across specifications, lending the construct convergent validity.
28.3.7.1 Saboo, Kumar, and Anand (2017)
This specification uses lagged sales as the installed-base proxy and patent counts (rather than citation-weighted output) for the R&D frontier; the operations frontier is a standard cost frontier in operating expense. It is computationally heavy because of the lag structure.
Code
library(frontier)
library(plm) # provides pdata.frame()
df_cap_panel_saboo <- df_cap |>
group_by(gvkey) |> arrange(year) |>
mutate(sale_t_1 = dplyr::lag(sale, n = 1)) |>
ungroup() |>
pdata.frame(c("gvkey", "year"))
cap_mar <- sfa(
log(sale) ~ log(xsga) + log(rect) + log(sale_t_1) | sic1 + factor(year),
ineffDecrease = TRUE, timeEffect = TRUE, data = df_cap_panel_saboo
)
df_cap_panel_saboo$mar_eff <- efficiencies(cap_mar, asInData = TRUE)
cap_rd <- sfa(
log(pat_count) ~ log(rdstock) + log(pat_stock) | sic1 + factor(year),
timeEffect = TRUE, ineffDecrease = TRUE, data = df_cap_panel_saboo
)
df_cap_panel_saboo$rd_eff <- efficiencies(cap_rd, asInData = TRUE)
cap_op <- sfa(
log(xopr) ~ log(act) + log(ppegt) + log(emp) | sic1 + factor(year),
timeEffect = TRUE, ineffDecrease = FALSE, data = df_cap_panel_saboo
)
df_cap_panel_saboo$op_eff <- efficiencies(cap_op, asInData = TRUE)
28.3.7.2 Elhelaly and Ray (2023)
This variant (with Koyck weight 0.5) enriches each frontier with three-year accumulated stocks. The marketing frontier adds both advertising and its stock, both marketing expense and its stock, and both current and three-year receivables; the R&D frontier uses three-year patent and R&D accumulations; and the operations frontier is a cost frontier:
\[ \begin{aligned} \log(\text{sales}) &= \log(\text{xad}) + \log(\text{adstock}) + \log(\text{xsga}) + \log(\text{marstock}) \\ &\quad + \log(\text{rect}) + \log(\text{rec}) + \log(\text{installedbase}), \end{aligned} \tag{28.14}\]
where \(\text{rec}\) is accounts receivable summed over the three years prior;
\[ \log(\text{patent}) = \log(\text{patstock}) + \log(\text{xrd}) + \log(\text{accumrd}), \tag{28.15}\]
where \(\text{patstock}\) is the three-year patent total, \(\text{xrd}\) is total R&D expense, and \(\text{accumrd}\) is the three-year R&D total; and
\[ \log(\text{cogs}) = \log(\text{output}) + \log(\text{laborcost}) + \log(\text{costofcapital}), \tag{28.16}\]
where \(\text{output}\) is the dollar value of output, \(\text{laborcost}\) is per-employee wages and benefits, and \(\text{costofcapital}\) is the average long-term interest rate.
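The three-year accumulations sum the three prior years within each firm, so the first three firm-years are undefined and values must never leak across firms. A minimal base-R sketch on a hypothetical two-firm panel shows the construction; `lag_n()` and `lag3_sum()` are illustrative helpers, not functions from the pipeline.

```r
# Shift a vector back by n periods, padding the start with NA.
lag_n <- function(x, n) c(rep(NA, n), x)[seq_along(x)]

# Sum of the three prior periods (excludes the current one).
lag3_sum <- function(x) lag_n(x, 1) + lag_n(x, 2) + lag_n(x, 3)

# Hypothetical R&D outlays for two firms over five years.
panel <- data.frame(
  gvkey = rep(c("A", "B"), each = 5),
  year  = rep(2001:2005, times = 2),
  xrd   = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)

# Sort by firm and year, then apply the lagged sum firm by firm.
panel <- panel[order(panel$gvkey, panel$year), ]
panel$accumrd <- ave(panel$xrd, panel$gvkey, FUN = lag3_sum)
panel
```

Firm B's first three years are missing rather than filled from firm A's last years, which is the property the grouping is there to guarantee.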
Code
df_cap_panel_elhelaly <- df_cap |>
select(gvkey, year, contains("sic"), sale, xad, adstock, invfg, xsga,
marstock, rect, installedbase, pat_count, pat_stock, xrd, cogs,
laborcost, costofcapital) |>
group_by(gvkey) |> arrange(gvkey, year) |>
mutate(
rec = dplyr::lag(rect, 1) + dplyr::lag(rect, 2) + dplyr::lag(rect, 3),
patstock = dplyr::lag(pat_count, 1) + dplyr::lag(pat_count, 2) + dplyr::lag(pat_count, 3),
accumrd = dplyr::lag(xrd, 1) + dplyr::lag(xrd, 2) + dplyr::lag(xrd, 3)
) |>
pdata.frame(c("gvkey", "year"))
Code
library(frontier)
cap_mar <- frontier::sfa(
log(sale) ~ log(xad) + log(adstock) + log(xsga) + log(marstock) +
log(rect) + log(rec) + log(installedbase) | sic1 + factor(year),
ineffDecrease = TRUE, timeEffect = TRUE, data = df_cap_panel_elhelaly
)
df_cap_panel_elhelaly$mar_eff <-
frontier::efficiencies(cap_mar, asInData = TRUE)
cap_rd <- sfa(
log(pat_count) ~ log(patstock) + log(xrd) + log(accumrd) | sic1 + factor(year),
timeEffect = TRUE, ineffDecrease = TRUE, data = df_cap_panel_elhelaly
)
df_cap_panel_elhelaly$rd_eff <-
frontier::efficiencies(cap_rd, asInData = TRUE)
cap_op <- sfa(
log(cogs) ~ log(invfg) + log(laborcost) + log(costofcapital) | sic1 + factor(year),
timeEffect = TRUE, ineffDecrease = FALSE, data = df_cap_panel_elhelaly
)
df_cap_panel_elhelaly$op_eff <-
frontier::efficiencies(cap_op, asInData = TRUE)
df_cap_panel_elhelaly |>
write_rds(file.path("data", "capability", "df_cap_panel_elhelaly.rds"))
Code
df_cap_panel_elhelaly <-
read_rds(file.path("data", "capability", "df_cap_panel_elhelaly.rds"))
df_cap_panel_elhelaly |> select(contains("eff")) |> na.omit() |> cor()
#> mar_eff rd_eff op_eff
#> mar_eff 1.00000000 -0.02846301 -0.05873300
#> rd_eff -0.02846301 1.00000000 -0.04156963
#> op_eff -0.05873300 -0.04156963 1.00000000
28.3.7.3 Cao, Feng, and Wiles (2023)
This specification uses a valuation outcome—Total Q in place of Tobin’s Q—for the marketing frontier and a lagged-R&D, lagged-patent-stock structure for the R&D frontier:
\[ \log(\text{Total } Q_t) = \log(\text{xsga}_t) + \log(\text{xsga}_{t-1}) + \log(\text{xad}) + \log(\text{pat}), \tag{28.17}\]
\[ \log(\text{pat}) = \log(\text{xrd}_t) + \log(\text{xrd}_{t-1}) + \log(\text{patstock}_{t-1}), \tag{28.18}\]
where \(\text{pat}\) is the number of patents.
Code
# `totalq` is the firm-year Total Q extract (gvkey, year, q_tot); it is assumed to be in memory.
df_cap_panel_cao <- df_cap |>
left_join(totalq, by = join_by(gvkey, year)) |>
select(gvkey, year, contains("sic"), q_tot, xsga, xad,
pat_count, xrd, pat_stock) |>
group_by(gvkey) |> arrange(gvkey, year) |>
mutate(
xsga_t_1 = dplyr::lag(xsga, n = 1),
xrd_t_1 = dplyr::lag(xrd, n = 1),
pat_stock_t_1 = dplyr::lag(pat_stock, n = 1)
) |>
pdata.frame(c("gvkey", "year"))
Code
library(frontier)
cap_mar <- frontier::sfa(
log(q_tot) ~ log(xsga) + log(xsga_t_1) + log(xad) + log(pat_count) |
sic1 + factor(year),
ineffDecrease = TRUE, timeEffect = TRUE, data = df_cap_panel_cao
)
df_cap_panel_cao$mar_eff <-
frontier::efficiencies(cap_mar, asInData = TRUE)
cap_rd <- sfa(
log(pat_count) ~ log(xrd) + log(xrd_t_1) + log(pat_stock_t_1) |
sic1 + factor(year),
timeEffect = TRUE, ineffDecrease = TRUE, data = df_cap_panel_cao
)
df_cap_panel_cao$rd_eff <-
frontier::efficiencies(cap_rd, asInData = TRUE)
df_cap_panel_cao |>
write_rds(file.path("data", "capability", "df_cap_panel_cao.rds"))
28.3.8 The Broader Marketing-Capability Literature
The frontier-estimation tradition above is one strand of a much larger literature that decomposes marketing capability into functional sub-capabilities and links each to performance. The foundational input weights used in the stocks above—0.5 for marketing expenditure and 0.4 for R&D—come from Dutta, Narasimhan, and Rajiv (2005), and the brand/advertising-stock weighting from Peles (1971) and Z. Wang and Kim (2017). Building on these primitives, scholars have isolated specialized capabilities: marketing capability broadly (Bahadir, Bharadwaj, and Srivastava 2008; Xiong and Bharadwaj 2013; Wiles, Morgan, and Rego 2012; Mishra and Modi 2016; Dinner, Kushwaha, and Steenkamp 2018), marketing alliance capability (Swaminathan and Moorman 2009), digitized selling capability (D. S. Johnson 2005), big-data capability (J. S. Johnson, Friend, and Lee 2017), social-CRM capability (Trainor et al. 2014), and customer-relationship capability (Z. Wang and Kim 2017).
The single most influential disaggregation is Morgan, Slotegraaf, and Vorhies (2009), who link three core marketing capabilities to profit growth and uncover a tension invisible to coarser analyses. Although profit growth is a primary driver of stock price, the mechanism by which marketing capabilities feed it was poorly understood. Using a cross-industry sample of 114 firms, Morgan, Slotegraaf, and Vorhies (2009) decompose marketing capability into market sensing, brand management, and customer relationship management (CRM), and decompose profit growth into revenue growth and margin growth. The capabilities exert both direct and synergistic effects on the two growth components—but, critically, brand-management and CRM capabilities can counteract each other across the components: a capability that lifts revenue growth may simultaneously depress margin growth, and vice versa. A surface-level analysis that examines only aggregate profit growth would miss these offsetting effects and mis-state the true relationship between capabilities and performance. The methodological moral reinforces the theme of this chapter: the level of aggregation at which a metric is defined determines what relationships it can reveal.
A generic frontier implementation, parameterized by a sub_market grouping, ties the strands together; the sfaR package offers a cross-sectional alternative when the panel structure is unavailable.
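One way such a parameterized implementation might look is sketched below. To stay self-contained it swaps in a much simpler estimator, corrected ordinary least squares (COLS), for the stochastic frontier: OLS is fitted within each group and the fitted line is shifted up until it envelops the data, so efficiency is the exponentiated gap to the best performer in the group. COLS is deterministic and attributes all deviation to inefficiency, so it illustrates the grouping logic rather than replacing the maximum-likelihood fits above. The function name, the toy data, and the `sub_market` values are hypothetical.

```r
# Fit a COLS frontier separately within each level of `group_col`
# and return the data with an efficiency score in (0, 1].
fit_cols_by_group <- function(data, formula, group_col) {
  pieces <- lapply(split(data, data[[group_col]]), function(d) {
    fit <- lm(formula, data = d, na.action = na.exclude)
    res <- residuals(fit)                      # NA-padded to match rows of d
    d$eff <- exp(res - max(res, na.rm = TRUE)) # best performer in group scores 1
    d
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Hypothetical data: two sub-markets with different frontier intercepts.
set.seed(42)
toy <- data.frame(
  sub_market = rep(c("retail", "software"), each = 50),
  log_xsga   = rnorm(100, mean = 4, sd = 1)
)
toy$log_sale <- ifelse(toy$sub_market == "retail", 2, 1) +
  0.7 * toy$log_xsga - abs(rnorm(100, sd = 0.3))

scored <- fit_cols_by_group(toy, log_sale ~ log_xsga, "sub_market")
tapply(scored$eff, scored$sub_market, summary)
```

Because the frontier is estimated group by group, a firm is benchmarked only against peers in its own sub-market; pooling the two groups would instead penalize every software firm for the lower intercept of its industry.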
28.3.8.1 Digital, Social-Media, and Organizational Capabilities
The capability lens has been extended to the digital domain. Survey-based studies of digital marketing capability (F. Wang 2020; Homburg and Wielgos 2022) and a synthesizing review (Herhausen et al. 2020) map the organizational routines through which firms convert digital assets into performance, while Nguyen et al. (2015) treat social-media strategic capability as a distinct competence.
A complementary line generalizes from functional capabilities to the embeddedness of capabilities in the organization. Grewal and Slotegraaf (2007) argue that managers must deploy scarce resources to build durable capabilities, and that neglecting the underlying processes obscures how capabilities translate into competitive advantage. Their central construct is capability embeddedness—the depth at which a capability is ingrained in the organization, itself a consequence of managerial resource-allocation decisions. Methodologically, they introduce a hierarchical composed-error framework (a multilevel generalization of the stochastic frontier) that applies to cross-sectional and panel data alike, and they show in a retailing application that capability embeddedness directly improves performance even after controlling for tangible and intangible resources. The framework’s payoff is diagnostic: recognizing whether the objectives of different capabilities are convergent or divergent tells a manager whether deepening embeddedness will amplify or undercut firm performance—an organizational echo of the revenue/margin tension in Morgan, Slotegraaf, and Vorhies (2009).
28.4 Management Metrics
A final, smaller family of metrics characterizes the firm’s strategy and leadership, which moderate how marketing investments convert into performance. Two constructs have proven especially measurable from archival data. Founding strategy, and specifically the degree of market differentiation a firm adopts at founding, can be recovered from the language of its early communications and product positioning; Guzman and Li (2023) develop a scalable measure of this founding differentiation. CEO overconfidence—a behavioral trait with consequences for innovation and investment—is measured from executives’ revealed reluctance to exercise in-the-money stock options, following the option-based approach Galasso and Simcoe (2011) apply to study innovation. Both illustrate the chapter’s recurring method: a psychological or strategic construct, once thought to require surveys or inside access, recovered at scale from the residue firms leave in archival and market data.
28.5 Key Takeaways
A metric is a definition plus an identification argument. The financial measures of Section 28.1 are easy to compute and hard to interpret causally: ROI and ROMI summarize an effect whose incremental numerator is itself the object of inference, and EVA and MVA price value creation only as well as invested capital and market expectations are measured. The accounting covariates are largely settled in their construction—Table 28.1 and Table 28.2 catalogue the conventions—but their role is to neutralize confounding, and they earn that role only when chosen to match the marketing decision under study. The marketing constructs of Section 28.2 exemplify the field’s methodological frontier: sentiment, purchase intention, trust, and reputation, once the province of surveys, are now recovered from text and images by pre-trained models whose accuracy and biases must themselves be validated. Capability, the subject of Section 28.3, is the hardest case and the template for the rest: a latent efficiency, identified only under explicit distributional and exogeneity assumptions, whose credibility rests entirely on the care taken in building the input stocks and specifying the frontier. Across all three families, the lesson is the same—report the metric, but report the assumptions that make it mean what you claim.
1. The item definitions follow the WRDS data-items documentation (wrds_data_items). Items prefixed in CRSP/Compustat Merged (e.g., PRCC_C, CSHO) are drawn from the merged fundamentals files; market-price and shares-outstanding items used for market capitalization come from CRSP.
2. The fine-tuned purchase-intention model is distributed publicly as a large RoBERTa checkpoint, allowing researchers to score new English-language text for expressed purchase intent without retraining.
3. The two large input files (dynass_nber_searle.dta and patents_conveyance.dta) are too large to redistribute here but are publicly archived; the code reconstructs the firm-year patent counts from them.