Businesses routinely rely on econometric models to analyze and predict consumer behavior. Estimating such models may require combining a firm’s internal data with external datasets to account for sample selection, missing observations, omitted variables, and measurement errors in the existing data source. In this paper we point out that these data problems can be addressed by estimating econometric models from combined data using data mining techniques under mild assumptions regarding the data distribution. However, data combination poses serious threats to the security of consumer data: we demonstrate that point identification of an econometric model from combined data is incompatible with restrictions on the risk of individual disclosure. Consequently, if a consumer model is point identified, the firm would (implicitly or explicitly) reveal the identity of at least some of the consumers in its internal data. More importantly, we argue that unless the firm restricts the individual disclosure risk when combining data, an adversary or a competitor can gather confidential information about some individuals from the estimated model, even if the raw combined dataset is not shared with a third party.
Identification, data combination and the risk of disclosure
20 December 2011
Working Paper (CWP38/11)