AAAI 2023 The Economics of Data and Machine Learning

February 8th, 2023, 8:30am - 12:30pm
At the 37th AAAI Conference on Artificial Intelligence, Washington, DC.

Organizers: Haifeng Xu (UChicago), Shuran Zheng (CMU) and James Zou (Stanford).


The past decade has witnessed the significant success of large machine learning (ML) models, and access to massive data has played a key role in this success. As the popular saying goes, “data are the new oil” that powers the engine of ML. Competition in generating, buying, and selling data has therefore become a growing trend in key industrial sectors such as IT, finance, and biomedicine. However, in contrast to the extensive research on developing data-driven machine learning models, research on data itself (e.g., how to elicit, evaluate, exchange, and price data) has been far less explored and, in fact, began only recently. A principled understanding of data itself is crucial for facilitating the creation, dissemination, and usage of data.

Our goal for this tutorial is to bring you up to speed on recent studies of the value of data from both statistical and economic perspectives, how to effectively price data or information, and how to collect data from economic agents. For each aspect, we will describe fundamental concepts, the current state of research or key results, and open questions.


Following a brief introduction, the tutorial is divided into three parts, each presented from the perspective of one of three key stakeholders in this domain: data buyers, data sellers, and ML vendors. Part one takes a data buyer's perspective and covers statistical and economic methods for modeling the value of data. Part two takes a data seller's perspective and discusses recent work on how to price data and how to collect data by aggregating information from a population. Finally, part three discusses the market and competition among ML vendors.

Remarks. Throughout the tutorial, we will view information as a special type of data (i.e., distilled data with direct usefulness or insights) and thus will use the two terms interchangeably. This tutorial does not assume prior knowledge beyond basic mathematics and probability.


8:30am - 8:45am Part 0: Introduction (by James)
Quick overview of existing data markets, challenges and opportunities.
8:45am - 9:30am Part 1A: Buyer Side -- Statistical Modeling for Data Valuation (by James)
Statistical models for valuing the contributions of data to an ML model, and their applications to denoising, fairness, and active learning
9:30am - 10:00am Part 1B: Buyer Side -- Economic Modeling for Value of Information (by Haifeng)
Measures for quantifying the economic value of information (i.e., distilled data), and characteristic properties of these measures
10:00am - 10:15am Short break
10:15am - 10:55am Part 2A: Seller Side -- Optimal Pricing of Information (by Haifeng)
Recent studies on designing pricing mechanisms for selling information in structured or generic setups, and for selling raw data to a machine learner
10:55am - 11:45am Part 2B: Seller Side -- Data Valuation with Peer Prediction (by Shuran)
Mechanisms for eliciting truthful data from economic agents and avoiding potential data manipulations
11:45am - 12:00pm Short break
12:00pm - 12:20pm Part 3: ML-as-a-Service Market (by James)
Competition among ML-as-a-service vendors, and the opportunities (e.g., arbitrage by a third party) and challenges that arise from such competition
12:20pm - 12:30pm Discussions, Q&A