Motivation
The past decade has witnessed the significant success of large machine learning (ML) models, and access to massive data has played a key role in this success. As the popular saying goes, “data are the new oil” that powers the engine of ML. Competition in generating, buying, and selling data is therefore a growing trend in key industrial sectors such as IT, finance, and biomedicine. However, in contrast to the extensive research on developing data-driven machine learning models, research on data itself (e.g., how to elicit, evaluate, exchange, and price data) has been far less explored and, in fact, has only begun recently. A principled understanding of data itself is crucial for facilitating the creation, circulation, and use of data.
Our goal for this tutorial is to bring you up to speed on recent studies of the value of data from both statistical and economic perspectives, how to effectively price data or information, and how to collect data from economic agents. For each aspect, we will describe fundamental concepts, the current state of the field or key results, and open questions.
Overview
Following a brief introduction, the tutorial is divided into three parts, each presented from the perspective of one of the three key stakeholders in this domain: data buyers, data sellers, and ML vendors. Part one takes a data buyer's perspective and covers statistical and economic methods for modeling the value of data. Part two takes a data seller's perspective and discusses recent work on how to price data and how to collect data by aggregating information from a population. Finally, part three discusses the market and competition among ML vendors.
Remarks. Throughout the tutorial, we will view information as a special type of data -- i.e., distilled data with direct usefulness/insights -- and thus will use the two terms interchangeably. This tutorial does not assume prior knowledge beyond basic mathematics and probability.
Schedule
8:30am - 8:45am | Part 0: Introduction (by James) | |
Quick overview of existing data markets, challenges and opportunities. | ||
8:45am - 9:30am | Part 1A: Buyer Side -- Statistical Modeling for Data Valuation (by James) | |
Statistical models for valuing the contributions of data to an ML model, and their applications to denoising, fairness and active learning | ||
9:30am - 10:00am | Part 1B: Buyer Side -- Economic Modeling for Value of Information (by Haifeng) | |
Measures for quantifying the economic value of information (i.e., distilled data), and characteristic properties of these measures | ||
10:00am - 10:15am | Short break | |
10:15am - 10:55am | Part 2A: Seller Side -- Optimal Pricing of Information (by Haifeng) | |
Recent studies on designing pricing mechanisms for selling information in structured or generic setups, and for selling raw data to a machine learner | ||
10:55am - 11:45am | Part 2B: Seller Side -- Data Valuation with Peer Prediction (by Shuran) | |
Mechanisms for eliciting truthful data from economic agents and avoiding potential data manipulations | ||
11:45am - 12:00pm | Short break | |
12:00pm - 12:20pm | Part 3: ML-as-service market (by James) | |
Competition among ML-as-service vendors, and the opportunities (e.g., arbitrage by a third party) and challenges that arise from such competition | ||
12:20pm - 12:30pm | Discussions, Q&A | |