We propose a method to assess the sensitivity of data analyses to the removal of a small fraction of the data set. Analyzing all possible data subsets of a certain size is computationally prohibitive, so we provide a finite-data metric to approximately compute the number (or fraction) of observations that, when dropped, has the greatest influence on a given result. We call the resulting metric the Approximate Maximum Influence Perturbation. Our approximation is automatically computable and works for common machine learning methods.
At minimal computational cost, our metric provides an exact lower bound on sensitivity, so any non-robustness it finds is conclusive. We demonstrate that the Approximate Maximum Influence Perturbation is driven by the signal-to-noise ratio in a data-analysis problem, is not reflected in standard errors, does not disappear as the number of data points grows, and is not a product of misspecification. Our applications focus on econometric analyses. Several empirical applications show that even 2-parameter linear regression analyses of randomized trials can be highly sensitive: while some applications are robust, in others the sign of a treatment effect can be reversed by dropping less than 0.1% of the data, even when standard errors are small. In one case, we identify a single data point, out of more than 16,500, whose removal changes the conclusions of the analysis.
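To make the recipe concrete, here is a minimal sketch of the idea for a 2-parameter linear regression on synthetic data, using the standard first-order (influence-function) approximation for OLS: rank observations by the approximate change that dropping each one would cause in the treatment coefficient, remove the top-ranked points, and refit exactly to check for a sign flip. This is an illustrative sketch, not the authors' implementation; the data-generating values and helper names (e.g. refit_without) are invented for the example.

```python
import numpy as np

# Synthetic randomized-trial data; the sample size and effect size here
# are invented for illustration, not taken from the paper.
rng = np.random.default_rng(0)
n, true_effect = 1_000, 0.05
treat = rng.integers(0, 2, size=n).astype(float)   # binary treatment
y = true_effect * treat + rng.standard_normal(n)   # noisy outcome

# Two-parameter OLS: y ~ intercept + treatment.
X = np.column_stack([np.ones(n), treat])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ (X.T @ y)
resid = y - X @ beta

# Standard first-order approximation to the change in the treatment
# coefficient beta[1] if observation i is dropped:
#     delta_i ~= -[(X'X)^{-1} x_i]_1 * e_i
approx_change = -(X @ XtX_inv)[:, 1] * resid

# To push a positive coefficient toward a sign flip, drop the points
# with the most negative approximate change first (and vice versa).
order = np.argsort(approx_change) if beta[1] > 0 else np.argsort(-approx_change)

def refit_without(idx):
    """Exact OLS refit with the observations in idx removed."""
    keep = np.ones(n, dtype=bool)
    keep[idx] = False
    return np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]

# The ranking is approximate, but the refit is exact: if beta[1] changes
# sign for some k, that k-point subset is a concrete witness of
# non-robustness.
for k in (1, 5, 10, 25, 50):
    beta_k = refit_without(order[:k])
    print(f"drop {k:3d}/{n}: treatment effect {beta[1]:+.4f} -> {beta_k[1]:+.4f}")
```

The exact refit is what makes a flagged flip conclusive: even though the ranking is only a first-order approximation, the refit exhibits one explicit subset whose removal reverses the result, which is why the metric yields an exact lower bound on sensitivity rather than just an estimate.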
About First Friday Lunches
First Friday Lunches are informal gatherings open to CSAIL Alliances members and students who would like to attend a discussion of a current project. Most virtual lunches will feature a faculty researcher, but a few will feature a PhD student, postdoc, or research scientist.