What are the most important statistical ideas of the past 50 years?
My summary:
Andrew Gelman, a statistics professor at Columbia University, and Aki Vehtari, a computer science professor at Aalto University, look back at the top statistical ideas of the past half-century.
Here is their list:
- Counterfactual Causal Inference
- Bootstrapping and Simulation-Based Inference
- Overparameterized Models and Regularization
- Multilevel, or Hierarchical, Models
- Generic Computation Algorithms
- Adaptive Decision Analysis
- Robust Inference
- Exploratory Data Analysis
Connection to Machine Learning and Deep Learning
One common thread among many of these ideas is how they take advantage of the advances in computing over the past 50 years. Simulation-based methods like bootstrapping, which would be hopelessly tedious by hand, become practical when a computer can quickly rerun essentially the same analysis thousands of times.
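To make that concrete, here is a minimal bootstrap sketch in Python (assuming NumPy is available; the data and resample count are invented for illustration): resampling the observed data with replacement approximates the sampling distribution of a statistic, which in turn yields a confidence interval.

```python
# A minimal bootstrap sketch: estimate a 95% confidence interval
# for the mean by resampling the data with replacement many times.
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=100)  # hypothetical sample

n_resamples = 10_000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    resample = rng.choice(data, size=len(data), replace=True)
    boot_means[i] = resample.mean()

lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {data.mean():.3f}, 95% bootstrap CI: ({lo:.3f}, {hi:.3f})")
```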
A lot of the statistical ideas here have found applications in machine learning. Number 3 on the list above, overparameterized models and regularization, is central to machine learning and deep learning. According to Gelman and Vehtari, “a major change in statistics since the 1970s, coming from many different directions, is the idea of fitting a model with a large number of parameters — sometimes more parameters than data points — using some regularization procedure to get stable estimates and good predictions.” One of the biggest challenges in machine learning and deep learning is overfitting, where a model performs extremely well on the training dataset but poorly on an unseen holdout set. Regularization, in its various forms, is widely used by data scientists to address overfitting.
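As a small illustration of the “more parameters than data points” regime, here is a sketch of ridge (L2) regularization in plain NumPy; the dimensions and penalty strength are arbitrary choices for the example:

```python
# Ridge regularization via the closed form (X'X + lambda*I)^-1 X'y,
# which shrinks coefficients toward zero and stabilizes a fit
# with more parameters than observations.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 200                           # more parameters than data points
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:5] = [3, -2, 1.5, -1, 0.5]    # only a few truly nonzero effects
y = X @ true_beta + rng.normal(scale=0.5, size=n)

lam = 1.0  # regularization strength (a tuning choice)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print("largest estimated coefficients:", np.round(np.sort(np.abs(beta_ridge))[-5:], 2))
```

Without the `lam * np.eye(p)` term the normal equations would be singular here, since X'X has rank at most 50; the penalty is what makes the solution well defined at all.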
Furthermore, bootstrapping, which is number 2 on Gelman and Vehtari’s list, is an important component of random forest models. In a random forest, an ensemble of decision trees is built, where each tree is fit to a bootstrap sample of the original training set. This helps to control overfitting, as each tree in the ensemble is trained on a slightly different version of the original dataset.
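The sketch below shows that bootstrap-plus-ensemble idea (bagging) using scikit-learn decision trees on a synthetic dataset; it omits the per-split feature subsampling that a full random forest adds on top:

```python
# Bagging: each tree is fit on a bootstrap sample of the training
# data, and predictions are combined by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X, y = make_classification(n_samples=300, n_features=10, random_state=2)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote across trees, as in a random forest (minus feature subsampling).
votes = np.mean([t.predict(X) for t in trees], axis=0)
ensemble_pred = (votes >= 0.5).astype(int)
print("training accuracy of the ensemble:", (ensemble_pred == y).mean())
```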
As for number 6 on the list, the most prominent example of adaptive decision analysis these days is reinforcement learning, where an agent adapts its decisions based on the rewards it observes.
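A toy version of that adaptive loop is the epsilon-greedy multi-armed bandit sketched below (the reward rates are invented for illustration): the agent mostly exploits the arm it currently believes is best, occasionally explores, and updates its value estimates as rewards arrive.

```python
# Epsilon-greedy multi-armed bandit: the simplest flavor of
# reinforcement learning, adapting decisions to observed rewards.
import numpy as np

rng = np.random.default_rng(3)
true_means = [0.2, 0.5, 0.8]          # hypothetical reward rates for 3 arms
counts = np.zeros(3)
values = np.zeros(3)                  # running estimate of each arm's reward
epsilon = 0.1

for _ in range(1_000):
    if rng.random() < epsilon:
        arm = int(rng.integers(3))    # explore a random arm
    else:
        arm = int(np.argmax(values))  # exploit the best estimate so far
    reward = rng.binomial(1, true_means[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print("estimated arm values:", np.round(values, 2))  # the best arm's estimate approaches 0.8
```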
It is interesting to see exploratory data analysis on Gelman and Vehtari’s list (coming in at number 8), but it certainly falls within the realm of statistics. Here again, advances in computing, and the proliferation of personal computers, allow the average researcher or analyst to generate a range of plots far more easily than with pen and paper. The ease with which plots can be made with various tools means there is little excuse not to perform a rigorous EDA on your dataset before moving to the modeling phase.
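For instance, a first pass at EDA in Python might look like the sketch below (assuming pandas and matplotlib are installed, and a hypothetical data.csv):

```python
# A quick exploratory pass: summary statistics, missing values,
# and distribution/relationship plots for every numeric column.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # hypothetical dataset

print(df.describe())          # summary statistics per column
print(df.isna().sum())        # missing values per column

df.hist(figsize=(10, 6))      # distribution of each numeric column
plt.tight_layout()
plt.show()

pd.plotting.scatter_matrix(df.select_dtypes("number"), figsize=(10, 10))
plt.show()
```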
Other Statistical Methods
The first item on Gelman and Vehtari’s list is counterfactual causal inference. According to the Stanford Encyclopedia of Philosophy, “the basic idea of counterfactual theories of causation is that the meaning of causal claims can be explained in terms of counterfactual conditionals of the form ‘If A had not occurred, C would not have occurred’”.
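A simple way to see this framing is a toy potential-outcomes simulation, sketched below with invented numbers: every unit has an outcome under treatment and an outcome under control, but only one of the two is ever observed, and randomization lets the difference in group means recover the average causal effect.

```python
# Toy potential-outcomes simulation: y1 is the outcome if treated,
# y0 the outcome if not, and we observe only one per unit.
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
y0 = rng.normal(loc=10, size=n)                 # outcome if untreated
y1 = y0 + 2.0 + rng.normal(scale=0.5, size=n)   # outcome if treated (true effect = 2)

treated = rng.random(n) < 0.5                   # randomized treatment assignment
observed = np.where(treated, y1, y0)            # only one potential outcome is seen

# With randomization, the difference in group means estimates the
# average causal effect, even though no unit reveals both outcomes.
ate_hat = observed[treated].mean() - observed[~treated].mean()
print(f"estimated average treatment effect: {ate_hat:.2f}")  # close to 2.0
```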
Here is the link again to their paper for more details: What are the most important statistical ideas of the past 50 years?