Measuring the predictability of life outcomes with a scientific mass collaboration

Authors: First Author: Matthew J. Salganik

Note: The point of this article is that many different teams of researchers (160) worked on the same problem to see what could be learned about the methods that they utilized. I am not listing them all here because it's a bit ridiculous to do so — please follow the link below to the original article to get this information.

Publication, Year: PNAS, 2020

Link to Paper

Notes by: Matthew R. DeVerna

Contents:

  • Overall Findings
  • Intro and Lit
  • The Common Task Method
  • Benefits of this method
  • Data Utilized
  • Fragile Families and Child Wellbeing Study
  • Creating the Common Task Method
  • Outcome Variables
  • Prediction Data
  • Clarification on Challenge Task
  • Logistics of the Challenge
  • Uploading Predictions
  • $R_{Holdout}^{2}$
  • Results
  • Everyone Did Bad
  • Additional Insights
  • Discussion

Overall Findings

Intro and Lit

Much has been learned about the factors that affect human life outcomes. The ability to predict individual life outcomes, however, is not well developed.

Improving that ability would be important for three reasons:

  1. Good predictions can be used to help families at risk
  2. Efforts to understand differences in predictability across social contexts can stimulate scientific discovery
  3. Predictive improvements can help lead to better methods and improve theory

A mass collaboration called the Fragile Families Challenge was started to study the predictability of life outcomes using an approach called the "common task method".

The Common Task Method

Benefits of this method

Data Utilized

Fragile Families and Child Wellbeing Study

Each data collection module is made up of 10 sections, and each section focuses on a specific topic (e.g., child health, father-mother relationships, marriage attitudes).

Creating the Common Task Method

Outcome Variables

Wave 6 (age 15) included 1,617 variables — six of these were chosen to be the focus of the Fragile Families Challenge:

  1. Child Grade Point Average
  2. Child Grit
  3. Household Eviction
  4. Household Material Hardship
  5. Primary Caregiver Layoff
  6. Primary Caregiver Participation in Job Training

Why these were selected:

Prediction Data

Clarification on Challenge Task

The point here was to use data from waves 1 to 5 (birth to age 9), together with some wave 6 (age 15) data, to build a model that could then be used to predict the wave 6 outcomes for other families.

The task was not to predict wave 6 outcomes from waves 1 through 5 alone, which would be more difficult.

The half of the data that was withheld was split into two different groups:

  1. Leaderboard — This dataset could be utilized to test the prediction methods while the Fragile Families Challenge was underway
  2. Holdout — This data was untouched until the very end by all researchers involved and then was eventually utilized to evaluate the prediction models
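The split described above can be sketched in a few lines. This is only an illustration: the function name `split_families`, the random seed, and the `leaderboard_share` value are assumptions made for the example and are not taken from these notes.

```python
import random


def split_families(family_ids, seed=0, leaderboard_share=0.25):
    """Split ids into training, leaderboard, and holdout groups.

    Half of the families form the training set. The withheld half is
    divided into a leaderboard set and a holdout set.
    leaderboard_share is a hypothetical parameter: the fraction of the
    withheld half that goes to the leaderboard.
    """
    ids = list(family_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)

    half = len(ids) // 2
    training, withheld = ids[:half], ids[half:]

    n_leaderboard = int(len(withheld) * leaderboard_share)
    leaderboard = withheld[:n_leaderboard]  # usable during the challenge
    holdout = withheld[n_leaderboard:]      # untouched until the end
    return training, leaderboard, holdout


training, leaderboard, holdout = split_families(range(1000))
print(len(training), len(leaderboard), len(holdout))
```

Keeping the holdout group separate from the leaderboard group means the final evaluation uses data that no team could tune against while the challenge was running.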

Error metric used to evaluate all models: mean squared error

Logistics of the Challenge

Uploading Predictions

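To aid interpretation and facilitate comparisons across the six outcomes, all results were presented in terms of a rescaled mean squared error on the holdout data, $R_{Holdout}^{2}$:

$$R_{Holdout}^{2} = 1 - \frac{\sum_{i \in Holdout} (y_i - \hat{y}_i)^2}{\sum_{i \in Holdout} (y_i - \bar{y}_{Training})^2}$$

Here $\hat{y}_i$ is the predicted outcome for family $i$ and $\bar{y}_{Training}$ is the mean of the outcome in the training data.

This metric rescales the mean squared error of a prediction by the mean squared error when predicting the mean of the training data...

$R_{Holdout}^{2}$ is bounded above by 1 and has no lower bound. It provides a measure of predictive performance relative to two reference points: a value of 0 means the prediction is no more accurate than predicting the mean of the training data, and a value of 1 means the prediction is perfect.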
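As a minimal sketch of this rescaling, the metric can be computed as one minus the ratio of a prediction's mean squared error to the mean squared error obtained by predicting the training mean for every holdout case. The function names and the toy numbers below are illustrative assumptions, not the challenge's actual scoring code or data.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_holdout(y_holdout, y_pred, y_training):
    """Rescale holdout MSE by the MSE of predicting the training mean."""
    training_mean = sum(y_training) / len(y_training)
    baseline = [training_mean] * len(y_holdout)
    return 1 - (mean_squared_error(y_holdout, y_pred)
                / mean_squared_error(y_holdout, baseline))


# Toy values, made up for illustration only.
y_training = [2.0, 3.0, 4.0]
y_holdout = [1.0, 3.0, 5.0]

perfect = r2_holdout(y_holdout, y_holdout, y_training)
baseline_only = r2_holdout(y_holdout, [3.0, 3.0, 3.0], y_training)
partial = r2_holdout(y_holdout, [2.0, 3.0, 4.0], y_training)
worse = r2_holdout(y_holdout, [5.0, 3.0, 1.0], y_training)
print(perfect, baseline_only, partial, worse)
```

The last case shows why the metric has no lower bound: a prediction that is worse than the training mean produces a negative value.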



Everyone Did Bad

Once the challenge was complete, all submissions were scored against the holdout data, and the authors learned:

Finally, the authors note that their procedure (using the holdout data both to select the best submission and to evaluate it) likely yields slightly optimistic estimates, which underscores how difficult this prediction problem is.

Additional Insights

They also observed three important patterns within the submission data.

  1. Teams used a variety of different data processing and statistical learning techniques to generate predictions

  2. Despite this variety of techniques, the resulting predictions were quite similar

    • For all outcomes, the distance between the most divergent submissions was less than the distance between the best submission and the truth
    • Put another way, the submissions were much better at predicting each other than at predicting the truth
    • This means that their attempt to create an ensemble of predictions did not deliver any sort of substantial improvement in predictive accuracy
  3. Many observations (e.g., the GPA of a specific child) were accurately predicted by all teams — yet, a few observations were poorly predicted by all teams (Fig. 4)

    1. As a result, within each outcome, squared prediction error was strongly associated with the family being predicted and only weakly associated with the technique used to generate the prediction
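The second pattern above (submissions are closer to each other than the best one is to the truth) can be checked with a short comparison. This is a sketch under stated assumptions: Euclidean distance is used here for illustration and may not be the distance measure used in the paper, and the team names and numbers are made up.

```python
from itertools import combinations
from math import sqrt


def distance(a, b):
    """Euclidean distance between two vectors of predictions."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def compare_spread_to_error(submissions, truth):
    """Return (largest distance between any two submissions,
    smallest distance between any submission and the truth)."""
    max_between = max(
        distance(a, b) for a, b in combinations(submissions.values(), 2)
    )
    best_to_truth = min(distance(p, truth) for p in submissions.values())
    return max_between, best_to_truth


# Hypothetical data: three teams whose predictions all sit near the mean.
truth = [1.0, 4.0, 2.0, 5.0]
submissions = {
    "team_a": [3.0, 3.0, 3.0, 3.0],
    "team_b": [2.9, 3.1, 3.0, 3.2],
    "team_c": [3.1, 3.0, 2.9, 3.1],
}

max_between, best_to_truth = compare_spread_to_error(submissions, truth)
print(max_between < best_to_truth)
```

When the submissions are this similar, averaging them gives a prediction close to each individual one, which is consistent with the note that an ensemble did not substantially improve accuracy.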



