Authors: First Author: Matthew J. Salganik
Note: The point of this article is that many different teams of researchers (160) worked on the same problem to see what could be learned about the methods that they utilized. I am not listing them all here because it's a bit ridiculous to do so — please follow the link below to the original article to get this information.
Publication, Year: PNAS, 2020
Notes by: Matthew R. DeVerna
Measuring the predictability of life outcomes with a scientific mass collaboration
Everyone did a really bad job at predicting life outcomes
This was despite the fact that 160 teams of researchers worked on training predictive models in many different ways
This sheds light on the potential limits of prediction for problems like this
Also it suggests that we may need to reassess life outcome predictive models and reassess the problem itself
Much has been learned about the factors that affect human life outcomes. The ability to predict individual life outcomes, however, is not well developed
This would be important for three reasons:
A mass collaboration called the Fragile Families Challenge was started to utilize a method called the "common task method" to study the predictability of life outcomes
Longitudinal in nature
Follows thousands of families, each of whom gave birth to a child in a large US city around the year 2000
Was designed to understand families formed by unmarried parents and the lives of children born into these families
Has been used in more than 750 published articles
Data was collected in six waves:
About the Collection Waves
Each data collection module is made up of 10 sections, and each section focuses on a specific topic (e.g., child health, father-mother relationships, marriage attitudes)
In Home Assessment
At waves 3, 4, and 5 (ages 3, 5, and 9) an in-home assessment was conducted. This included:
The researchers created the Fragile Families Challenge to recruit researchers before the final wave (wave 6 — age 15) was available to researchers outside the "Fragile Families Team".
Wave 6 (age 15) included 1,617 variables — six of these were chosen to be the focus of the Fragile Families Challenge:
Why these were selected:
Researchers were given access to a background dataset:
Included waves 1-5 (birth to age 9)
Excluded genetic and geographic information
4,292 families
12,942 variables
In addition, they also had access to a smaller training dataset which included the six outcomes for half of the families (Fig. 2)
The point here was to use data from waves 1 to 5 (birth to age 9) and some data from wave 6 (age 15) to build a model that could then be used to predict the wave 6 outcomes for other families.
The task was not to predict wave 6 only from data from waves 1 through 5, which would obviously be more difficult.
The half of the data that was withheld was split into two different groups:
Error metric used to evaluate all models: mean squared error
While the Fragile Families Challenge was underway, participants could upload their submissions to the Challenge website. All submissions included:
After submission, participants could see their score on a leaderboard which ranked the accuracy of all uploaded predictions against the leaderboard data.
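The leaderboard mechanics described above can be sketched roughly as follows. This is only an illustration: the team names, the numbers, and the helper functions `mean_squared_error` and `rank_submissions` are hypothetical and are not taken from the Challenge itself.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def rank_submissions(leaderboard_truth, submissions):
    """Return (team, mse) pairs sorted from most to least accurate."""
    scores = {
        team: mean_squared_error(leaderboard_truth, preds)
        for team, preds in submissions.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Made-up outcome values for four leaderboard families (assumption).
leaderboard_truth = [1.0, 2.0, 3.0, 4.0]

# Made-up predictions from three hypothetical teams (assumption).
submissions = {
    "team_a": [1.0, 2.0, 3.0, 5.0],
    "team_b": [2.0, 3.0, 4.0, 5.0],
    "team_c": [1.0, 2.0, 3.0, 4.0],
}

ranking = rank_submissions(leaderboard_truth, submissions)
for team, mse in ranking:
    print(f"{team}: MSE = {mse:.2f}")
```

A lower mean squared error ranks higher, so in this toy example `team_c` comes first and `team_b` comes last.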
All researchers agreed to the Fragile Families Challenge procedures, including open-sourcing their final submissions
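To aid interpretation and facilitate comparisons across the six outcomes, all results were presented in terms of $R_{Holdout}^{2}$, a rescaled version of the mean squared error on the holdout data.
This metric rescales the mean squared error of a prediction by the mean squared error when predicting the mean of the training data.
$R_{Holdout}^{2}$ is bounded above by 1 and has no lower bound. It provides a measure of predictive performance relative to two reference points: a value of 0 means the predictions are no more accurate than predicting the mean of the training data, and a value of 1 means the predictions are perfectly accurate.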
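Written out, the rescaling described above corresponds to the following expression for $R_{Holdout}^{2}$ (the index notation is assumed here, not copied from these notes):

$$R_{Holdout}^{2} = 1 - \frac{\sum_{i \in Holdout} (y_i - \hat{y}_i)^2}{\sum_{i \in Holdout} (y_i - \bar{y}_{Training})^2}$$

A minimal Python sketch of the same calculation is below. The function name `r2_holdout` and all the numbers are made up for illustration and are not Challenge data.

```python
from statistics import mean


def r2_holdout(y_true, y_pred, train_mean):
    """Rescaled MSE: 1 - SSE(predictions) / SSE(training-mean baseline).

    Assumes the holdout outcomes are not all equal to train_mean,
    otherwise the denominator would be zero.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    sse_model = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sse_baseline = sum((y - train_mean) ** 2 for y in y_true)
    return 1.0 - sse_model / sse_baseline


# Made-up numbers, for illustration only.
train_outcomes = [2.0, 3.0, 4.0]
holdout_outcomes = [2.5, 3.5, 3.0]
baseline = mean(train_outcomes)

# Perfect predictions hit the upper bound of 1.
perfect = r2_holdout(holdout_outcomes, holdout_outcomes, baseline)
# Predicting the training mean for everyone scores exactly 0.
mean_only = r2_holdout(holdout_outcomes, [baseline] * 3, baseline)
# Predictions worse than the baseline go negative, with no lower bound.
worse = r2_holdout(holdout_outcomes, [4.0, 2.0, 4.5], baseline)

print(perfect, mean_only, worse)
```

The three calls illustrate the two reference points and the missing lower bound. The baseline uses the mean of the training data rather than the mean of the holdout data, which is why the score can fall below zero.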
Once the Challenge was complete, all submissions were scored against the holdout data, and the organizers learned:
Across the board, the prediction models did a very poor job
Best $R_{Holdout}^{2}$ for:
Material hardship and GPA: about 0.2
The other four outcomes: about 0.05
Finally, they note that their procedure (using the holdout data to select the best submission and then evaluating that submission on the same holdout data) likely makes these estimates slightly optimistic, which underscores how difficult the problem is.
They also observed three important patterns within the submission data.
Teams used a variety of different data processing and statistical learning techniques to generate predictions
Despite this difference, the resulting predictions were quite similar
Many observations (e.g., the GPA of a specific child) were accurately predicted by all teams — yet, a few observations were poorly predicted by all teams (Fig. 4)
The Fragile Families Challenge speaks directly to the predictability of life outcomes in only one setting: six specific outcomes, as predicted by a particular set of variables measured by a single study for a particular group of people.
Low predictive accuracy cannot be attributed to the limitations of any particular researcher or approach
Predictability is likely to vary:
Regardless, the results have major implications and suggest directions for future work
For example, over 750 articles have been published with this dataset. Is what was learned from this data actually valuable, given that it could not be used to predict life outcomes?
Reconciling this understanding/prediction paradox can be done in at least three ways
If prediction = understanding, then these results suggest that the current understanding of child development is quite poor
We could also argue that prediction does not = understanding
We could conclude that the prior understanding is correct but incomplete. It simply lacks sufficient theory to explain why we should expect outcomes to be difficult to predict — even with high quality data.
Researchers building predictive models in fields such as criminal justice and child-protective services should be concerned by these findings
Since the benchmark models were not much worse than the more complex models, they suggest starting with simple models and checking whether their predictive ability is sufficient
There are many longitudinal studies taking place all over the world right now — they could all be utilized for a large-scale Common Task Method study such as this one.