Access to data matters. Why don’t we talk more about that?

You’d have to be living under a rock not to have heard about the blockbuster JAMA paper on the relationship between income and life expectancy in the US. The Upshot has you covered here and here and here.

I want to focus on the paper’s methods. From the abstract:

Design and Setting: Income data for the US population were obtained from 1.4 billion deidentified tax records between 1999 and 2014. Mortality data were obtained from Social Security Administration death records. These data were used to estimate race- and ethnicity-adjusted life expectancy at 40 years of age by household income percentile, sex, and geographic area, and to evaluate factors associated with differences in life expectancy.

From the methods section (emphasis mine):

The analysis used a deidentified database of federal income tax and Social Security records that includes all individuals with a valid Social Security Number between 1999 and 2014.

Income data were obtained from tax records for every individual for every year from 1999 through 2014. The primary measure of income was pretax household earnings. For those who filed tax returns, household earnings were defined as adjusted gross income plus tax-exempt interest income minus taxable Social Security and disability benefits. For those who did not file a tax return, household earnings were defined as the sum of all wage earnings (reported on form W-2) and unemployment benefits (reported on form 1099-G). When individuals had no tax return and no information returns, household earnings were $0. For nonfilers, earnings did not include the spouse’s income. However, the vast majority of nonfilers who are not receiving Social Security benefits are single. Income was adjusted to 2012 dollars using the consumer price index.
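(An aside of my own, not from the paper: the income definition above is simple enough to write down. Here is a minimal Python sketch of it. The `TaxYearRecord` class, its field names, and the CPI arguments are hypothetical and chosen for illustration; I have no idea how the authors' actual data or code are organized.)

```python
from dataclasses import dataclass


@dataclass
class TaxYearRecord:
    """One person-year. All field names are hypothetical, for illustration."""
    filed_return: bool
    adjusted_gross_income: float = 0.0
    tax_exempt_interest: float = 0.0
    taxable_ss_and_disability: float = 0.0
    w2_wages: float = 0.0            # wage earnings reported on form W-2
    unemployment_1099g: float = 0.0  # unemployment benefits on form 1099-G


def household_earnings(rec: TaxYearRecord) -> float:
    """Pretax household earnings as described in the quoted methods."""
    if rec.filed_return:
        # Filers: AGI plus tax-exempt interest, minus taxable
        # Social Security and disability benefits.
        return (rec.adjusted_gross_income
                + rec.tax_exempt_interest
                - rec.taxable_ss_and_disability)
    # Nonfilers: W-2 wages plus unemployment benefits. With no information
    # returns both fields keep their default of 0, so earnings are $0.
    return rec.w2_wages + rec.unemployment_1099g


def to_2012_dollars(amount: float, cpi_that_year: float, cpi_2012: float) -> float:
    """Rescale a nominal amount to 2012 dollars using a CPI ratio.

    The caller supplies the index values; none are hard-coded here.
    """
    return amount * cpi_2012 / cpi_that_year
```

With made-up numbers, a filer with an AGI of 50,000, tax-exempt interest of 500, and 2,000 in taxable benefits comes out at 48,500, and a nonfiler with no W-2 or 1099-G comes out at 0.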

Mortality was measured using Social Security Administration (SSA) death records. Total deaths in the SSA data closely match data from the National Center for Health Statistics (NCHS), with correlations exceeding 0.98 across ages and years (part I of the eAppendix, eFigure 1, and eTable 1 in the Supplement). Observations with income of $0 were excluded because the SSA does not fully track deaths of nonresidents, and thus mortality rates for individuals with income of $0 are mismeasured or unavailable. After excluding observations with income of $0, individuals were assigned percentile ranks from 1 to 100 based on their household earnings relative to all other individuals of the same sex and age in the United States during each year.
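(Another aside of my own: the percentile step is also easy to sketch. The code below drops $0 observations and ranks everyone from 1 to 100 within their sex, age, and year. The dictionary keys are hypothetical, and the handling of ties and of small groups is my assumption; the quoted methods don't specify either.)

```python
from collections import defaultdict


def assign_percentiles(rows):
    """Rank earnings from 1 to 100 within each (sex, age, year) group.

    `rows` is a list of dicts with hypothetical keys "sex", "age", "year",
    and "earnings". Observations with earnings of 0 are excluded first.
    """
    groups = defaultdict(list)
    for row in rows:
        if row["earnings"] == 0:
            continue  # excluded, as in the quoted methods
        groups[(row["sex"], row["age"], row["year"])].append(row)

    ranked = []
    for members in groups.values():
        # Sort a copy so the caller's input is left untouched.
        ordered = sorted(members, key=lambda r: r["earnings"])
        n = len(ordered)
        for position, row in enumerate(ordered, start=1):
            # Integer ceiling of 100 * position / n. Ties are broken by
            # sort order, which is an assumption made here.
            percentile = (100 * position + n - 1) // n
            ranked.append({**row, "percentile": percentile})
    return ranked
```

The point of ranking within sex, age, and year is that a 40-year-old and a 60-year-old at the same percentile are each being compared with their own peers, not with each other.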

This is an amazing analysis, don’t get me wrong. But exactly how does one go about getting tax records for every individual in the US for every year from 1999 through 2014? How does one get the SSA to turn over the records of every individual with a valid SSN between 1999 and 2014? I wouldn’t even know where to start.

Sometimes studies like these depend entirely on a researcher’s ability, and influence, to get data from the organizations that hold it. We never talk about that. No one shares their secrets. Studies like this are, therefore, rare.

