by Luke Shulman

“Social Determinants” are fast becoming the new buzzword in population health. Recent findings have reiterated what many intuitively know: 60% of a patient’s total health is determined not by their clinical conditions but by other factors such as where they live, their household status, and their behaviors - collectively, the social determinants of health (SDOH). (source)

The following chart from the link above summarizes the social determinants by category:

Social Chart

Health organizations, including insurance plans and hospital systems, don’t have great data on these factors. Based on the presentations I saw at the Actionable Analytics Summit in San Diego this week, analysts are clamoring for it. For each presenter who highlighted the value of social data to their work, hands went up asking, “Where do I get this data?”

Well, here you go. In this post we will go over the predominant sources of this data. We will start with broad sources taken from official samples by the Census Bureau, the CDC, and other agencies. These sources aren’t individualized down to a specific person, but they do summarize geographic regions, allowing for statistically valuable approximation.

The focus of this article will be on these geographic sources. But I also want to mention individualized data sources from industry vendors (some licensed by Algorex) that can help append data values from massive databases such as credit bureaus, real estate transactions, and online behaviors. There is a great deal of valuable insight in these data aggregators, which I plan to cover in a subsequent post.

Government Geographic Sources

For the United States, the Census Bureau and the CDC both perform large, statistically robust surveys to identify trends in American demography and health. These provide a rich tapestry of geographically relevant data that can approximate the SDOH elements of the community where a patient resides.
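To make the idea of geographic approximation concrete, here is a minimal sketch that appends county-level values to patient records. It assumes each patient record already carries a county FIPS code. The field names (`county_fips`, `median_income`, `pct_uninsured`) and all of the figures are hypothetical placeholders, not real survey estimates.

```python
# Hypothetical county-level indicators keyed by 5-digit county FIPS code.
# The values are illustrative placeholders, not real survey estimates.
COUNTY_SDOH = {
    "06073": {"median_income": 70000, "pct_uninsured": 8.5},
    "25025": {"median_income": 65000, "pct_uninsured": 4.0},
}


def attach_county_sdoh(patients, county_table):
    """Return copies of patient records with county-level fields appended.

    Patients whose county is missing from the table keep their original
    fields and are flagged so the gap stays visible downstream.
    """
    enriched = []
    for patient in patients:
        # FIPS codes stored as integers lose their leading zeros; restore them.
        fips = str(patient["county_fips"]).zfill(5)
        area = county_table.get(fips)
        record = dict(patient)
        record["sdoh_matched"] = area is not None
        if area is not None:
            record.update(area)
        enriched.append(record)
    return enriched


patients = [
    {"patient_id": "A1", "county_fips": 6073},     # integer, leading zero lost
    {"patient_id": "B2", "county_fips": "99999"},  # not present in the table
]
enriched = attach_county_sdoh(patients, COUNTY_SDOH)
```

Every patient in the same county receives the same values, which is exactly the trade-off of geographic data: it describes the neighborhood, not the individual.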

Census Bureau

The US Census Bureau’s American Community Survey (ACS) is a robust, survey-based study of the income, household, and social status of Americans in every municipality. The 5-year aggregation of the ACS includes data down to the census block group, an area with a population of up to about 3,000 people.

The Census Bureau disseminates the data through a range of channels, from an API and FTP to the American FactFinder website.
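For the API route, a request is just a URL and the response is JSON. The sketch below builds a query for county-level ACS 5-year variables and converts a response into dictionaries. The endpoint pattern, the 2019 vintage, and the variable `B19013_001E` (median household income) are my assumptions about the public Census API rather than anything specified in this post, and the sample payload is made up so the example runs without a network call.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint pattern for ACS 5-year estimates on api.census.gov.
ACS5_BASE = "https://api.census.gov/data/{year}/acs/acs5"


def build_acs_url(year, variables, state_fips, api_key=None):
    """Build a query URL for county-level ACS 5-year variables in one state."""
    params = {
        "get": ",".join(["NAME"] + list(variables)),
        "for": "county:*",
        "in": f"state:{state_fips}",
    }
    if api_key:
        params["key"] = api_key
    # Keep commas, colons, and asterisks readable instead of percent-encoding them.
    return ACS5_BASE.format(year=year) + "?" + urlencode(params, safe=",:*")


def rows_to_records(payload):
    """Convert list-of-lists JSON (header row first) into a list of dicts."""
    header, *rows = json.loads(payload)
    return [dict(zip(header, row)) for row in rows]


url = build_acs_url(2019, ["B19013_001E"], "06")

# Sample payload in the assumed response shape; the income value is made up.
sample = (
    '[["NAME","B19013_001E","state","county"],'
    '["San Diego County, California","12345","06","073"]]'
)
records = rows_to_records(sample)
```

Concatenating the `state` and `county` fields of each record gives the five-digit county FIPS code, which is the usual key for joining this data to anything else.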

The best part for statisticians using R is that the census data has been packaged into an R package called UScensus2010, which means you can use the data in R without any ETL or separate file downloads.

During the sessions in San Diego, I pointed out that plotting census county variables on a map is actually one of the training tutorials used to teach R. I think that is a great testament to the power of open-source data science. In one learning session, you learn to turn one of the largest, most consequential health and economic datasets into an interactive visualization.

US Choropleth

Here is a tutorial using RStudio’s Shiny web framework: Shiny Lesson 5
Here is another great tutorial using R, though much more in depth: Making Maps in R
For Python folks, here is the same process in Python: Flowing Data Mapping in Python
Or check out this Python example using Leaflet: Folium & Leaflet

I point these out because these are often some of the first visualizations you learn when starting to use the Open Data Science tools.
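Whatever mapping library you pick, the core of a choropleth is the same: bin each region's value into a class and give each class a color. Here is a minimal, library-free sketch of that step using quantile breaks. The palette, the county values, and the function names are all illustrative choices of mine and are not taken from the tutorials above.

```python
def quantile_breaks(values, n_classes):
    """Return n_classes - 1 cut points that split the values into
    roughly equal-count groups."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * i) // n_classes] for i in range(1, n_classes)]


def classify(value, breaks):
    """Return the index of the class a value falls into (0 = lowest)."""
    return sum(value >= b for b in breaks)


# Light-to-dark palette; the hex codes are an arbitrary illustrative choice.
PALETTE = ["#eff3ff", "#bdd7e7", "#6baed6", "#2171b5"]

# Made-up county values keyed by FIPS code.
rates = {
    "01001": 4.0, "01003": 7.5, "01005": 12.0, "01007": 9.0,
    "01009": 5.5, "01011": 15.0, "01013": 11.0, "01015": 6.0,
}

breaks = quantile_breaks(rates.values(), len(PALETTE))
colors = {fips: PALETTE[classify(value, breaks)] for fips, value in rates.items()}
```

The resulting `colors` mapping is what a mapping library would then paint onto county shapes.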


Centers for Disease Control

The Centers for Disease Control releases so much data that it is difficult for me to summarize in one paragraph here. The data includes epidemiological reports, statistics about accidents, and even mortality rates. Most population health users should gravitate to the Behavioral Risk Factor Surveillance System (BRFSS), which is a survey of various health behaviors and quality-of-life indicators.

Intelligent Aggregators

If you really want data down at the sub-county level, you may have to wade through the raw data from the CDC and the Census Bureau. But for most users, I highly recommend using data that has been pre-processed by various public health projects. These projects join data from multiple CDC and Census sources into one data product. Furthermore, they often assemble the survey variables into meaningful scores and rankings that are better suited for comparison. The two aggregation projects I want to highlight are as follows:

  1. County Health Rankings: This is truly an amazing project sponsored by the Robert Wood Johnson Foundation. It joins data from 20 different sources across the CDC, Census Bureau, Agriculture Department, FBI, and even the Transportation Department. All these variables are then put into rankings to allow for comparisons. The data is also easily accessible in SAS and CSV formats. I highly recommend that users looking for social determinant data start with this project.
  2. 500 Cities: Another data project, sponsored jointly by the Robert Wood Johnson Foundation and the CDC, focuses on data for 500 of the largest cities across the country. Although there is less geographic coverage, the health data in this project is deeper, with detailed prevalence information on 27 different chronic disease measures.
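To give a feel for how raw variables become comparable rankings, here is a minimal sketch of one common approach: standardize each measure to a z-score, combine the z-scores with weights, and rank the result. This illustrates the general idea only. It is not the actual methodology of either project, and the counties, measures, weights, and values are all made up.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # Every county has the same value, so the measure cannot separate them.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def rank_counties(counties, measures):
    """Rank counties by a weighted sum of z-scores (rank 1 = best).

    `measures` maps a field name to (weight, higher_is_worse). Fields where a
    higher value is better are sign-flipped so a lower composite is better.
    """
    names = list(counties)
    composite = {name: 0.0 for name in names}
    for field, (weight, higher_is_worse) in measures.items():
        standardized = z_scores([counties[name][field] for name in names])
        sign = 1 if higher_is_worse else -1
        for name, z in zip(names, standardized):
            composite[name] += sign * weight * z
    ordered = sorted(names, key=lambda name: composite[name])
    return {name: rank for rank, name in enumerate(ordered, start=1)}


# Made-up values for three hypothetical counties.
counties = {
    "County A": {"pct_smokers": 12.0, "pct_hs_grad": 92.0},
    "County B": {"pct_smokers": 20.0, "pct_hs_grad": 85.0},
    "County C": {"pct_smokers": 16.0, "pct_hs_grad": 88.0},
}
measures = {"pct_smokers": (0.5, True), "pct_hs_grad": (0.5, False)}
ranks = rank_counties(counties, measures)
```

Standardizing first is what lets a smoking rate and a graduation rate share one scale, which is why pre-built rankings are so much easier to compare than the raw survey variables.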

Supplemented Private Geographic Sources

Several companies sell products based on the CDC and ACS data but with additional projections and economic analysis to help fill in the gaps in the ACS and to make the data a little easier to consume. One firm is Geolytics; another is Proximity One. Others bundle this type of information into mapping data products such as ESRI’s ArcGIS software. I have not used these directly, but conference attendees said these products are often significantly less expensive than individualized data.

Individualized Data

Now for the fun part. Whenever I move, I immediately end up with offers for a new cable subscription and a Bed Bath & Beyond coupon. These offers are not accidental; they are triggered by the massive private data exchanges between online firms (Facebook, Google), retailers, financial institutions, and other data aggregators. There is a lot of promise in these data sets, which often allow for direct linking to a person. At Algorex, we have standard licensing for these data sets to use in model development for customers where appropriate.

Most of these data sets are designed to acquire unknown customers. For health plans and health systems participating in value-based contracts, the problem is not necessarily acquiring the unknown customer but learning more about the customers you have. These data sets offer unique insights into behavior, and the ability to link those insights to individuals can make a big difference in improving engagement and outreach. Some services are simple, such as address correction or email address correction. Other data sets are more complex, with derived personas based on spending and online behavior.

There is a lot more to get into about how health organizations can successfully leverage this data. We plan to tackle that in later posts, so be sure to follow our blog and follow @AlgorexHealth on Twitter.

Credits: Kaiser Family Foundation, Robert Wood Johnson Foundation