Million Dollar Burden
“Health care is complicated.” It’s the joke I’ve heard at every meeting over the past two months, and with health data it’s absolutely true. Even the most advanced health information systems are left with data classified into hundreds of thousands of granular codes. The document formats complicate things further: EDI transactions, C-CDA documents, FHIR resources, or a flat file created for a point-to-point exchange that you’re left trying to make sense of.
The Value Set Authority Center (VSAC) lists 15 different code systems that can be used for population health (shout out: the VSAC is a great resource). These range from Health Service Location Codes, just 189 codes used to identify the locations of services, to SNOMED CT, which has over 311,000 concepts for health classification.
I’ve heard many IT managers say this data feels like a burden. There is a constant flow of ETL jobs among various vendors. Each vendor groups things differently, and certain categorizations are then only available in that database or application, with no updates until the overnight batch. But the data doesn’t have to be a burden. With some simple open-source tools you can classify millions of rows of claims or clinical data and get immediate results.
With all of this complexity, it can seem impossible to distill analytics to a level broad enough to be consumable. Many organizations turn to vendors who specialize simply in the classification and grouping of codes. All in all, categorization can be one of the most significant obstacles to starting a health analytics project. But it shouldn’t be: with a few simple tools, organizations can take advantage of the range of mappings that have been published for research and other purposes. “Bias to action”: it’s better to get started with a basic mapping system than with nothing at all.
With this in mind, I wanted to publish this Jupyter notebook, which uses our open-source health categorization functions. When we kick off one of our analytics sprints for a customer, we often start with the steps detailed here, just so we have the procedure and diagnosis data categorized before we move forward. You can find the source code on our GitHub page under CMS Code Categorizer; for the Jupyter notebook, look at the Blog branch of that repository.
What our library does:
- Provides a basic implementation of categories from the Health Care Cost Institute.
- Provides functions to calculate the correct position of a code within a range of HCPCS codes. Because HCPCS codes are both numeric and alphanumeric, with gaps in the ranges, this is not a simple function. See is_in_range() in the source code.
- Provides structures for anyone to add their own mapping as a simple Python dictionary. You can do this by range or by specific code.
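The range check is the tricky part: HCPCS codes mix purely numeric codes (e.g. 99214) with alphanumeric Level II codes (e.g. G0008), so a naive string comparison can misplace codes across families. Here is a minimal sketch of the idea; the library’s actual is_in_range() implementation may differ:

```python
def is_in_range(code, low, high):
    """Check whether an HCPCS code falls within a range (illustrative sketch).

    HCPCS codes are five characters: either all digits (Level I / CPT)
    or a letter followed by four digits (Level II). Normalizing both
    forms to a sortable (prefix, number) tuple keeps numeric and
    alphanumeric codes from being compared across families.
    """
    def key(c):
        c = c.strip().upper()
        if c[0].isalpha():
            return (c[0], int(c[1:]))   # Level II: letter prefix + number
        return ('', int(c))             # Level I: purely numeric

    return key(low) <= key(code) <= key(high)

print(is_in_range('99214', '99201', '99215'))  # True
print(is_in_range('G0008', '99201', '99215'))  # False: different family
```

In the library, ranges like these are paired with category labels in a plain Python dictionary, which is why adding your own mapping is just adding entries.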
So let’s try it out.
To test this out, we are going to use the Medicare 2017 physician fee schedule, which can be accessed on the CMS website. We will use the Mass File, PFMA17A, for this example.
import pandas as pd

df = pd.read_csv('PFMA17A.txt', names=['Year', 'CarrierNumber', 'Locality', 'HCPCS', 'Modifier', 'NonFacFee', 'FacFee', 'Fill1', 'Fill2', 'PCTC', 'TherapyReduction', 'insTherapy', 'OPPSind', 'OPPSnonFacFee', 'OPPSFacFee', 'Trailer', 'TrailerInd'])
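One general pandas tip worth applying here (the rows below are made-up stand-ins, not real fee-schedule data): because HCPCS codes are alphanumeric, it is safest to force the column to string on load with `dtype`, so numeric-looking codes keep their exact form:

```python
import io
import pandas as pd

# Two fake rows in the spirit of the fee-schedule layout (values invented).
sample = "2017,00700,1,99214,,109.33,80.23\n2017,00700,1,G0008,,16.72,16.72\n"
cols = ['Year', 'CarrierNumber', 'Locality', 'HCPCS', 'Modifier',
        'NonFacFee', 'FacFee']

# dtype={'HCPCS': str} keeps alphanumeric codes and leading zeros intact
df = pd.read_csv(io.StringIO(sample), names=cols, dtype={'HCPCS': str})
print(df['HCPCS'].tolist())  # ['99214', 'G0008']
```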
Now, I am going to focus on the Boston locality “1”.
boston_fees = df[df['Locality'] == 1]
Now let’s import the Algorex categorization library so that we can use it. The file needs to be in the same directory as the notebook.
import categorizer as codes
Before we categorize the whole file, let’s look at how it works. The main method is carrier_categorizer_by_hcpc, which categorizes HCPCS codes. So if we wanted to categorize a 99214 office visit:
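A sketch of what that call looks like, with a toy stand-in for the library’s mapping (the real carrier_categorizer_by_hcpc resolves codes against the full HCCI category ranges; the two ranges below are purely illustrative):

```python
# Toy stand-in for the library's mapping: category label -> list of ranges.
RANGES = {
    'Office Visits': [('99201', '99215')],
    'Surgery': [('10021', '69990')],
}

def carrier_categorizer_by_hcpc(code):
    """Return the first category whose range contains the code."""
    for category, ranges in RANGES.items():
        for low, high in ranges:
            # equal-length numeric code strings compare correctly as strings
            if low <= str(code) <= high:
                return category
    return 'NOTFOUND'

print(carrier_categorizer_by_hcpc('99214'))  # Office Visits
```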
So now let’s categorize our whole file. We will add a new column with the category to our dataframe.
boston_fees = boston_fees.assign(category=boston_fees['HCPCS'].apply(codes.carrier_categorizer_by_hcpc))
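The per-category counts that follow are the kind of summary you get from pandas’ value_counts on the new column; sketched on a toy frame:

```python
import pandas as pd

# Toy frame with a 'category' column like the one just assigned above.
toy = pd.DataFrame({'category': ['Surgery', 'Surgery', 'Radiology']})

# value_counts tallies each label, sorted most-frequent first
counts = toy['category'].value_counts()
print(counts)  # Surgery appears twice, Radiology once
```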
Surgery                         5590
Radiology                       1680
Other Professional Services      704
Cardiovascular                   285
Pathology/Lab                    213
Ophthalmology                     95
NOTFOUND                          92
Physical Medicine                 54
Inpatient Visits                  41
Allergy                           25
Immunizations/Injections          19
Office Visits                     19
Psychiatry & Biofeedback          18
Emergency Room/Critical Care      15
Preventive Visits                  9
Name: category, dtype: int64
So as we can see, all the codes were categorized except 92, which were labeled ‘NOTFOUND’. Now why don’t we have some fun and see which categories are, on average, the most expensive:
avg_costs = boston_fees.groupby('category').mean()
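A quick note on the groupby: pandas averages the numeric columns per category, and on recent pandas versions you may need `numeric_only=True` to skip the string columns. A toy version (data invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'category': ['Surgery', 'Surgery', 'Radiology'],
    'NonFacFee': [500.0, 300.0, 100.0],
})

# numeric_only=True avoids errors on non-numeric columns in newer pandas
avg = toy.groupby('category').mean(numeric_only=True)
print(avg.loc['Surgery', 'NonFacFee'])  # 400.0
```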
import seaborn as sns
%matplotlib inline

# note: factorplot was renamed catplot in seaborn 0.9
g = sns.factorplot(data=avg_costs.reset_index(), y="NonFacFee", x="category", kind="bar", aspect=1.5)
g.set_xticklabels(rotation=90)
[Bar chart: average NonFacFee by category]
There we go. No real surprise that surgery has the highest average prices.
Using the categorizer function, we were able to immediately categorize almost all 6,000 HCPCS codes used in the Physician Fee Schedule. It’s also extensible to other code sets by simply adding rule dictionaries to the underlying code mappings. So whether you have a payer claims file or a pile of clinical encounters, go ahead and get basic analytics from it quickly using open-source tools like our categorizer and Python.