by Luke Shulman

Million Dollar Burden

“Health care is complicated.” It’s the joke I have heard at every meeting over the past two months, and with health data it’s absolutely true. Even the most advanced health information systems are left with data that is classified into hundreds of thousands of granular codes. The document formats add another layer of complexity, whether you are dealing with EDI transactions, C-CDA documents, FHIR, or just trying to make sense of a flat file created for a point-to-point exchange.

The Value Set Authority Center (VSAC) lists 15 different code systems that can be used for population health. (Shout-out: the VSAC is a great resource.) These range from “Health Service Location Codes,” which are just 189 codes used to identify the locations of services, to SNOMED CT, which has over 311,000 concepts for health classification.

I’ve heard from many IT managers who feel like this data is a burden. There is a constant flow of ETL jobs among various vendors. Each vendor groups things differently, and certain categorizations are then available only in that vendor’s database or application, with no updates until the overnight batch. Well, the data doesn’t have to be a burden. With some simple open-source tools, you can classify millions of rows of claims or clinical data and get immediate results.

With all of this complexity, it can seem impossible to distill analytics to a broad enough level for it to be consumable. Many organizations turn to vendors who specialize simply in classification and grouping of codes. All in all, categorization can be one of the most significant obstacles to starting a health analytics project. But it shouldn’t be. With a few simple tools, organizations can take advantage of a range of mappings that have been published for research or other purposes. “Bias to action”: it’s better to get started with a basic mapping system than to do nothing at all.

Algorex Categorizer

With this in mind, I wanted to publish this Jupyter notebook, which uses our open-source health categorization functions. When we kick off one of our analytics sprints for a customer, we often start with the steps detailed here, just so we can have the procedure or diagnosis data categorized before we move forward. You can access the source code on our GitHub page at CMS Code Categorizer. To access the Jupyter notebook, look at the Blog branch of that repository.

What our library does:

  1. Provides a basic implementation of categories from the Health Care Cost Institute.
  2. Provides functions to calculate the correct position of a code in a range of HCPCS codes. Because HCPCS codes are both numeric and alpha-numeric with gaps in the ranges, this is not a simple function. See is_in_range() in the source code.
  3. Provides structures for anyone to add their own mapping as a simple Python dictionary. You can do this by range or by specific code.
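
As a rough illustration of the second and third points, the sketch below shows one way a range check and a rules dictionary could fit together. It is not the library's is_in_range(); the names code_shape, code_in_range, RANGE_RULES, CODE_RULES, and categorize are hypothetical, and the category assignments are illustrative rather than the Health Care Cost Institute mapping.

```python
def code_shape(code):
    """Describe a code as a pattern of letters (A) and other characters (9),
    e.g. 'G0008' -> 'A9999' and '99214' -> '99999'."""
    return ''.join('A' if ch.isalpha() else '9' for ch in code)


def code_in_range(code, start, end):
    """Return True if code lies between start and end (inclusive).

    Codes are compared as strings, and only when all three share the same
    letter/digit pattern, so '0042T' never matches a purely numeric range.
    """
    code = str(code).strip().upper().zfill(5)
    start, end = start.upper(), end.upper()
    if not (code_shape(code) == code_shape(start) == code_shape(end)):
        return False
    return start <= code <= end


# Hypothetical rules: mappings by range and by specific code.
RANGE_RULES = {
    ('99201', '99215'): 'Office Visits',
    ('70010', '79999'): 'Radiology',
}
CODE_RULES = {
    'G0008': 'Immunizations/Injections',
}


def categorize(code):
    """Look up a specific code first, then fall back to the range rules."""
    key = str(code).strip().upper().zfill(5)
    if key in CODE_RULES:
        return CODE_RULES[key]
    for (start, end), category in RANGE_RULES.items():
        if code_in_range(key, start, end):
            return category
    return 'NOTFOUND'


print(categorize('99214'))
print(categorize('0042T'))
```

Comparing codes as fixed-width strings, rather than converting them to integers, is what lets numeric and alphanumeric codes share one lookup in this sketch.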

So let’s try it out.

Test Data

To test this out, we are going to use the Medicare 2017 fee schedule, which can be accessed on the CMS website:

Medicare 2017 Fee Schedule. Specifically, you want the carrier files for your state.

We will use the Massachusetts file, PFMA17A, for this example.

import pandas as pd

# The carrier file has no header row, so the column names are supplied here.
column_names = ['Year', 'CarrierNumber', 'Locality', 'HCPCS', 'Modifier',
                'NonFacFee', 'FacFee', 'Fill1', 'Fill2', 'PCTC',
                'TherapyReduction', 'insTherapy', 'OPPSind', 'OPPSnonFacFee',
                'OPPSFacFee', 'Trailer', 'TrailerInd']
df = pd.read_csv('PFMA17A.txt', names=column_names)
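
One thing to watch when loading code columns: pandas infers column types, so a column made up only of digits may be parsed as integers, which would drop leading zeros from codes such as 00100. The sketch below shows one way to guard against that by reading the HCPCS column as text. The rows are made up for illustration, the column list is shortened, and whether the categorizer expects strings is an assumption.

```python
import io

import pandas as pd

# Made-up rows for illustration only; these are not real fee schedule values.
sample = io.StringIO(
    '"2017","01","00100","10.00"\n'
    '"2017","01","99214","20.00"\n'
    '"2017","99","G0008","30.00"\n'
)

# Shortened, hypothetical column list; the real file has 17 columns.
cols = ['Year', 'Locality', 'HCPCS', 'NonFacFee']

# dtype={'HCPCS': str} keeps every code as text, leading zeros included.
sample_df = pd.read_csv(sample, names=cols, dtype={'HCPCS': str})

print(sample_df['HCPCS'].tolist())
```

Locality is still inferred as a number here, so a filter such as Locality == 1 keeps working the same way.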

Now, I am going to focus on the Boston locality “1”.

boston_fees = df[df['Locality'] == 1]

Now let’s import the Algorex Categorization library so that we can use it. The file will need to be in the same directory as the project.

import categorizer as codes

Before we categorize the whole file, let’s look at how it works a little. The main method is carrier_categorizer_by_hcpc which will categorize the HCPCS codes.

So if we wanted to categorize a 99214 office visit:

codes.carrier_categorizer_by_hcpc('99214')

'Office Visits'

So now let’s categorize our whole file. We will add a new column with the category to our dataframe.

boston_fees = boston_fees.assign(category=boston_fees['HCPCS'].apply(codes.carrier_categorizer_by_hcpc))
boston_fees['category'].value_counts()

Surgery                         5590
Radiology                       1680
Other Professional Services      704
Cardiovascular                   285
Pathology/Lab                    213
Ophthalmology                     95
NOTFOUND                          92
Physical Medicine                 54
Inpatient Visits                  41
Allergy                           25
Immunizations/Injections          19
Office Visits                     19
Psychiatry & Biofeedback          18
Emergency Room/Critical Care      15
Preventive Visits                  9
Name: category, dtype: int64

So as we can see, all the rows were categorized except 92, which were labeled ‘NOTFOUND’. Now why don’t we have some fun and see which categories are, on average, the most expensive:

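# Average each numeric column within a category.
avg_costs = boston_fees.groupby('category').mean(numeric_only=True)
import seaborn as sns
%matplotlib inline
# catplot is the current name for what older seaborn releases called factorplot.
g = sns.catplot(data=avg_costs.reset_index(), y="NonFacFee", x="category", kind="bar", aspect=1.5)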

[Figure: bar chart of the average NonFacFee for each category]


There we go. No real surprise: surgery has the highest prices.


Using the categorizer function, we were able to immediately categorize almost all 6000 HCPCS codes used in the Physician Fee Schedule. It’s also extensible to other code sets by simply adding rules dictionaries to the underlying code. So whether you have a payer claims file or a bunch of clinical encounters, go ahead and get basic analytics from it quickly using open tools like our categorizer and Python.
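
For a much larger claims file, one approach that may help is to categorize each distinct code once and then map the result back onto every row, since a claims file usually repeats the same codes many times. The sketch below assumes a hypothetical stand-in function, toy_categorizer, in place of codes.carrier_categorizer_by_hcpc, and uses made-up claim lines. It also pulls out the codes that came back as NOTFOUND, which are natural candidates for new rules.

```python
import pandas as pd


def toy_categorizer(code):
    """Hypothetical stand-in for codes.carrier_categorizer_by_hcpc."""
    lookup = {'99214': 'Office Visits'}
    return lookup.get(code, 'NOTFOUND')


# Made-up claim lines for illustration only; 'ZZZZZ' is not a real code.
claims = pd.DataFrame({'HCPCS': ['99214', '99214', 'ZZZZZ', '99214']})

# Categorize each distinct code once, then broadcast the result to every row.
unique_codes = claims['HCPCS'].unique()
code_to_category = {code: toy_categorizer(code) for code in unique_codes}
claims['category'] = claims['HCPCS'].map(code_to_category)

# Codes that no rule covered, as candidates for new mapping rules.
unmatched = sorted(claims.loc[claims['category'] == 'NOTFOUND', 'HCPCS'].unique())
print(unmatched)
```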