by Daniel Eklund
This is a post in the Declarative Programming in Healthcare: From Datalog to CHR series.

CHR Design

The following post is an introduction to, and summary of, what will be a series of posts about the use of declarative programming, specifically logic programming, within one of Algorex’s more important libraries – the HCC (hierarchical condition categories) risk calculator. We hope this series will be interesting to several classes of reader:

  • The imperative programmer (regardless of field) who has never encountered the term ‘logic programming’ before. We expect this reader to come away with a good introduction to the importance and use of declarative semantics.
  • Healthcare technologists who, like us, are immersed in a field that is complicated and governed by heuristics. We expect this reader to enjoy the deep dive we have made into translating SAS rules into a more readable form, and the juxtaposition of appropriate technologies with domain knowledge.
  • The novice logic programmer (who maybe took a class on Prolog in school), or the expert, who knows the technologies we introduce but may be interested in the thought processes and hurdles that we encountered. We expect this reader to give us feedback, correct us, and feel good knowing that an unheralded technology still has its adherents.

This is a long series, which is why we have broken it up over several posts, and each post might be of more interest to a different reader. We hope to make each post self-contained, but out of pride of ownership we encourage people to read as much of the series as their understanding warrants.

These posts and our codebase start with an assumption that should be stated early: technical debt is a huge problem in the software industry, and one means of tackling it early is to rely on declarative technologies that promote easier reading, reasoning and maintenance. Stating this assumption up front should explain our interest in pursuing such technologies as Datalog, PyDatalog, Prolog, and CHR, but it should not deter readers from pushing through to challenge it. We will certainly revisit this premise in our last post.

Some background: The HCC Risk Model

The Centers for Medicare & Medicaid Services, commonly known as CMS, runs a series of national value-based efforts for diversifying risk for the providers and insurers who serve the population whose healthcare it pays for. Risk sharing in general (not just risk adjustment and HCC) goes by many names, such as capitation, value, and risk-sharing, but is essentially predicated on the notion that every person and organization in the ecosystem bears a responsibility for the costs, benefits, risks and rewards of healthcare. This starts with the patients themselves, via copays and premiums, and runs through the provider organizations (hospitals and systems) all the way up to the insurers and the payers (and the biggest payer of them all, CMS).

Risk sharing puts an onus on many healthcare companies to know how exposed their populations are to various types of risk, from readmission to mortality to cost. The HCC library that is the focus of this engineering effort addresses the last of these and is meant to assign a “cost risk” value to every patient that a provider bears responsibility for.

The HCC risk score is the output of a risk adjustment algorithm. It is part of a nation-wide effort to level the playing field for insurers and healthcare companies who have riskier populations, so that they may be compensated more (through what are called capitation payments) for patients who have a higher probability of incurring healthcare costs. Thus, the HCC risk score is a numeric value from 0 to 9 (or so) that provides a multiplier for how much a provider gets paid per year to ‘own’ the health and well-being of that patient – a value of 1 is a person of exactly average risk. All told, HCC scoring adjusts over $100 billion in healthcare costs per year.
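To make the multiplier idea concrete, here is a minimal sketch in Python. The base rate is an invented number used only for illustration (it is not a CMS figure), and the function name `capitation_payment` is our own hypothetical helper.

```python
# Illustration only: BASE_RATE is an invented number, not a CMS figure.
BASE_RATE = 10_000.0  # assumed yearly payment for a patient of exactly average risk


def capitation_payment(base_rate: float, risk_score: float) -> float:
    """Scale a per-patient yearly base rate by the patient's risk score."""
    if risk_score < 0:
        raise ValueError("a risk score cannot be negative")
    return base_rate * risk_score


average = capitation_payment(BASE_RATE, 1.0)      # patient of average risk
higher_risk = capitation_payment(BASE_RATE, 2.5)  # riskier patient, larger payment
lower_risk = capitation_payment(BASE_RATE, 0.5)   # healthier patient, smaller payment
```

Under these assumed numbers, a patient with a score of 2.5 attracts two and a half times the payment of an average patient.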

Attributed Risk

This HCC risk score is a function of many clinical and demographic variables, and is intuitive to understand at a high level – a person who is older and who has many diagnoses is much more likely to incur costs in the following year than someone who is younger and has no record of complicating issues.

The fact that CMS has all the data on millions of patients – their diagnoses, and what their costs and outcomes have been over the years – has allowed them to create a statistical model that correlates the independent variables of demography and diagnoses to the dependent variable of cost. This model has already been calculated/regressed and is essentially one huge function – into this function we feed demography (age, gender) and diagnoses (ICD codes) and out comes a numeric value called the HCC Risk Score. We will not be doing our own data science and adding new variables under different hypothetical models (though we certainly have done such efforts for our clients). What we are interested in doing is taking the published reference implementation and creating a calculator for our clients and for the open-source community, so that they may understand their own population prior to and during the year, and plan appropriately for the allocation of scarce care management resources.
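Seen as code, "one huge function" of this kind is just a sum of regressed coefficients over whichever indicator variables are switched on for a patient. The sketch below assumes that additive shape; the indicator names and coefficient values are hypothetical placeholders, not the published CMS values.

```python
from typing import Dict, Iterable

# Hypothetical coefficients for illustration only. The real values come from
# the regression published with the CMS reference implementation.
COEFFICIENTS: Dict[str, float] = {
    "female_70_74": 0.40,
    "male_70_74": 0.45,
    "diabetes": 0.30,
    "heart_failure": 0.35,
}


def risk_score(indicators: Iterable[str]) -> float:
    """Sum the coefficient of every indicator that is switched on.

    Duplicates are counted once; an unknown indicator raises KeyError.
    """
    # Sorting makes the floating-point summation order deterministic.
    return sum(COEFFICIENTS[name] for name in sorted(set(indicators)))


example_score = risk_score(["female_70_74", "diabetes", "heart_failure"])
```

The interesting work is therefore not the arithmetic but deciding which indicators are switched on for a given patient, which is where the rules come in.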

Function Image

Given that there is a published reference implementation, why must we reimplement? One answer is that the CMS code is published in an older statistical language called SAS – a lingua franca for many research and government organizations in healthcare and related fields. The other answer is “There is free and then there is free.” With deepest respect to the millions of statisticians and organizations who use SAS, we feel that a model like this deserves wider dissemination than a technology you have to pay to use can offer. SAS, with its yearly seat-license model and educational free-to-try models, is not truly free. Moreover, it is ungainly as a general-purpose language.

The Buried Lede

We have already reimplemented HCC in Python, in an embedded sub-language called PyDatalog, and released it over a year ago. This post investigates reimplementing it again in a different logic paradigm that emphasizes something called “forward chaining”. The language we use for this, CHR (Constraint Handling Rules), is relatively new and generally unknown outside of logic programming circles. Our current, Datalog-ish (and backward-chaining) declarative model is available here in our GitHub repository and is meant for programmers who have some experience in imperative programming and/or Python.
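For readers new to the term: forward chaining starts from the known facts and keeps firing rules until nothing new can be derived, whereas backward chaining starts from a goal and works back toward the facts. The following is a minimal sketch of the forward-chaining idea in plain Python. It is not CHR and not our implementation, and the rules and fact names are simplified examples chosen for illustration.

```python
from typing import FrozenSet, List, Set, Tuple

Rule = Tuple[FrozenSet[str], str]  # (premises, conclusion)

# Simplified, illustrative rules. The interaction rule is listed first on
# purpose: it cannot fire until the two rules after it have added their facts.
RULES: List[Rule] = [
    (frozenset({"diabetes", "heart_failure"}), "diabetes_chf_interaction"),
    (frozenset({"icd_E119"}), "diabetes"),
    (frozenset({"icd_I509"}), "heart_failure"),
]


def forward_chain(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Fire every rule whose premises hold until no new fact is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived


all_facts = forward_chain({"icd_E119", "icd_I509"}, RULES)
```

Starting from two diagnosis facts, the loop first derives the two conditions and only on a later pass derives the interaction between them.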

We released this model over a year ago in keeping with the assumption we explained earlier: that declarative programming, despite its relative scarcity in the greater programming world, is worth the effort so as to “promote easier reading, reasoning and maintenance” and thus decrease current and future technical debt. For those unfamiliar with the term technical debt, please refer to the Wikipedia article.

Having optimized our HCC codebase for clarity, we found that it came with a price (at least as our codebase is currently implemented), and that price was performance. It is currently an acceptable cost for our needs, as the calculation of our clients’ HCC risk scores has never been performance-critical and need only be done on a monthly or weekly schedule – i.e. in batch. But from an aesthetic standpoint, the performance was painful – somewhere around two or three risk scores per second on a medium CPU in AWS.

For the sake of exploring this performance issue, and allowing for a future business requirement in which a risk score must be calculated faster while still maintaining readability and declarative-oriented programming, we have reimplemented the calculator in the forward-chaining sub-genre of declarative logic programming mentioned above.

Recapitulation

Our HCC algorithm does not calculate a risk model. This risk model has already been engineered by CMS and results in a suite of independent variable coefficients and an assembly of opaque rules for triggering co-morbidities and edits based on input diagnoses (ICD codes).

Our initial PyDatalog model uses declarative logic statements to state under what conditions certain ICD codes roll up into code categories, then diagnostic categories, then hierarchical condition categories (whence the acronym HCC), and finally indicators that trigger the additive effect of the linear model. This results in a codebase that declares rules for triggering indicator variables. Here is a short snippet to give you a flavor of what we are talking about.

In a future post, we will explain these lines.
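Separately from our PyDatalog code, the roll-up idea itself can be sketched in plain Python. The following is an illustration only, not an excerpt from our repository: the diagnosis codes, category names and hierarchy are hypothetical placeholders, and the several intermediate category levels are collapsed into one for brevity.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical mapping from diagnosis codes to categories. The real tables
# ship with the CMS reference implementation.
CODE_TO_CATEGORY: Dict[str, str] = {
    "dx_001": "cat_severe",
    "dx_002": "cat_moderate",
    "dx_003": "cat_mild",
    "dx_004": "cat_other",
}

# Hypothetical hierarchy: a category that is present suppresses the less
# severe categories listed against it.
HIERARCHY: Dict[str, List[str]] = {
    "cat_severe": ["cat_moderate", "cat_mild"],
    "cat_moderate": ["cat_mild"],
}


def active_categories(codes: Iterable[str]) -> Set[str]:
    """Roll codes up into categories, then drop any suppressed category."""
    categories = {CODE_TO_CATEGORY[c] for c in codes if c in CODE_TO_CATEGORY}
    suppressed = {low for high in categories for low in HIERARCHY.get(high, [])}
    return categories - suppressed


patient_categories = active_categories(["dx_001", "dx_003", "dx_004"])
```

In this sketch the patient has both a severe and a mild diagnosis in the same family, so only the severe category survives, alongside the unrelated one. The surviving categories are what would switch on indicators in the linear model.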

Next Posts

We have seven posts planned for this series, inclusive of this introductory post. As mentioned, some posts might be better oriented to a different kind of reader.

Here is what we have planned:

Post | Purpose | Reader
This Post | To introduce why we are writing this series and invite readers to see where we are going. | Programmers, healthcare professionals.
Logic Introduction | To introduce first-order predicate calculus. To introduce Prolog. To show the natural interrelationships between these technologies and more popular paradigms. | Programmers of all kinds.
Why Forward Chaining might be better than Backchaining | To dive into the implementation strategy of the standard Prolog model, and explore under what conditions a back chaining strategy is better than forward chaining. | Programmers of all kinds.
In search of CHR | To show the thought processes we went through to eventually arrive at CHR as our forward-chaining logic system of choice. | Programmers of all kinds.
The Overall Design | To explain the diagram that is at the top of this post. | Programmers of all kinds. Healthcare architects.
A deep dive into our HCC-CHR implementation | To explain the HCC algorithm in its fullest by looking at the SAS code and how we translated it both to Python/PyDatalog and to CHR. | Programmers and healthcare professionals.
Last Thoughts | To draw some conclusions on the efficacy of the technologies and the challenges going forward. Also, to revisit our assumptions on the need for declarative programming techniques when balanced against growing an organization and the training/hiring issues. | Everybody.
