We already studied the Medicare data using Data Science studio (DSS) in this video. This time, instead of looking at the impact of payments on the physician prescriptions, we will have a look at the possibility of detecting medicare fraud.
Download here a simplified DSS project using PostgreSQL.
I chose to work with HP Vertica, mainly because it processes aggregations of big datasets very quickly. The simplified project was done on PostgreSQL so that most people can set up it quickly.
To analyse medicare fraud, we will need the following datasets:
the PartD 2013 medicare dataset available here. It contains all 2013 prescriptions aggregated at the physician and drug level, as well as information on the physician (National Physician ID (NPI), name, city, postal code, ... etc.).
PartD for years 2011 and 2012 datasets are available on the Propublica website. Do not hesitate to have a look at their publications, they were a great source of inspiration for me.
Labelled data. There is a nice dataset on NPI exclusions. However, beware that being in the exclusion list is just a proxy we will use for this blog post. We won't extrapolate on our predictions results because this is a fraudster based target and not a fraud one, we will be unable to detect cases like the one of Ortiz-Butcher, a physician victim of a fraud web. As a result, this will probably limit our performance.
Payments : In the US, physicians now have to declare any payment they receive from pharmaceutical companies.
FDA datasets (optional) that will help us join drugs and active ingredients.
A first model
The most interesting dataset is PartD, in which we have for each physician (identified by its NPI) and drug, the total amount of prescriptions, total cost and total number of days prescribed. From this, we should be able to create a few features to qualify the chance of each physician to be a fraudster. I chose to concatenate the three PartD datasets and create feature out of this. I kept only the following informations.
Derive basic features is straightforward. I summed all years to get a simpler dataset and calculated the following features:
the number of different drugs prescribed
the sum of the number of prescriptions, the sum of the number of days prescribed and the sum of the total cost
the variance of these three quantities
the maximum of these three quantities
This can easily be written in SQL with:
The idea behind these features is simple : we expect fraudsters to have a higher total cost in general, because what would be the point to fraud for small amounts of money? We can validate this hypothesis by looking at the mean cost, days and number of prescriptions of non fraudsters versus fraudsters:
But things are often not that simple: some physicians prescribe a lot of drugs (General Practitioners or example) or highly expensive drugs (Oncologists, Cardiologists, ...). So a high total cost could be explained by other factors. On the other hand, some fraudsters will particularly use one drug to fraud. We can attempt to catch this by measuring the maximum and the variance of the cost.
Since we are using data from PartD between 2011 and 2013, I choose to only look at NPI excluded since 2014. Indeed, we need to have data on the fraudster, so we can exclude any date before 2011. We will also have a problem if the exclusion happened during a year we have in PartD because this could lead to biased values of PartD due to exclusion. As a result, our target is not perfect since some fraud NPIs are still in the dataset with a target value of "not fraudster". We also face the problem of having an extremely unbalanced dataset: there are only 298 fraudsters for 808022 NPIs (ie. 0.037 % of fraudsters).
We can still use our features to create simple fraud models. Using DSS, we get:
Our model does not perform very well. This is because we have very few actual examples of fraud in the dataset. This can also be observed on the ROC curve. Because we have only approximately 50 positive points for evaluation, the curve is not smooth at all. The problem does not seem to be an easy one!
Can we do better? Let's try to get some better features. Why would a physician fraud medicare? Perhaps to make more money either by selling the drugs (by making fake prescriptions) or by charging medicare too much for some drugs. The first reason raises the question of which drugs could be the fraudsters favorite : probably the very expensive one or the drugs that could be use in a "recreational way". The second one boils down to looking at outliers in the unit price of drugs distribution.
Adding some "outlier detection" features
Let's add some features to our model. As we said before we could have some physician having normal behaviour except for certain drugs. We could catch that by checking at the outliers in cost distribution or more simply at the 99 percentile of the distribution. But since some drugs are mostly prescribed by certain specialities, we have to check percentiles by drug and speciality. Now if we add +1 to a NPI score every time the corresponding physician is in the 99 percentiles for a drug and a speciality we get a new set of interesting features.
We can do the same for the unitary price distribution for each drug. Does a physician tend to charge medicare too much for a particular drug? We can easily translate that using a few window function in SQL:
Using this new features we get :
Which is not really better than previously. Maybe our features are not that well designed and we need to review our outlier definition. Let's try to replace these features with something "classical" and let's define an outlier as a point that has a greater value than the mean plus four standard deviations. It is easily done via SQL using the following query :
Now our models give:
Which seems better that the models without these outliers variables.
Impact coding drugs
As we said previously, some drugs may be expensive to get or have "recreational drug use". As a result we could assume that actually prescribing some drugs can be a feature of importance for fraud detection. To check if this can be the case, let's have a look at the list of best "impact coded" drugs.
The idea is to check the percentage of time a drug was used in a fraud. In the following table we have calculated for each drug the number of times we see it in the dataset, the number of times it is prescribed by a fraudster, the corresponding percentage and the percentage of time weighted by the total cost. The weighted percentage seems a better heuristic since it also incorporate the fact that a fraudster wants to make money out of it. If we sort by descending "cost_weighted_percent_fraud" and keep only the drugs which were prescribed by at least 5 fraudsters we have the following top list.
So what do we have ? Well the first one is Lidocaine. According to Wikipedia : "Lidocaine is often added to cocaine as a diluent. Cocaine numbs the gums when applied, and since lidocaine causes stronger numbness, a user gets the impression of high-quality cocaine when in actuality, the user is receiving a diluted product." I've seen enough Breaking Bad to know where this is going.
What else? Opana, Suboxone, Oxycodone, Oxymorphone and Methadone are addictive opioids. The last one on the list is Amphetamine which is known to be used for recreational purposes. The rest are mostly skin irritation creams and heart disease medications. I have no explanation for these other drugs but I believe there is a general pattern here.
How can we use this information for our model ? Adding the amount of lidocain or any other drug in our top would be Data Leakage since we have already used the information of fraud itself to create the list. The solution I propose is to do a modification of impact coding rather similar to what I did for the Human Robot Kaggle competion.
The idea is the following:
If I had a score for each drug, I could compute a score for each physician: the weighted score of each drug (weighted by 1 for average, weighted by cost ... etc. )
I could have a score for each drug by calculating a weighted (or not) percentage of physician who are fraudsters and prescribe the drug.
If for each physician I can calculate the drug score with all physician contributions except the one of interest, I avoid data leaks. You can do all of this in SQL because the contributions are linear so the contribution of all but one physician is actually the total minus their own contribution.
As a result we can create two new variables, score and cost_weighted score with the following SQL query (this is the kind of query that makes you love SQL) :
Note that for each line we only add information from other NPIs. As a result, we do not leak anything here. Also notice that this could be seen as a propagation of the fraud information in a bipartite graph : with physicians and drugs nodes. If we simplify this bipartite graph into a graph with all node being physicians, our score is juste a weighted percentage of fraudster neighbours in the graph ! Some other graph features could probably be derived (see these slides for example).
What do these new features do?
A very good improvement! Can we do even better?
Adding other datasets information
We can ask ourselves if there is a link between commiting a fraud and receiving large payments from pharmaceutical companies. In this post, we will simply create a feature being the total sum of payments received in 2013 without distinction of payment type or company. Since the referential is not the same we have to join by physician first name, last name, city and state. Using this new feature we get the following scores :
Turns out it actually increases the model performance! Funny thing: payments seem to be negatively correlated with fraud (according to our logistic regression coefficient). Is it because fraudsters try to stay discrete or because companies avoid physician they know to be fraudulent?
I tried some other feature engineering I won't explain in details in this post. Because the drug field could be a brand name as well as a generic drug, we can think of working with the active ingredient instead of the drug. To do so, we can use this FDA data and fuzzy join on the drug name. We can then recreate analog features (sum, maximum and variance of cost, physician impacted score, ... ) using active ingredient instead of drug. It seems to improve a bit the performance when using k-fold cross-validation and a random forest, probably because it makes each tree a bit more diverse.
A simplified version of my final flow looks like this. If you want access to this DSS project, send me an email.
It seems possible to construct basic fraud models using the Medicare data. Of course, the performance is not very good due to the lack of labelled data.
However, we still get nice insights on the most used drug for medicare fraud. The model could also be used as a help for medicare fraud detection. Since we still need people to actually look closely to detect fraud pattern, checking first NPIs with high score makes sense. Indeed, if we analyse the Lift curve below, we realise that by looking at the top 10% highly scored physicians, we retrieve 60% of fraudsters.
We focused on supervised machine learning but unsupervised methods could be of interest. Because we have very few labels, it could be worth testing more advanced outliers detection technics.
For example, in this very nice analysis, Ofer Mendelevitch (Director Data Science at Hortonworks) ran a modified version of PageRank algorithm on the physician graph, and is able to detect doctors whose prescriptions pattern seem odd. This report also gives other ideas to go deeper into the fraud analysis.
If you have features engineering ideas to test or know about other analysis on the medicare dataset, let me know @prrgutierrez !