Data source

CMS Complete 2017 Program Year Open Payments Dataset

To start, we need to find the file to load into memory. Fortunately, it's easy to run shell commands directly in a notebook cell by prefixing them with `!`:

In [5]:
!ls -lh
total 14615640
-rw-rw----@ 1 paul.garaud  admin   5.9G Jun 17  2018 OP_DTL_GNRL_PGYR2017_P06292018.csv
-rw-rw----@ 1 paul.garaud  admin   1.0M Jun 17  2018 OP_DTL_OWNRSHP_PGYR2017_P06292018.csv
-rw-rw----@ 1 paul.garaud  admin   501M Jun 17  2018 OP_DTL_RSRCH_PGYR2017_P06292018.csv
-rwxrwx---@ 1 paul.garaud  admin   4.2K Jun 15  2018 OP_PGYR2017_README_P06292018.txt
-rw-r--r--@ 1 paul.garaud  admin   536M Nov  8  2018 PGYR17_P062918.ZIP
-rw-r--r--@ 1 paul.garaud  admin   465K Apr  8 08:30 cms_data_exploration.html
-rw-r--r--  1 paul.garaud  admin   145K May 17 15:49 cms_data_exploration.ipynb
-rw-r--r--@ 1 paul.garaud  admin    12K Apr  8 08:28 cms_data_exploration.md
drwxr-xr-x  6 paul.garaud  admin   192B Apr 24 14:06 cms_data_exploration_files
-rw-r--r--  1 paul.garaud  admin    12K Apr  8 08:58 jupyter_notebook_article.txt
In [9]:
file_name = 'OP_DTL_RSRCH_PGYR2017_P06292018.csv'  # you can tab-complete files

Let's have a quick peek at the first few rows before attempting to load the file into memory. First, we'll print the first 10 column names; then we'll grab the values of those columns for the first 10 records.

(Note how file_name is available inside the shell command: IPython substitutes the value of any Python variable referenced with `$` into `!` commands. It is not an environment variable.)

In [18]:
!head -n 1 $file_name | tr "," "\n" | head -n 10
Change_Type
Covered_Recipient_Type
Noncovered_Recipient_Entity_Name
Teaching_Hospital_CCN
Teaching_Hospital_ID
Teaching_Hospital_Name
Physician_Profile_ID
Physician_First_Name
Physician_Middle_Name
Physician_Last_Name
In [21]:
!head -n 10 $file_name | cut -d "," -f 1-10
Change_Type,Covered_Recipient_Type,Noncovered_Recipient_Entity_Name,Teaching_Hospital_CCN,Teaching_Hospital_ID,Teaching_Hospital_Name,Physician_Profile_ID,Physician_First_Name,Physician_Middle_Name,Physician_Last_Name
"NEW","Covered Recipient Physician",,,,,"346085","JAMES",,"BAILEN"
"NEW","Covered Recipient Physician",,,,,"486954","SHIWALI",,"RAI"
"NEW","Covered Recipient Physician",,,,,"346085","JAMES",,"BAILEN"
"NEW","Covered Recipient Physician",,,,,"159800","DANIEL",,"SMALL"
"NEW","Covered Recipient Physician",,,,,"346085","JAMES",,"BAILEN"
"NEW","Covered Recipient Physician",,,,,"1120354","FRANK",,"LEE"
"NEW","Covered Recipient Physician",,,,,"284992","ROGER",,"BEYER"
"NEW","Covered Recipient Physician",,,,,"529075","JOHN",,"MURRAY"
"NEW","Covered Recipient Physician",,,,,"302827","JEFFREY",,"MICHELSON"
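
If you'd rather stay in Python for this kind of peek, the standard-library csv module handles the quoting for you. A minimal sketch using a tiny in-memory sample (not the real file):

```python
import csv
import io

# A tiny in-memory sample standing in for the multi-gigabyte CSV on disk.
sample = (
    'Change_Type,Covered_Recipient_Type,Physician_First_Name\n'
    '"NEW","Covered Recipient Physician","JAMES"\n'
)

reader = csv.reader(io.StringIO(sample))
header = next(reader)     # column names from the first line
first_row = next(reader)  # first record, with quoting handled for us

print(header)     # ['Change_Type', 'Covered_Recipient_Type', 'Physician_First_Name']
print(first_row)  # ['NEW', 'Covered Recipient Physician', 'JAMES']
```

With the real file you would pass `open(file_name)` to `csv.reader` instead of the `StringIO` object.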

Since the file seems to be formatted properly and matches its extension, let's import a couple of Python libraries and read the data into memory.

In [23]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set_style('white')
pd.options.display.float_format = '{:,.2f}'.format

df = pd.read_csv(file_name, low_memory=False)

When working with files containing dozens of columns, it can be helpful to search for a known substring to identify columns of interest.

In [24]:
df.columns[df.columns.str.contains('Pay')]
Out[24]:
Index(['Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID',
       'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name',
       'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State',
       'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country',
       'Total_Amount_of_Payment_USDollars', 'Date_of_Payment',
       'Form_of_Payment_or_Transfer_of_Value', 'Payment_Publication_Date'],
      dtype='object')
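
As an aside, `DataFrame.filter` can do the same substring search and column selection in one step. A sketch on a toy frame that borrows a few of this dataset's column names:

```python
import pandas as pd

# Toy frame with a handful of the real dataset's column names (no rows needed).
toy = pd.DataFrame(columns=[
    'Change_Type',
    'Date_of_Payment',
    'Form_of_Payment_or_Transfer_of_Value',
    'Total_Amount_of_Payment_USDollars',
])

# filter(like=...) returns the matching columns as a DataFrame in a single step
print(list(toy.filter(like='Pay').columns))
# ['Date_of_Payment', 'Form_of_Payment_or_Transfer_of_Value', 'Total_Amount_of_Payment_USDollars']
```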

Let's explore how payments differ by recipient country.

In [25]:
# some basic munging
df['Recipient_Country'] = df.Recipient_Country.fillna('Unknown')

# summary table of recipient countries, sorted by mean payment
df.groupby('Recipient_Country')\
    .agg({'Total_Amount_of_Payment_USDollars': ['count', 'sum', 'mean']})\
    .sort_values(by=('Total_Amount_of_Payment_USDollars', 'mean'), ascending=False)
Out[25]:
                                     Total_Amount_of_Payment_USDollars
                                       count               sum      mean
Recipient_Country
Belgium                                   16        245,809.02 15,363.06
Canada                                    43        580,523.28 13,500.54
United States                         606747  4,655,011,467.42  7,672.08
Germany                                    1          6,909.00  6,909.00
Unknown                                  647      2,832,901.02  4,378.52
Poland                                     1          3,113.44  3,113.44
United States Minor Outlying Islands       3          5,436.84  1,812.28
Great Britain (Uk)                       405        241,703.85    596.80
Denmark                                    1            562.72    562.72
Japan                                      1             48.00     48.00

Unsurprisingly, the lion's share of payments went to the United States. Let's dig a bit deeper into non-US (and non-Unknown) payments. One thing we can do is look at payments over time. Are there any trends in payments to foreign countries?

Before we proceed, we should ensure that the Date_of_Payment column is stored as datetime64 (as opposed to str/object), since pandas provides a lot of functionality specifically for properly typed datetime data.

In [26]:
print(f'Date_of_Payment is of type <{df.Date_of_Payment.dtype}>.')
df['Date_of_Payment'] = pd.to_datetime(df.Date_of_Payment, format='%m/%d/%Y')
print(f'Date_of_Payment is of type <{df.Date_of_Payment.dtype}>.')
Date_of_Payment is of type <object>.
Date_of_Payment is of type <datetime64[ns]>.
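
A side note: if some date strings were malformed, `pd.to_datetime` would raise by default; `errors='coerce'` instead turns bad values into NaT, which is often handy on messy real-world files. A small sketch with made-up values:

```python
import pandas as pd

# Made-up date strings in the same MM/DD/YYYY format, plus one bad value.
dates = pd.Series(['10/12/2017', '12/11/2017', 'not a date'])

# errors='coerce' converts unparseable entries to NaT instead of raising
parsed = pd.to_datetime(dates, format='%m/%d/%Y', errors='coerce')

print(parsed.dtype)              # datetime64[ns]
print(int(parsed.isna().sum()))  # 1
```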

For the purposes of charting, it will be helpful to "pivot" the data so that each column represents the payments for a single country and the rows represent time.

At the moment, our data looks like this:

In [27]:
df.loc[:5, ['Date_of_Payment', 'Recipient_Country', 'Total_Amount_of_Payment_USDollars']]
Out[27]:
  Date_of_Payment Recipient_Country  Total_Amount_of_Payment_USDollars
0      2017-10-12     United States                           1,100.00
1      2017-12-11     United States                             535.50
2      2017-11-15     United States                             100.00
3      2017-09-14     United States                             875.00
4      2017-11-15     United States                             100.00
5      2017-12-20     United States                             162.00

After pivoting the country values into columns, the data is ready for graphing.

In [28]:
payments_by_country = df.loc[~df.Recipient_Country.str.contains('United States') & (df.Recipient_Country != 'Unknown')]\
    .groupby(['Date_of_Payment', 'Recipient_Country'])\
    .Total_Amount_of_Payment_USDollars.sum()\
    .reset_index()\
    .pivot(index='Date_of_Payment', columns='Recipient_Country', values='Total_Amount_of_Payment_USDollars')
payments_by_country.head()
Out[28]:
Recipient_Country  Belgium     Canada  Denmark  Germany  Great Britain (Uk)  Japan  Poland
Date_of_Payment
2017-01-03             nan  48,775.70      nan      nan                 nan    nan     nan
2017-01-08             nan   3,530.29      nan      nan                 nan    nan     nan
2017-01-12             nan   2,459.99      nan      nan                 nan    nan     nan
2017-01-17             nan  47,131.00      nan      nan                 nan    nan     nan
2017-01-19             nan     710.52      nan      nan               79.08    nan     nan
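
If `pivot` is unfamiliar, the reshape from long to wide format is easy to see on a tiny made-up frame:

```python
import pandas as pd

# Made-up long-format payments: one row per (date, country) pair.
long_df = pd.DataFrame({
    'Date_of_Payment': pd.to_datetime(['2017-01-03', '2017-01-03', '2017-01-08']),
    'Recipient_Country': ['Canada', 'Belgium', 'Canada'],
    'Total_Amount_of_Payment_USDollars': [100.0, 50.0, 25.0],
})

# pivot: dates become the index, countries become columns,
# and combinations with no payment become NaN
wide = long_df.pivot(index='Date_of_Payment',
                     columns='Recipient_Country',
                     values='Total_Amount_of_Payment_USDollars')

print(wide.loc['2017-01-03', 'Canada'])   # 100.0
print(wide.loc['2017-01-08', 'Belgium'])  # nan
```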

Since we're dealing with payments, and since the distribution of payments is often highly skewed, it may be helpful to apply a log transformation first.
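
To see why the log helps, consider a made-up set of payments with one large outlier:

```python
import numpy as np

# Made-up, heavily skewed "payments": mostly small, one huge outlier.
payments = np.array([10.0, 20.0, 30.0, 40.0, 100000.0])
logged = np.log(payments)

# On the raw scale the outlier dominates everything; on the log scale
# the spread collapses to a few units and the small payments stay visible.
print(payments.mean())                       # 20020.0
print(bool(logged.std() < payments.std()))   # True
```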

In [29]:
# payments for non-US countries over time
axes = payments_by_country.plot()
axes.set_yscale('log')
plt.ylabel('Total Payment (log)')
plt.legend(loc='best', ncol=3)
plt.show()

Yikes! It appears the data is quite sparse for many of the countries. Perhaps aggregating by month will improve things...

In [30]:
# payments for non-US countries over time
axes = payments_by_country\
    .groupby(pd.Grouper(freq='M'))\
    .sum()\
    .plot()
axes.set_yscale('log', nonpositive='mask')  # the kwarg was 'nonposy' in Matplotlib < 3.3
plt.ylabel('Total Payment (log)')
plt.legend(loc='best', ncol=3)
plt.show()
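
`pd.Grouper` is what does the monthly bucketing here; a toy sketch with made-up values (note that pandas ≥ 2.2 prefers `freq='ME'` over `'M'` for month-end grouping):

```python
import pandas as pd

# Toy daily series (made-up values) indexed by date.
daily = pd.DataFrame(
    {'payment': [1.0, 2.0, 4.0]},
    index=pd.to_datetime(['2017-01-03', '2017-01-28', '2017-02-15']),
)

# Group rows into calendar-month buckets and sum each bucket:
# January rows (1 + 2) collapse to 3, February stays 4.
monthly = daily.groupby(pd.Grouper(freq='M')).sum()
print(monthly['payment'].tolist())  # [3.0, 4.0]
```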

While this looks better for countries with regular payments (Canada, Great Britain), the data for many of the other countries is still rather sparse.

Let's turn our attention to the distribution of (log) payments. Do payments to foreign countries differ from payments to the US?

In [31]:
import numpy as np  # for the log transform below

axes = sns.distplot(
    np.log(df.loc[df.Recipient_Country == 'United States', 'Total_Amount_of_Payment_USDollars']),
    kde=True, hist=False, rug=True, rug_kws={'alpha': 0.01}
)
plt.ylabel('Frequency')
plt.xlabel('Total Payment (log)')
plt.show()
In [32]:
has_multiple_payments = payments_by_country.notna().sum() > 1
has_multiple_payments
Out[32]:
Recipient_Country
Belgium                True
Canada                 True
Denmark               False
Germany               False
Great Britain (Uk)     True
Japan                 False
Poland                False
dtype: bool
In [33]:
# dist by non-US country
import numpy as np  # for the log transform below

f, axes = plt.subplots(1, 3, figsize=(7, 7), sharex=True, sharey=True)
sns.despine(left=True)

for i, country in enumerate(payments_by_country.loc[:, has_multiple_payments].columns):
    sns.distplot(np.log(payments_by_country.loc[:, country].dropna()),
                 kde=True, hist=False, rug=True, rug_kws={'alpha': 0.5},
                 ax=axes[i % 3])
plt.tight_layout()
plt.show()

Hopefully this has given you a sense of what using a Jupyter Notebook is like and a small sampling of the rich functionality it provides.