CMS Complete 2017 Program Year Open Payments Dataset
To start, we need to find the file to load into memory. Fortunately, it's easy to run command-line commands directly in a notebook cell:
!ls -lh
file_name = 'OP_DTL_RSRCH_PGYR2017_P06292018.csv' # you can tab-complete files
Let's have a quick peek at the first few rows of data before attempting to load the data. Let's first print out the first 10 column names and then grab the values for these columns for the first 10 records.
(Note how file_name
is available as an environment variable because we defined it in the above Python cell.)
!head -n 1 $file_name | tr "," "\n" | head -n 10
!head -n 10 $file_name | cut -d "," -f 1-10
Since the file seems to be formatted properly and matches its extension, let's import a couple of Python libraries and read the data into memory.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_style('white')
pd.options.display.float_format = '{:,.2f}'.format
df = pd.read_csv('OP_DTL_RSRCH_PGYR2017_P06292018.csv', low_memory=False)
When working with files containing dozens of columns, it can be helpful to search for a known substring to identify columns of interest.
df.columns[df.columns.str.contains('Pay')]
Let's explore how payments differ by recipient country.
# some basic munging
df.loc[:, 'Recipient_Country'] = df.Recipient_Country.fillna('Unknown')
# summary table of recipient countries, sorted by mean payment
df.groupby('Recipient_Country')\
.agg({'Total_Amount_of_Payment_USDollars': [pd.np.count_nonzero, pd.np.sum, pd.np.mean]})\
.sort_values(by=('Total_Amount_of_Payment_USDollars', 'mean'), ascending=False)
Unsurprisingly, the lion's share of payments went to the United States. Let's dig a bit deeper into non-US (and non-Unknown) payments. One thing we can do is look at payments over time. Are there any trends in payments to foreign countries?
Before we proceed, we should ensure that the column Date_of_Payment
is stored as datetime64
(as opposed to a str
/object
) as pandas
has a lot of functionality available specifically for properly stored datetime data.
print(f'Date_of_Payment is of type <{df.Date_of_Payment.dtype}>.')
df.loc[:, 'Date_of_Payment'] = pd.to_datetime(df.Date_of_Payment, format='%m/%d/%Y')
print(f'Date_of_Payment is of type <{df.Date_of_Payment.dtype}>.')
For the purposes of charting, will be helpful to "pivot" the data such that each column represents the payments for a single country and the rows represent time.
At the moment, our data looks like this:
df.loc[:5, ['Date_of_Payment', 'Recipient_Country', 'Total_Amount_of_Payment_USDollars']]
After pivoting the country values into columns, the data is ready for graphing.
payments_by_country = df.loc[~df.Recipient_Country.str.contains('United States') & (df.Recipient_Country != 'Unknown')]\
.groupby(['Date_of_Payment', 'Recipient_Country'])\
.Total_Amount_of_Payment_USDollars.sum()\
.reset_index()\
.pivot(index='Date_of_Payment', columns='Recipient_Country', values='Total_Amount_of_Payment_USDollars')
payments_by_country.head()
Since we're dealing with payments, and since the distribution of payments are often highly skewed, it may be helpful to apply a log transformation first.
# payments for non-US countries over time
axes = payments_by_country.plot()
axes.set_yscale('log')
plt.ylabel('Total Payment (log)')
plt.legend(loc='best', ncol=3)
plt.show()
Yikes! It appears the data is quite sparse for many of the countries. Perhaps aggregating by month will improve things...
# payments for non-US countries over time
axes = payments_by_country\
.groupby(pd.Grouper(freq='M'))\
.sum()\
.plot()
axes.set_yscale('log', nonposy='mask')
plt.ylabel('Total Payment (log)')
plt.legend(loc='best', ncol=3)
plt.show()
While it is better for countries with regular payments (Canada, Great Britain), many of the other countries still have rather sparse data.
Let's turn our attention to the distribution of (log) payments. Do payments to foreign countries differ from payments to the US?
axes = sns.distributions.distplot(
pd.np.log(df.loc[df.Recipient_Country == 'United States', 'Total_Amount_of_Payment_USDollars']),
kde=True, hist=False, rug=True, rug_kws={'alpha': 0.01}
)
axes.plot()
plt.ylabel('Frequency')
plt.xlabel('Total Payment (log)')
plt.show()
has_multiple_payments = payments_by_country.apply(lambda c: pd.notna(c).sum() > 1, axis=0)
has_multiple_payments
# dist by non-US country
f, axes = plt.subplots(1, 3, figsize=(7, 7), sharex=True, sharey=True)
sns.despine(left=True)
for i, country in enumerate(payments_by_country.loc[:, has_multiple_payments].columns):
sns.distributions.distplot(pd.np.log(payments_by_country.loc[:, country].dropna()),
kde=True, hist=False, rug=True, rug_kws={'alpha': 0.5},
ax=axes[i % 3])
plt.tight_layout()
plt.show()
Hopefully this has given you a sense of what using a Jupyter Notebook is like and a small sampling of the rich functionality it provides.