Policy Evaluation with Python

Workflow

As a data analyst, it is essential that you keep detailed documentation of your analysis process. This means saving all of the code that you use for a project somewhere convenient so that you or anyone else can quickly and exactly reproduce your original results. In addition to saving your code, you should also explain your procedures and code so that you will always remember your data sources, what each line of code is doing, and why you analyzed the data the way that you did. Not only will this help you remember the details of your project, but it will also make it easy for others to read through your code and validate your methods. Practically, your code should be saved in a script file, though I strongly recommend using a notebook format such as a Jupyter Notebook, since it lets you combine written prose and code in a single document, exactly like this guide!

To create a Jupyter Notebook, open Terminal on macOS or the Anaconda Prompt on Windows and type jupyter notebook (this is just one of many ways to create a Jupyter notebook).

After running this command, your browser will open with an instance of Jupyter. From there, click on NEW then select the Python 3 kernel.

Step 1: Get the Policy Data

About the dataset

The Correlates of State Policy Project includes more than 900 variables, with observations across the 50 U.S. states and across time (1900–2016, approximately). These variables represent policy outputs or political, social, or economic factors that may influence policy differences. The codebook includes the variable name, a short description of the variable, the variable time frame, a longer description of the variable, and the variable source(s) and notes.

Begin by downloading the Correlates of State Policy Project codebook. After the file is downloaded, open it and review the table of contents. The table of contents simply lists the categories of variables available in the data. Scrolling down to page 3, we see examples of types of data (or variables) that are available under each category:

![](/media/Examples of Variable.png)

Unit of Analysis and Data Type

Before you begin thinking about your research question, you first need to identify what type of data the Correlates of State Policy Project contains and what its unit of analysis is. Does the data span multiple years or just a single year? Do the variables contain data on individuals, ZIP codes, counties, Congressional districts, states, countries, …? This is an essential first step because the answers to these questions narrow the range of research questions that you can answer.

For example, suppose we were interested in knowing whether going to college leads to increases in pay. In addition to a research design that would allow us to identify this causal effect, we would need data on individual people: whether or not they have a college degree and what their pay rates are. With this data, the unit of analysis would be the individual.

Let’s suppose we have this data. Now we need to decide whether we need cross-sectional data or panel data. Cross-sectional data contain observations from only a single point in time. An example would be a survey of randomly sampled respondents in a single year, say 2020. Panel data (also known as longitudinal data) contain measurements of the same variables for the same units across multiple years. An example would be surveying the same individuals every year for 5 years, giving us multiple measurements of each variable for the same individual over multiple time points. Note that we don’t need data on individuals to construct a panel dataset: we could also have a panel of, say, the GDP of all 50 states over a 10-year period.

What if we have data for multiple years, but the variables aren’t measuring the same individuals? In this case, we would have a pooled cross-section. For example, the General Social Survey (GSS) surveys a random sample of individuals every year, but it doesn’t survey the same individuals. This means that we cannot track how a given individual’s values change from one year to the next, because each year measures a different set of people.
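To make the distinction concrete, here is a small, purely illustrative sketch (the states, IDs, and values are made up):

import pandas as pd

# Panel: the SAME states are observed in every year
state_gdp_panel = pd.DataFrame(
    {"state": ["CA", "CA", "TX", "TX"],
     "year": [2000, 2001, 2000, 2001],
     "gdp": [1.3, 1.4, 0.7, 0.8]}
)

# Pooled cross-section: a DIFFERENT random sample of people each year
survey_pooled = pd.DataFrame(
    {"respondent_id": [101, 102, 201, 202],
     "year": [2000, 2000, 2001, 2001],
     "income": [52000, 48000, 61000, 39000]}
)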

Getting back to the Correlates of State Policy Project data, we can see in the “About the Correlates of State Policy Project” section that the data measure all 50 states from 1900 to 2018. Thus, the unit of analysis is the state and, because the same states are measured over time, it is a panel dataset. This should orient our research question toward state-level outcomes and programs.

Research Question and Theory

In this guide, I will consider the following research question:

Do policies that ban financial incentives for doctors to perform less costly procedures and prescribe less costly drugs affect per capita state health care expenditures?

In addition to a precise research question, you also need to develop an explanation for why you think these two variables are related. For my research question, my theory might be that financial incentives for doctors to perform less costly procedures and prescribe less costly drugs make them more likely to do those things, which ultimately reduces both private and public health care expenditures. If that is true, then banning these incentives should be associated with higher per capita health care spending.

Selecting Variables

In deciding which variables we need to answer our research question, we need to consider three different categories of variables: the outcome variable, the predictor of interest (usually the policy intervention or treatment), and controls.

The Outcome and Predictor of Interest

For my question, policies banning financial incentives for doctors will be my predictor of interest (or treatment) and state health care expenditures will be my outcome. Here are the descriptions given to each of these variables in the codebook:

| Variable Name | Variable Description | Variable Coding | Dates Available |
| --- | --- | --- | --- |
| banfaninc | Did state adopt a ban on financial incentives for doctors to perform less costly procedures/prescribe less costly drugs? | 0 = no, 1 = yes | 1996 - 2001 |
| healthspendpc | Health care expenditures per capita (in dollars), measuring spending for all privately and publicly funded personal health care services and products (hospital care, physician services, nursing home care, prescription drugs, etc.) by state of residence. | In dollars | 1991 - 2009 |

There are a few things to note about this information. First, data are available for both of these variables from 1996 to 2001. If the years of collection for these variables didn’t overlap, we wouldn’t be able to use them to answer our research question. Second, banfaninc is an indicator variable (also known as a dummy variable) that simply records a binary “yes”/“no” status. Conversely, healthspendpc is continuous, meaning that it can take any positive value (since it is measured in dollars). Note that when working with money as a variable, it is important to adjust for inflation!

Controls

Now that we have the outcome and treatment, we need to think about what controls we need. The important criterion for identifying control variables is to pick variables that affect both the treatment and outcome.

For my question I will need to identify variables that affect policies that prohibit financial incentives and affect a state’s health care spending per capita. Here are some examples:

| Variable Name | Variable Description | Variable Coding | Dates Available |
| --- | --- | --- | --- |
| sen_cont_alt | Party in control of state senate | 1 = Democrats, 0 = Republicans, .5 = split | 1937 – 2011 |
| hs_cont_alt | Party in control of state house | 1 = Democrats, 0 = Republicans, .5 = split | 1937 – 2011 |
| hou_chamber | State house ideological median | Scale is -1 (liberal) to +1 (conservative) | 1993 - 2016 |
| sen_chamber | State senate ideological median | Scale is -1 (liberal) to +1 (conservative) | 1993 - 2016 |

Perhaps Democrats are more likely to regulate the health care industry, including policies about prescription drugs, and they are also more likely than Republicans to allocate public funds for health care.

Primary Keys

Finally, in addition to your outcome, predictor of interest, and control variables, you need to keep the variables that uniquely identify each row of data. For this dataset, those are the state and the year. This means that the level of data we have is “state-year” because each row contains measures for a given state in a given year. It also means that the state and year variables together uniquely identify each row of data.

Now that we know what variables we need, we can start working with the data. Download the .csv (linked here) file containing the data and save it to your project’s folder.

Step 2: Prepare the Data for Analysis

Preparing your data for analysis is easily the most time-consuming component of any project and it’s often referred to as “data cleaning,” “data wrangling,” or “data munging.” It is essential that we know the data inside and out: what the unit of analysis is, what the variables we need are measuring, how many missing values we have and where they come from, what the distributions of the variables look like, etc. Without such a detailed understanding, we could easily reach flawed conclusions, build bad statistical models, or otherwise produce errors in our analysis.

Importing Data

Now, let’s get the data loaded into Python and start cleaning. To do so, we’ll use the pandas library, which you should already have some familiarity with.

import pandas as pd

policy_data_raw = pd.read_csv("correlatesofstatepolicyprojectv2.csv")

policy_data_raw.head()
/Users/nickjenkins/opt/anaconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3444: DtypeWarning: Columns (36,485,495,496,497,498,499,500,501,502,505,511,513,516,518,521,522,525,532,537,539,541,543,544,671,843,844,1087,1093,1124,1318,1513,1619,1620,1890) have mixed types.Specify dtype option on import or set low_memory=False.
  exec(code_obj, self.user_global_ns, self.user_ns)

year st stateno state state_fips state_icpsr poptotal popdensity popfemale pctpopfemale ... nwcirr regulation_boehmke_cogrowman regulation_boehmke_livingwill regulation_housing_directstateai regulation_housing_enabling_federal_aid regulation_rent_control regulation_rfra regulation_sedition_laws regulations_lemonlaw regulations_state_debt_limitatio
0 1900 AL 1.0 Alabama 1 41 1830000.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1900 AK 2.0 Alaska 2 81 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1900 AZ 3.0 Arizona 4 61 124000.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1900 AR 4.0 Arkansas 5 42 1314000.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1900 CA 5.0 California 6 71 1490000.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 2091 columns
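The DtypeWarning above is pandas telling us that some columns mix numeric and string values when the file is read in chunks. Since we only keep a handful of columns, it is safe to ignore here, but if you want to silence it you can read the whole file at once with low_memory = False, as the warning itself suggests:

policy_data_raw = pd.read_csv("correlatesofstatepolicyprojectv2.csv", low_memory = False)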

When importing your data for the first time, it’s a good idea to save the original data to an object whose name denotes an unchanged file. I usually do this by adding the suffix _raw to the data object; any changes that I make while cleaning are then saved to a different object. That way I can always get back to the original data if I need to without re-importing it.

Although we are working with a .csv file for this tutorial, that won’t always be the case. Luckily pandas has you covered. Here is a table of pandas functions to read in other data types:

| File Type | Python Function |
| --- | --- |
| CSV | pandas.read_csv |
| Excel (.xls & .xlsx) | pandas.read_excel |
| SPSS | pandas.read_spss |
| Stata | pandas.read_stata |
| SAS | pandas.read_sas |
| Feather | pandas.read_feather |
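For example, if the same dataset were distributed as a Stata file instead, the import would look like this (the .dta file name here is hypothetical):

policy_data_raw = pd.read_stata("correlatesofstatepolicyprojectv2.dta")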

There are a lot of variables in this dataset.

# number of columns
policy_data_raw.shape[1]

# number of rows
policy_data_raw.shape[0]
6120

There are 2,091 variables (assuming each column is one variable) and 6,120 rows! Let’s make this more manageable by selecting only the variables we actually care about. We’ll do this by indexing the DataFrame with the columns that we want:

policy_data = policy_data_raw[["year", "state", "healthspendpc", "banfaninc", "sen_cont_alt", "hs_cont_alt",
                               "hou_chamber", "sen_chamber"]]

policy_data.head()

year state healthspendpc banfaninc sen_cont_alt hs_cont_alt hou_chamber sen_chamber
0 1900 Alabama NaN NaN NaN NaN NaN NaN
1 1900 Alaska NaN NaN NaN NaN NaN NaN
2 1900 Arizona NaN NaN NaN NaN NaN NaN
3 1900 Arkansas NaN NaN NaN NaN NaN NaN
4 1900 California NaN NaN NaN NaN NaN NaN

Now that our data is a more manageable size, we can start to think more critically about it. One of the first things we want to know is whether the data is in a usable format. What is a usable format? We want the data to be what is known as “tidy.” The concept of tidy data is a philosophy of data first explained in an article in the Journal of Statistical Software by Hadley Wickham called Tidy Data. The tidy data philosophy has three components:

  1. Each variable has its own column.
  2. Each observation has its own row.
  3. Each value has its own cell.

Visually, these three principles look like this:

Source: R for Data Science Ch. 12

There is a lot more to be said about tidy data and I encourage you to read the article or Chapter 12 of R for Data Science.

Reflect:

Does our data meet the three conditions of tidy data? Take a minute to think about it.

I say yes. This data looks good. If it wasn’t, we might need to think about how to reshape our data.
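As a quick illustration of what reshaping can look like, here is a minimal sketch (with made-up enrollment numbers) that uses pd.melt() to turn year columns into a single year variable, giving one row per state-year:

untidy = pd.DataFrame(
    {"state": ["CA", "TX"],
     "2000": [1521, 6731],
     "2001": [1231, 6512]}
)

# one row per state-year, with "year" and "enrollments" columns
tidy = untidy.melt(id_vars = "state", var_name = "year", value_name = "enrollments")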

Missing Data

Missing data present a challenge to data scientists because they can influence analyses. When we import a dataset that has empty cells, pandas will automatically populate those cells with NaN, which means missing. These are known as explicit missing values because we can see them in the data. When you go to create a plot, estimate a regression, or do anything else, these values will be dropped from the dataset. That’s not necessarily a bad thing, but it’s important to know that it’s happening.

The more problematic situation is when missing data are represented with a numeric value like -99 instead of an empty cell. In that case, Python won’t know that -99 is a missing value - it will treat it as a legitimate number. If you were to run your regression with the -99s in the data, it would seriously distort your results. Before making any inferences with our data, we need to make sure we don’t have any values like this in the dataset!

If we knew about values like this ahead of time, we could fix them with the na_values argument of the pd.read_csv() function. Usually the codebook will tell us how missing values are coded. If they were coded as -99, we could fix them on import with:

data_raw = pd.read_csv("correlatesofstatepolicyprojectv2_2.csv", na_values = -99)
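If we only discovered a sentinel code like -99 after importing, we could also convert it to a proper missing value at that point. A minimal sketch using .replace() and NumPy's np.nan:

import numpy as np

data_clean = data_raw.replace(-99, np.nan)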

Looking through the codebook, I don’t see any mention of a missing value coding scheme, so we’ll need to look for unusual values in the data with some descriptive statistics:

policy_data.describe()

year healthspendpc banfaninc sen_cont_alt hs_cont_alt hou_chamber sen_chamber
count 6120.000000 969.000000 4900.000000 3638.000000 3639.000000 1007.000000 1018.000000
mean 1959.500000 4513.422631 0.078980 0.570093 0.606760 0.067393 0.069445
std 34.642644 1567.088257 0.269734 0.489827 0.485149 0.616347 0.602112
min 1900.000000 75.529700 0.000000 0.000000 0.000000 -1.465000 -1.449000
25% 1929.750000 3198.000000 0.000000 0.000000 0.000000 -0.462500 -0.462250
50% 1959.500000 4198.000000 0.000000 1.000000 1.000000 0.195000 0.195000
75% 1989.250000 5683.000000 0.000000 1.000000 1.000000 0.605000 0.613000
max 2019.000000 10349.000000 1.000000 1.000000 1.000000 1.228000 1.242000

The minimums and maximums all look reasonable here so I think we’re ok. It would also be a good idea to create some plots as another way to check for unusual values.
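For instance, a quick histogram of the spending variable (using pandas' built-in plotting, which relies on matplotlib) makes extreme values easy to spot:

policy_data["healthspendpc"].plot.hist(bins = 30)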

The other type of missing values to be aware of are “implicitly” missing values. To get a sense of what an implicit missing value is, take a look at this example DataFrame:

df = pd.DataFrame(
    {"state": ["CA", "CA", "AZ", "AZ", "TX"],
    "year": [2000, 2001, 2000, 2001, 2000],
    "enrollments": [1521, 1231, 3200, 2785, 6731]}
)
df.head()

state year enrollments
0 CA 2000 1521
1 CA 2001 1231
2 AZ 2000 3200
3 AZ 2001 2785
4 TX 2000 6731

What do you notice about Texas (TX)? It only has data for 2000. Why don’t we have data for 2001? Well, I don’t know! And that’s the problem. It could be random, but it might not be. Moreover, because the row for Texas in the year 2001 doesn’t exist in the data, it’s technically missing even though we don’t see a NaN value representing it. That’s why it’s implicitly, not explicitly, missing.

We can make implicitly missing values explicit in Python with the .complete() method from pyjanitor:

import janitor
df.complete("state", "year")

state year enrollments
0 CA 2000 1521.0
1 CA 2001 1231.0
2 AZ 2000 3200.0
3 AZ 2001 2785.0
4 TX 2000 6731.0
5 TX 2001 NaN
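If you would rather not install pyjanitor, a plain-pandas sketch that accomplishes the same thing is to reindex on the full state-year grid:

full_grid = pd.MultiIndex.from_product(
    [df["state"].unique(), df["year"].unique()], names = ["state", "year"]
)

df.set_index(["state", "year"]).reindex(full_grid).reset_index()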

Getting back to our data, we’ve seen that there aren’t any extreme values, but which columns have missing values? We can check with the .isnull() and .any() methods:

policy_data.isnull().any()
year             False
state            False
healthspendpc     True
banfaninc         True
sen_cont_alt      True
hs_cont_alt       True
hou_chamber       True
sen_chamber       True
dtype: bool

We would also want to check how many missing values there are in each column:

policy_data.isnull().sum()
year                0
state               0
healthspendpc    5151
banfaninc        1220
sen_cont_alt     2482
hs_cont_alt      2481
hou_chamber      5113
sen_chamber      5102
dtype: int64

To check for any implicitly missing values, we can count how many years each state has. If all the missing values are explicit, then there should be the same number of rows for each state. We can check by chaining multiple methods together like this:

policy_data.groupby(["state"]).agg("count")["year"]
state
Alabama                 120
Alaska                  120
Arizona                 120
Arkansas                120
California              120
Colorado                120
Connecticut             120
Delaware                120
District of Columbia    120
Florida                 120
Georgia                 120
Hawaii                  120
Idaho                   120
Illinois                120
Indiana                 120
Iowa                    120
Kansas                  120
Kentucky                120
Louisiana               120
Maine                   120
Maryland                120
Massachusetts           120
Michigan                120
Minnesota               120
Mississippi             120
Missouri                120
Montana                 120
Nebraska                120
Nevada                  120
New Hampshire           120
New Jersey              120
New Mexico              120
New York                120
North Carolina          120
North Dakota            120
Ohio                    120
Oklahoma                120
Oregon                  120
Pennsylvania            120
Rhode Island            120
South Carolina          120
South Dakota            120
Tennessee               120
Texas                   120
Utah                    120
Vermont                 120
Virginia                120
Washington              120
West Virginia           120
Wisconsin               120
Wyoming                 120
Name: year, dtype: int64
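Rather than scanning the whole list by eye, we could also check programmatically that every state has the same number of rows and that state and year together uniquely identify each row (a quick sketch):

# True if every state appears the same number of times
policy_data.groupby("state")["year"].count().nunique() == 1

# True if any state-year combination is duplicated
policy_data.duplicated(subset = ["state", "year"]).any()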

What should we do with missing values?

There really isn’t a “right” way to handle missing values, so you just need to think carefully about whatever decision you make. The solution that we’ll use here is simply to drop all missing values. Why? Missing values can’t be put in a plot or used in a regression, so we don’t necessarily need them. Theoretically, missing values are important because they could mean something more than just “no data.” For example, data might be systematically missing for a particular group of people. That would definitely affect our analyses and we would want to address it. That’s a very different situation than having a value that is missing for a random, non-systematic reason.

Let’s assume that we’ve spent time thinking about why our data is missing and we have decided that dropping the missing values from our dataset is the best option. We can do that with the .dropna() method:

policy_data = policy_data.dropna()

However, it is a much better idea to only drop rows that are missing the outcome variable. Doing so will ensure that we keep the maximum amount of usable data.

policy_data = policy_data.dropna(subset = ["healthspendpc"])

After dropping NaN values, it’s a good idea to check that it worked:

policy_data.isnull().any()
year             False
state            False
healthspendpc    False
banfaninc        False
sen_cont_alt     False
hs_cont_alt      False
hou_chamber      False
sen_chamber      False
dtype: bool

Renaming Variables

With the missing values out of the way, let’s turn to assigning more informative names to our variables. The variable names in our dataset aren’t too bad, but they could be better. It’s fairly common to run into cryptic variable names like CMP23A__1 when working with data, and names like that only make things harder for you, as the data scientist, and for anyone else trying to understand the data. In those situations it’s a good idea to give your variables names that are more easily understood.

There are some rules (more like guidelines) for variable names, however. First, we never want spaces. Senate Chamber would be a bad name because of the space; spaces just make variables harder to work with. Instead, use an underscore (_) in place of a space. Second, variables are easier to work with when they are all lower case. Python is a case-sensitive language, meaning that it sees x as different from X. So, instead of Senate_Chamber, we’ll just use senate_chamber. Finally, try not to make variable names too long. Sometimes it’s hard to use short names when the variables are complicated, but do your best - working with senate_chamber_california_legislative_session_2022 is a bit cumbersome to type out.
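If you inherit a dataset with many messy column names, you can often fix most of them in one pass with pandas' string methods. A sketch with hypothetical names:

messy = pd.DataFrame(columns = ["Senate Chamber", "House Chamber"])
messy.columns = messy.columns.str.lower().str.replace(" ", "_")
# columns are now: senate_chamber, house_chamber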

Reflect:

What does the sen_chamber variable measure?

If you had to go look it up, it probably needs a new name! To rename variables in pandas we use the .rename() method, where we map each old name to its new name like this:

policy_data = policy_data.rename(columns = {"healthspendpc": "health_spend_per_cap",
                                            "banfaninc": "fin_incentive_ban",
                                            "sen_cont_alt": "senate_prty_control",
                                            "hs_cont_alt": "house_prty_control",
                                            "hou_chamber": "house_ideology",
                                            "sen_chamber": "senate_ideology"})
policy_data.head()

year state health_spend_per_cap fin_incentive_ban senate_prty_control house_prty_control house_ideology senate_ideology
4746 1993 Arkansas 2609.0 0.0 1.0 1.0 0.182 0.064
4747 1993 California 2948.0 0.0 1.0 1.0 -0.968 -0.720
4753 1993 Georgia 2867.0 0.0 1.0 1.0 -0.187 -0.270
4758 1993 Iowa 2837.0 0.0 1.0 0.0 0.102 -0.476
4783 1993 South Carolina 2612.0 0.0 1.0 1.0 0.214 0.052

Variable Types

Python allows for different types of data and it’s important to ensure that our variables have the right types. There are three main types that you will use as a data scientist:

  1. numeric: standard numeric data with some defined level of precision
  2. strings: character data like words or sentences (these are listed as objects in Python)
  3. categorical: data that has discrete categories, can be ordered or unordered

To check what types of variables we have in a DataFrame, we use the .dtypes attribute:

policy_data.dtypes
year                      int64
state                    object
health_spend_per_cap    float64
fin_incentive_ban       float64
senate_prty_control     float64
house_prty_control      float64
house_ideology          float64
senate_ideology         float64
dtype: object

Let’s see if these data types make sense:

  • year: int64 which is a numeric type so that’s good
  • state: object which indicates a string
  • health_spend_per_cap: float64 which is another numeric type

These all seem good. But fin_incentive_ban, senate_prty_control, and house_prty_control probably shouldn’t be a continuous numeric type like float64 because they represent categories. Does the state have a financial incentive ban (fin_incentive_ban)? Yes or no? Who controls the senate in a given year (senate_prty_control)? The Democrats or Republicans? So, these should be categorical data. We can change the type with the .assign() method, which is used to create new variables. Essentially, we are replacing the old fin_incentive_ban with a new categorical version of fin_incentive_ban:

policy_data = policy_data.assign(fin_incentive_ban = pd.Categorical(policy_data["fin_incentive_ban"]))

Now, if we check the data types again, we will see that fin_incentive_ban is now a categorical data type with category values of either 0 or 1:

policy_data.dtypes
year                       int64
state                     object
health_spend_per_cap     float64
fin_incentive_ban       category
senate_prty_control      float64
house_prty_control       float64
house_ideology           float64
senate_ideology          float64
dtype: object

Creating and Manipulating Variables

Alright, that looks good but we can make it better. What if instead of saying either 0 or 1, we had it say “Yes” or “No”? That would make the data easier to understand. Essentially, what we want to do is change the names of the categories. Luckily pandas has a method for that! It’s called .rename_categories(). Let’s see how it works.

policy_data

year state health_spend_per_cap fin_incentive_ban senate_prty_control house_prty_control house_ideology senate_ideology
4746 1993 Arkansas 2609.0 0.0 1.0 1.0 0.182 0.064
4747 1993 California 2948.0 0.0 1.0 1.0 -0.968 -0.720
4753 1993 Georgia 2867.0 0.0 1.0 1.0 -0.187 -0.270
4758 1993 Iowa 2837.0 0.0 1.0 0.0 0.102 -0.476
4783 1993 South Carolina 2612.0 0.0 1.0 1.0 0.214 0.052
... ... ... ... ... ... ... ... ...
5602 2009 Texas 5924.0 1.0 0.0 0.0 0.566 0.779
5604 2009 Vermont 7635.0 1.0 1.0 1.0 -0.754 -0.721
5606 2009 Washington 6782.0 0.0 1.0 1.0 -0.794 -0.623
5608 2009 Wisconsin 7233.0 0.0 1.0 1.0 -0.798 -0.880
5609 2009 Wyoming 7040.0 0.0 0.0 0.0 0.304 0.353

687 rows × 8 columns

policy_data["fin_incentive_ban"] = policy_data["fin_incentive_ban"].cat.rename_categories({0: "No", 1: "Yes"})
policy_data

year state health_spend_per_cap fin_incentive_ban senate_prty_control house_prty_control house_ideology senate_ideology
4746 1993 Arkansas 2609.0 No 1.0 1.0 0.182 0.064
4747 1993 California 2948.0 No 1.0 1.0 -0.968 -0.720
4753 1993 Georgia 2867.0 No 1.0 1.0 -0.187 -0.270
4758 1993 Iowa 2837.0 No 1.0 0.0 0.102 -0.476
4783 1993 South Carolina 2612.0 No 1.0 1.0 0.214 0.052
... ... ... ... ... ... ... ... ...
5602 2009 Texas 5924.0 Yes 0.0 0.0 0.566 0.779
5604 2009 Vermont 7635.0 Yes 1.0 1.0 -0.754 -0.721
5606 2009 Washington 6782.0 No 1.0 1.0 -0.794 -0.623
5608 2009 Wisconsin 7233.0 No 1.0 1.0 -0.798 -0.880
5609 2009 Wyoming 7040.0 No 0.0 0.0 0.304 0.353

687 rows × 8 columns

That’s much easier to understand. Still skeptical that labeling the values makes things easier? Pause and think about what a value of 1 means for senate_prty_control. Did you have to look it up? I know I did! If the values were labeled, we wouldn’t need to look them up.

Now let’s use these tools to clean up the other categorical variables (senate_prty_control and house_prty_control). Before we do though, we need to check how many categories there are in these variables:

policy_data["senate_prty_control"].unique()
policy_data["house_prty_control"].unique()
array([1. , 0. , 0.5])

Ok, now let’s look in the codebook to figure out what these values mean and label them appropriately.

policy_data = policy_data.assign(senate_prty_control = pd.Categorical(policy_data["senate_prty_control"]),
                                 house_prty_control = pd.Categorical(policy_data["house_prty_control"]))

policy_data["senate_prty_control"] = policy_data["senate_prty_control"].cat.rename_categories({0: "Republican",
                                                                                               0.5: "Split",
                                                                                               1: "Democrat"})

policy_data["house_prty_control"] = policy_data["house_prty_control"].cat.rename_categories({0: "Republican",
                                                                                             0.5: "Split",
                                                                                             1: "Democrat"})

policy_data

year state health_spend_per_cap fin_incentive_ban senate_prty_control house_prty_control house_ideology senate_ideology
4746 1993 Arkansas 2609.0 No Democrat Democrat 0.182 0.064
4747 1993 California 2948.0 No Democrat Democrat -0.968 -0.720
4753 1993 Georgia 2867.0 No Democrat Democrat -0.187 -0.270
4758 1993 Iowa 2837.0 No Democrat Republican 0.102 -0.476
4783 1993 South Carolina 2612.0 No Democrat Democrat 0.214 0.052
... ... ... ... ... ... ... ... ...
5602 2009 Texas 5924.0 Yes Republican Republican 0.566 0.779
5604 2009 Vermont 7635.0 Yes Democrat Democrat -0.754 -0.721
5606 2009 Washington 6782.0 No Democrat Democrat -0.794 -0.623
5608 2009 Wisconsin 7233.0 No Democrat Democrat -0.798 -0.880
5609 2009 Wyoming 7040.0 No Republican Republican 0.304 0.353

687 rows × 8 columns

Perfect.

Sometimes when we have categorical data, we would rather work with a single value of the category instead of every possible category. An easy way to accomplish this is to create an indicator, or dummy, variable. For example, suppose we just wanted to know whether the Senate was controlled by the Democrats in a given year. To do this, we could create an indicator variable equal to 1 if senate_prty_control equals “Democrat”. The NumPy library provides an easy way to do this with the np.where() function (note that .assign() returns a new DataFrame, so to keep this column we would need to assign the result back to policy_data):

import numpy as np

policy_data.assign(senate_dem_control = np.where(policy_data["senate_prty_control"] == "Democrat", 1, 0))

year state health_spend_per_cap fin_incentive_ban senate_prty_control house_prty_control house_ideology senate_ideology senate_dem_control
4746 1993 Arkansas 2609.0 No Democrat Democrat 0.182 0.064 1
4747 1993 California 2948.0 No Democrat Democrat -0.968 -0.720 1
4753 1993 Georgia 2867.0 No Democrat Democrat -0.187 -0.270 1
4758 1993 Iowa 2837.0 No Democrat Republican 0.102 -0.476 1
4783 1993 South Carolina 2612.0 No Democrat Democrat 0.214 0.052 1
... ... ... ... ... ... ... ... ... ...
5602 2009 Texas 5924.0 Yes Republican Republican 0.566 0.779 0
5604 2009 Vermont 7635.0 Yes Democrat Democrat -0.754 -0.721 1
5606 2009 Washington 6782.0 No Democrat Democrat -0.794 -0.623 1
5608 2009 Wisconsin 7233.0 No Democrat Democrat -0.798 -0.880 1
5609 2009 Wyoming 7040.0 No Republican Republican 0.304 0.353 0

687 rows × 9 columns

The dataset is looking really good, but there is one last thing to do. Because we are going to use spending on health care as a variable in our model, we need to adjust it for inflation. To do that we need data from the Consumer Price Index (CPI) and the formula:

$$ \text{Adjusted Value} = \text{Actual Value} \times \frac{\text{CPI in Target Year}}{\text{CPI in Original Year}} $$

To get the CPI data, we’ll use the cpi library. You’ll need to install it with pip install cpi or conda install cpi in your command line prompt. How much is $1 in 1950 worth today?

import cpi

cpi.inflate(value = 1, year_or_month = 1950)
11.243568464730291

What if you bought a hamburger for $0.25 in 1943 (the price of a McDonald’s hamburger and fries at the time)?

cpi.inflate(value = 0.25, year_or_month = 1943)
3.9157514450867055

Ok, let’s use cpi.inflate() to adjust the dollar values in our data, using 1996 as the year to adjust to because that’s the first year of our treatment. Note that we’ll need to use the .apply() method with the axis = 1 argument, which applies a function to each row of the DataFrame:

policy_data["real_health_spend_per_cap"] = policy_data.apply(
    lambda x: cpi.inflate(x.health_spend_per_cap, x.year, to=1996), axis = 1
)

policy_data

year state health_spend_per_cap fin_incentive_ban senate_prty_control house_prty_control house_ideology senate_ideology real_health_spend_per_cap
4746 1993 Arkansas 2609.0 No Democrat Democrat 0.182 0.064 2832.886505
4747 1993 California 2948.0 No Democrat Democrat -0.968 -0.720 3200.977163
4753 1993 Georgia 2867.0 No Democrat Democrat -0.187 -0.270 3113.026298
4758 1993 Iowa 2837.0 No Democrat Republican 0.102 -0.476 3080.451903
4783 1993 South Carolina 2612.0 No Democrat Democrat 0.214 0.052 2836.143945
... ... ... ... ... ... ... ... ... ...
5602 2009 Texas 5924.0 Yes Republican Republican 0.566 0.779 4332.472254
5604 2009 Vermont 7635.0 Yes Democrat Democrat -0.754 -0.721 5583.799065
5606 2009 Washington 6782.0 No Democrat Democrat -0.794 -0.623 4959.964016
5608 2009 Wisconsin 7233.0 No Democrat Democrat -0.798 -0.880 5289.799429
5609 2009 Wyoming 7040.0 No Republican Republican 0.304 0.353 5148.650349

687 rows × 9 columns

Finally, when working with data that is strictly positive and continuous it is a good idea to use a $\log$ transformation to make the distribution a bit more normal. This helps with the model fit.

policy_data = policy_data.assign(log_real_health_spend_cap = np.log(policy_data["real_health_spend_per_cap"]))

policy_data

year state health_spend_per_cap fin_incentive_ban senate_prty_control house_prty_control house_ideology senate_ideology real_health_spend_per_cap log_real_health_spend_cap
4746 1993 Arkansas 2609.0 No Democrat Democrat 0.182 0.064 2832.886505 7.949051
4747 1993 California 2948.0 No Democrat Democrat -0.968 -0.720 3200.977163 8.071211
4753 1993 Georgia 2867.0 No Democrat Democrat -0.187 -0.270 3113.026298 8.043351
4758 1993 Iowa 2837.0 No Democrat Republican 0.102 -0.476 3080.451903 8.032832
4783 1993 South Carolina 2612.0 No Democrat Democrat 0.214 0.052 2836.143945 7.950201
... ... ... ... ... ... ... ... ... ... ...
5602 2009 Texas 5924.0 Yes Republican Republican 0.566 0.779 4332.472254 8.373894
5604 2009 Vermont 7635.0 Yes Democrat Democrat -0.754 -0.721 5583.799065 8.627625
5606 2009 Washington 6782.0 No Democrat Democrat -0.794 -0.623 4959.964016 8.509154
5608 2009 Wisconsin 7233.0 No Democrat Democrat -0.798 -0.880 5289.799429 8.573536
5609 2009 Wyoming 7040.0 No Republican Republican 0.304 0.353 5148.650349 8.546490

687 rows × 10 columns

Going Deeper: Data Cleaning in a Real Workflow

We did a lot to clean this data, and we took our time doing it. Obviously, we went slowly so that you could learn each step and think critically about the code. In a real workflow, however, there’s a better way to write code than taking each step one at a time: method chaining. Method chaining is a way of writing code where we string multiple steps together instead of writing each step individually. Here is what my code would look like if I were doing these steps for an actual project:

# import libraries
import pandas as pd
import numpy as np

# import data
policy_data_raw = pd.read_csv("correlatesofstatepolicyprojectv2.csv")

# select variables, rename them, convert types, and create a Democrat indicator
policy_data = policy_data_raw \
                .loc[:, ["year", "state", "healthspendpc", "banfaninc", "sen_cont_alt", 
                               "hs_cont_alt", "hou_chamber", "sen_chamber"]] \
                .rename(columns = {"healthspendpc": "health_spend_per_cap", 
                                   "banfaninc": "fin_incentive_ban",
                                   "sen_cont_alt": "senate_prty_control",
                                   "hs_cont_alt": "house_prty_control",
                                   "hou_chamber": "house_ideology",
                                   "sen_chamber": "senate_ideology"}) \
                .assign(fin_incentive_ban = lambda x: pd.Categorical(x.fin_incentive_ban),
                        senate_prty_control = lambda x: pd.Categorical(x.senate_prty_control),
                        house_prty_control = lambda x: pd.Categorical(x.house_prty_control)) \
                .assign(senate_prty_control = lambda x: x.senate_prty_control.cat.rename_categories({0: "Republican", 0.5: "Split", 1: "Democrat"}),
                        house_prty_control = lambda x: x.house_prty_control.cat.rename_categories({0: "Republican", 0.5: "Split", 1: "Democrat"})) \
                .assign(senate_dem_control = lambda x: np.where(x.senate_prty_control == "Democrat", 1, 0))

# examine missing values
policy_data.describe()

# look for implicitly missing values
policy_data.groupby(["state"])["year"].agg("count")

# drop missing values
policy_data = policy_data.dropna(subset = ["health_spend_per_cap"])

# adjust for inflation
import cpi

policy_data = policy_data.assign(
        real_health_spend_per_cap = policy_data.apply(
                lambda x: cpi.inflate(x.health_spend_per_cap, x.year, to=1996), axis = 1
        ),
        log_real_health_spend_cap = lambda x: np.log(x.real_health_spend_per_cap)
)

policy_data

year state health_spend_per_cap fin_incentive_ban senate_prty_control house_prty_control house_ideology senate_ideology senate_dem_control real_health_spend_per_cap log_real_health_spend_cap
4641 1991 Alabama 2571.0000 0.0 Democrat Democrat NaN NaN 1 2961.746696 7.993534
4642 1991 Alaska 2589.0000 0.0 Split Democrat NaN NaN 0 2982.482379 8.000511
4643 1991 Arizona 2457.0000 0.0 Democrat Republican NaN NaN 1 2830.420705 7.948181
4644 1991 Arkansas 2380.0000 0.0 Democrat Democrat NaN NaN 1 2741.718062 7.916340
4645 1991 California 2677.0000 0.0 Democrat Democrat NaN NaN 1 3083.856828 8.033936
... ... ... ... ... ... ... ... ... ... ... ...
5605 2009 Virginia 6286.0000 0.0 Democrat Republican NaN NaN 1 4597.218195 8.433207
5606 2009 Washington 6782.0000 0.0 Democrat Democrat -0.794 -0.623 1 4959.964016 8.509154
5607 2009 West Virginia 75.5297 1.0 Democrat Democrat NaN NaN 1 55.238070 4.011652
5608 2009 Wisconsin 7233.0000 0.0 Democrat Democrat -0.798 -0.880 1 5289.799429 8.573536
5609 2009 Wyoming 7040.0000 0.0 Republican Republican 0.304 0.353 0 5148.650349 8.546490

969 rows × 11 columns

You might have noticed that I introduced a new operator here: lambda. A lambda is an anonymous function, and inside .assign() it lets us refer to columns created earlier in the same chain when calculating new ones. Check out this article to learn more: https://www.educba.com/pandas-lambda/.
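As a minimal illustration (with hypothetical columns), a lambda passed to .assign() receives the DataFrame as its argument, so later columns can be built from earlier ones:

example = pd.DataFrame({"x": [1, 2, 3]})
example.assign(double_x = lambda d: d.x * 2,
               double_x_plus_one = lambda d: d.double_x + 1)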

For a real project, I would also save the cleaned data to my project’s working directory:

policy_data.to_csv("Policy Project/Data/policy_data.csv")

Step 3: Exploratory Data Analysis

Before running any models, it is important that you become an expert on your data. This is easy with a small dataset, but can get harder when you are working with a lot of variables. Here are some examples of questions you should ask about your data:

  • What do the distributions of my variables look like?
  • What are means and standard deviations?
  • What is the relationship between variables?

Answering these questions will help you identify errors in your data (errors that could lead to erroneous regression results), help you understand what your data measures and the types of questions you can ask, and help you generate hypotheses.

There are essentially two ways of conducting EDA: descriptive methods and visual methods. We’ll explore both here.

Descriptive Methods

When exploring data descriptively, we are mainly interested in calculating various statistics over the data and possibly within groups. In general, to conduct descriptive EDA we need to know how to:

  • Summarize and aggregate data
  • Filter data
  • Arrange data

Let’s see some examples of these.

Summarizing and Aggregating Data

With our interest in predicting health care spending, we might want to know the average amount of health care spending in the US. We’ve already seen how to do this with the .describe() method, but we could also use .mean() and .median().

policy_data.real_health_spend_per_cap.mean()
3976.44511145127
policy_data.real_health_spend_per_cap.median()
3846.8745644599308

What if we want to know the average spending by state? To do that, we need to use the .groupby() and .aggregate() methods to group the data by state, calculate the statistic we want within each state, and then display the results for each state.

policy_data \
    .groupby("state") \
    .aggregate({"real_health_spend_per_cap": "mean"}) \
    .head(5)

real_health_spend_per_cap
state
Alabama 3792.283564
Alaska 4502.528923
Arizona 3222.403098
Arkansas 3552.535751
California 3596.155894

How does House ideology differ by states?

policy_data \
    .groupby("state") \
    .aggregate({"house_ideology": "mean"}) \
    .head(5)

house_ideology
state
Alabama 0.086357
Alaska 0.693143
Arizona 0.459429
Arkansas 0.231250
California -0.950059

Filtering Data

It’s important to be able to calculate statistics by groups, but sometimes we want to pull out specific values. For example, we might want to know which state had the highest amount of health care spending, or which has the most liberal House. This is when we need filters. So, let’s see which state-year has the highest level of spending:

policy_data.loc[
    policy_data.real_health_spend_per_cap == policy_data.real_health_spend_per_cap.max(), 
    ["year", "state", "real_health_spend_per_cap"]
]

year state real_health_spend_per_cap
5567 2009 District of Columbia 7568.662282

Which state has the highest average spending? We’ll need to use our .groupby() skills for that. The state with the highest average spending is:

policy_data \
    .groupby("state") \
    .aggregate({"real_health_spend_per_cap": "mean"}) \
    .loc[:, "real_health_spend_per_cap"] \
    .nlargest(n=1)
state
District of Columbia    6302.281066
Name: real_health_spend_per_cap, dtype: float64

What if we only wanted to look at the data for California? We can do that like this:

policy_data.loc[policy_data.state == "California", :]

year state health_spend_per_cap fin_incentive_ban senate_prty_control house_prty_control house_ideology senate_ideology senate_dem_control real_health_spend_per_cap log_real_health_spend_cap
4645 1991 California 2677.0 0.0 Democrat Democrat NaN NaN 0 3083.856828 8.033936
4696 1992 California 2847.0 0.0 Democrat Democrat NaN NaN 0 3183.851033 8.065847
4747 1993 California 2948.0 0.0 Democrat Democrat -0.968 -0.720 0 3200.977163 8.071211
4798 1994 California 3008.0 0.0 Democrat Democrat -0.968 -0.720 0 3184.582996 8.066077
4849 1995 California 3069.0 0.0 Democrat Split 0.377 -0.773 0 3159.620079 8.058207
4900 1996 California 3144.0 1.0 Democrat Republican 0.377 -0.773 0 3144.000000 8.053251
4951 1997 California 3213.0 1.0 Democrat Democrat -0.848 -0.835 0 3140.932710 8.052275
5002 1998 California 3389.0 1.0 Democrat Democrat -0.848 -0.835 0 3262.172393 8.090149
5053 1999 California 3486.0 1.0 Democrat Democrat -0.956 -1.004 0 3283.033613 8.096523
5104 2000 California 3602.0 1.0 Democrat Democrat -0.956 -1.004 0 3281.961672 8.096197
5155 2001 California 3847.0 1.0 Democrat Democrat -1.069 -1.137 0 3408.211745 8.133943
5206 2002 California 4144.0 1.0 Democrat Democrat -1.069 -1.137 0 3614.194553 8.192624
5257 2003 California 4499.0 1.0 Democrat Democrat -1.295 -1.137 0 3836.375543 8.252283
5308 2004 California 4777.0 1.0 Democrat Democrat -1.295 -1.137 0 3967.767602 8.285959
5359 2005 California 5098.0 1.0 Democrat Democrat -1.207 -1.180 0 4095.628264 8.317675
5410 2006 California 5384.0 1.0 Democrat Democrat -1.207 -1.180 0 4190.226190 8.340510
5461 2007 California 5732.0 1.0 Democrat Democrat -1.391 -1.126 0 4337.523512 8.375059
5512 2008 California 6024.0 1.0 Democrat Democrat -1.391 -1.126 0 4389.932328 8.387069
5563 2009 California 6238.0 1.0 Democrat Democrat -1.437 -1.140 0 4562.113761 8.425541

What about the data for California just for the year 2000? For that we need to use the & operator and parentheses around each conditional:

policy_data.loc[
    (policy_data.state == "California") & (policy_data.year == 2000),
    :
]

year state health_spend_per_cap fin_incentive_ban senate_prty_control house_prty_control house_ideology senate_ideology senate_dem_control real_health_spend_per_cap log_real_health_spend_cap
5104 2000 California 3602.0 1.0 Democrat Democrat -0.956 -1.004 0 3281.961672 8.096197

Arranging Data

Instead of returning only one value, like the maximum, you might want to sort the data to see multiple values. We can do this with the .sort_values() method. Instead of finding just the state with the highest average health spending, let’s list the average spending of all states in descending order.

policy_data \
    .groupby("state") \
    .aggregate({"real_health_spend_per_cap" : "mean"}) \
    .sort_values(by = ["real_health_spend_per_cap"]) \
    .head(10)

real_health_spend_per_cap
state
Utah 2918.212534
Arizona 3222.403098
Idaho 3229.638758
Nevada 3406.427351
Texas 3481.811702
Georgia 3483.831267
New Mexico 3497.550297
Virginia 3511.702684
Colorado 3516.180834
Arkansas 3552.535751

This is almost what we want. The issue is that .sort_values() sorts in ascending order by default. To change that to descending order, we set the ascending argument to False like so:

policy_data \
    .groupby("state") \
    .aggregate({"real_health_spend_per_cap" : "mean"}) \
    .sort_values(by = ["real_health_spend_per_cap"], ascending = False) \
    .head(10)

real_health_spend_per_cap
state
District of Columbia 6302.281066
Massachusetts 5083.730091
Connecticut 4880.184476
New York 4844.236550
Delaware 4631.491237
Rhode Island 4599.037152
Alaska 4502.528923
Pennsylvania 4463.792502
New Jersey 4456.569432
Maine 4430.814738

Visual Methods

There are several tools for data visualization in Python. pandas has its own built-in plotting methods, but other popular libraries are matplotlib and seaborn. In this guide, I’ll mostly be using seaborn because it is one of the more modern visualization libraries. As always, we start by importing it:

%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20,10)

import seaborn as sns

Let’s start visualizing our data by looking at the distribution of real_health_spend_per_cap:

sns.displot(data = policy_data, x = "real_health_spend_per_cap")
<seaborn.axisgrid.FacetGrid at 0x7faf4e6124f0>

png

What if we wanted a density plot instead? Well, then we set kind equal to "kde".

sns.displot(data = policy_data, x = "real_health_spend_per_cap", kind = "kde")
<seaborn.axisgrid.FacetGrid at 0x7faf4e6124c0>

png

Let’s make a new density plot with our logged outcome variable (log_real_health_spend_cap) on the $x$-axis. Since we are mainly interested in how financial incentive bans affect this outcome, we’ll also set the hue argument to "fin_incentive_ban":

sns.displot(data = policy_data.dropna(), x = "log_real_health_spend_cap", kind="kde",  hue = "fin_incentive_ban")
<seaborn.axisgrid.FacetGrid at 0x7faf8a556100>

png

Notice that I also chained the .dropna() method onto the DataFrame. That’s because we only dropped the missing values of the outcome variable, so missing values still exist for other variables.

The distributions look fairly similar, and it seems that states with financial incentive bans spend a little bit more. A better way to compare these groups might be a box plot:

sns.boxplot(
    data = policy_data.dropna(),
    x = "fin_incentive_ban",
    y = "log_real_health_spend_cap"
)
<AxesSubplot:xlabel='fin_incentive_ban', ylabel='log_real_health_spend_cap'>

png

Here it definitely looks like states with financial incentive bans spend more.

What does spending look like across states?

sns.boxplot(
    data = policy_data.dropna(),
    x = "log_real_health_spend_cap",
    y = "state"
)
<AxesSubplot:xlabel='log_real_health_spend_cap', ylabel='state'>

png

If we order this plot it will be easier to understand. We want to order the states by health spending, so we’ll sort the data with the .sort_values() method first (seaborn draws the categories in the order it encounters them):

sns.boxplot(
    data = policy_data.sort_values("health_spend_per_cap", ascending = False),
    x = "log_real_health_spend_cap",
    y = "state"
)
<AxesSubplot:xlabel='log_real_health_spend_cap', ylabel='state'>

png
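A more explicit way to control the ordering is to compute it ourselves and pass it to seaborn's order argument. A sketch that orders states by their average logged spending:

spending_order = (
    policy_data
        .groupby("state")["log_real_health_spend_cap"]
        .mean()
        .sort_values(ascending = False)
        .index
)

sns.boxplot(
    data = policy_data.dropna(),
    x = "log_real_health_spend_cap",
    y = "state",
    order = spending_order
)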

We could add in fin_incentive_ban to see how financial incentive bans affect health spending in different states by mapping fin_incentive_ban to the hue argument:

sns.boxplot(
    data = policy_data.sort_values("health_spend_per_cap", ascending = False),
    x = "log_real_health_spend_cap",
    y = "state",
    hue = "fin_incentive_ban"
)
<AxesSubplot:xlabel='log_real_health_spend_cap', ylabel='state'>

png

This plot is a bit busy, but it still shows that states with a financial incentive ban spend more. This is especially apparent for the states that have data both before and after a financial incentive ban.

How does health spending vary by political party and state? To answer that, we’ll plot house_prty_control on the $x$-axis and log_real_health_spend_cap on the $y$-axis in a categorical plot where each state gets its own panel.

sns.catplot(
    data = policy_data,
    x = "house_prty_control",
    y = "log_real_health_spend_cap",
    col = "state",
    col_wrap = 5,
    kind = "box"
)
<seaborn.axisgrid.FacetGrid at 0x7faf0fa1c340>

png

It looks like Democrats generally spend more, but that’s not the case for every state.

As a final plot, let’s see how health spending changes over time. To make the plot look like we want, we need to set units = "state" and estimator = None:

sns.lineplot(
    data = policy_data.dropna(),
    x = "year",
    y = "log_real_health_spend_cap",
    units = "state",
    estimator = None
)
<AxesSubplot:xlabel='year', ylabel='log_real_health_spend_cap'>

png

If we decided that we wanted to use this plot in a professional paper, we would need to clean it up a bit. To prevent this guide from getting longer, I’m going to quickly format the plot, but check out the resources listed above for more detailed explanations of plot customization.

sns.set_style("white")


line_plt = sns.lineplot(
    data = policy_data.dropna(),
    x = "year",
    y = "log_real_health_spend_cap",
    units = "state",
    estimator = None,
    size = 0.1,
    legend = False,
    color = "gray"
)

line_plt.set_title("Inflation-Adjusted Health Spending Over Time by State")
line_plt.set_xlabel("Year") 
line_plt.set_ylabel("Real Health Spending Per Capita (Logged)")
line_plt.set_xticks(range(1993, 2010, 1))
line_plt.set_xlim(1993, 2010)
(1993.0, 2010.0)

png

And finally, we can save this figure with the .savefig() method:

fig = line_plt.get_figure()
fig.savefig("line_plot.pdf")

Going Deeper: We can also make maps!

The process of making maps is fairly straightforward. We need the geographic boundaries (latitude and longitude coordinates) of the places that we want to map, and those data need to be joined to the values that we want shown on the map. Let’s start by getting the boundaries for the states. We’ll get the geographic data from the U.S. Census (called a shapefile) and we’ll need to use geopandas to read it into Python.

import geopandas
states = geopandas.read_file("tl_2021_us_state.dbf")
states.head()

REGION DIVISION STATEFP STATENS GEOID STUSPS NAME LSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON geometry
0 3 5 54 01779805 54 WV West Virginia 00 G4000 A 62266298634 489204185 +38.6472854 -080.6183274 POLYGON ((-80.85847 37.42831, -80.85856 37.428...
1 3 5 12 00294478 12 FL Florida 00 G4000 A 138961722096 45972570361 +28.3989775 -082.5143005 MULTIPOLYGON (((-83.10874 24.62949, -83.10711 ...
2 2 3 17 01779784 17 IL Illinois 00 G4000 A 143778561906 6216493488 +40.1028754 -089.1526108 POLYGON ((-89.17208 37.06831, -89.17296 37.067...
3 2 4 27 00662849 27 MN Minnesota 00 G4000 A 206232627084 18949394733 +46.3159573 -094.1996043 POLYGON ((-92.74568 45.29604, -92.74629 45.295...
4 3 5 24 01714934 24 MD Maryland 00 G4000 A 25151992308 6979074857 +38.9466584 -076.6744939 POLYGON ((-75.76659 39.37756, -75.76630 39.373...

To make the map, we just use the geopandas .plot() method:

states.plot()
<AxesSubplot:>

png

Now we need to merge this data with the policy_data so that we can show the states with financial incentive bans on the map. To do this, we join the tables by matching the variables that are common to both datasets. There are many ways to join two data sets, but we will use an inner join. Visually, it looks like this:

Source: R for Data Science

That is, the inner join retains only the rows that match between the two datasets. With pandas we can implement this with the .merge() method, setting the how argument to "inner".

The only other caveat is that we have to tell Python the names of the columns to use to look for matches. In states the column we want is called NAME. In policy_data the column we want is called state. So, we set left_on = "NAME" and right_on = "state":

map_data = states \
    .merge(policy_data, left_on = "NAME", right_on = "state", how = "inner")

map_data.head()

REGION DIVISION STATEFP STATENS GEOID STUSPS NAME LSAD MTFCC FUNCSTAT ... state health_spend_per_cap fin_incentive_ban senate_prty_control house_prty_control house_ideology senate_ideology senate_dem_control real_health_spend_per_cap log_real_health_spend_cap
0 3 5 54 01779805 54 WV West Virginia 00 G4000 A ... West Virginia 2606.0 0.0 Democrat Democrat NaN NaN 0 3002.066079 8.007056
1 3 5 54 01779805 54 WV West Virginia 00 G4000 A ... West Virginia 2858.0 0.0 Democrat Democrat NaN NaN 0 3196.152530 8.069703
2 3 5 54 01779805 54 WV West Virginia 00 G4000 A ... West Virginia 3065.0 0.0 Democrat Democrat NaN NaN 0 3328.017301 8.110132
3 3 5 54 01779805 54 WV West Virginia 00 G4000 A ... West Virginia 3463.0 0.0 Democrat Democrat NaN NaN 0 3666.293522 8.206936
4 3 5 54 01779805 54 WV West Virginia 00 G4000 A ... West Virginia 3643.0 0.0 Democrat Democrat NaN NaN 0 3750.568898 8.229663

5 rows × 26 columns

Because we merged policy_data into states, the resulting DataFrame is still a geopandas GeoDataFrame, which is what we need for mapping:

type(map_data)
geopandas.geodataframe.GeoDataFrame

Now, let’s drop Alaska and Hawaii from the data and select the year that we want to map. I’ll use 1996 since that is the first year that any state adopted a ban on financial incentives.

# drop Alaska and Hawaii and keep only the year 1996
map_data = map_data.loc[
    (map_data.NAME != "Alaska") & (map_data.NAME != "Hawaii") & (map_data.year == 1996)
]

ban_map = map_data.plot(column = "fin_incentive_ban")
ban_map.set_axis_off()

png

Step 4: Develop a Statistical Model and Analyze your Data

After thoroughly exploring our data, we’re ready to make some inferences. We’ll start with a simple linear regression to test the association between the financial incentive bans and health care spending.

Treatment Indicator Model

To test my research question, I will use the following statistical model:

$$\text{Health Spending}_{i} = \alpha + \beta_{1} \text{T}_{i} + \epsilon_{i}$$

where $T$ is the treatment (fin_incentive_ban). However, because I only want to test my research question with one year of data, we will need to limit our data. I’m doing this for pedagogical reasons: the goal of this guide is to walk you through a data project, not to cover advanced regression models.

We can fit the model in Python with the smf.ols() function and the .fit() method provided by the statsmodels package.

import statsmodels.api as sm
import statsmodels.formula.api as smf

model_01 = smf.ols(
    "log_real_health_spend_cap ~ fin_incentive_ban",
    data = policy_data.loc[policy_data.year == 1996]
)

fit_01 = model_01.fit()
print(fit_01.summary())

                                OLS Regression Results                               
=====================================================================================
Dep. Variable:     log_real_health_spend_cap   R-squared:                       0.035
Model:                                   OLS   Adj. R-squared:                  0.015
Method:                        Least Squares   F-statistic:                     1.737
Date:                       Thu, 21 Apr 2022   Prob (F-statistic):              0.194
Time:                               17:04:32   Log-Likelihood:                 39.098
No. Observations:                         50   AIC:                            -74.20
Df Residuals:                             48   BIC:                            -70.37
Df Model:                                  1                                         
Covariance Type:                   nonrobust                                         
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                    8.1031      0.017    486.410      0.000       8.070       8.137
fin_incentive_ban[T.1.0]     0.0776      0.059      1.318      0.194      -0.041       0.196
==============================================================================
Omnibus:                        1.297   Durbin-Watson:                   2.000
Prob(Omnibus):                  0.523   Jarque-Bera (JB):                0.660
Skew:                           0.249   Prob(JB):                        0.719
Kurtosis:                       3.262   Cond. No.                         3.71
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Our model says that adopting a policy banning financial incentives for doctors is associated with an increase in a state’s per capita spending on health care of about $100 \times (e^{0.0776} - 1) \approx 8.07$ percent. Not a massive effect, and notice that it’s very imprecisely estimated (the standard error is large relative to the coefficient estimate).
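If you don’t want to do that conversion by hand, you can pull the coefficient out of the fitted model and compute it directly. This is a quick sketch; the coefficient name matches the one printed in the summary above:

import numpy as np

# convert the log-point coefficient into a percent change: 100 * (exp(beta) - 1)
beta = fit_01.params["fin_incentive_ban[T.1.0]"]
print(round(100 * (np.exp(beta) - 1), 2))   # roughly 8.07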

Now the model with controls:

$$\text{Health Spending}_{i} = \alpha + \beta_{1} \text{T}_{i} + \beta_{2} \text{Dem House}_{i} + \beta_{3} \text{Dem Senate}_{i} + \beta_{4} \text{Senate Ideo}_{i} + \beta_{5} \text{House Ideo}_{i} + \epsilon_{i}$$

model_02 = smf.ols(
    "log_real_health_spend_cap ~ fin_incentive_ban + house_prty_control + senate_prty_control + house_ideology + senate_ideology",
    data = policy_data.loc[policy_data.year == 1996]
)

fit_02 = model_02.fit()
print(fit_02.summary())
                                OLS Regression Results                               
=====================================================================================
Dep. Variable:     log_real_health_spend_cap   R-squared:                       0.524
Model:                                   OLS   Adj. R-squared:                  0.443
Method:                        Least Squares   F-statistic:                     6.429
Date:                       Thu, 21 Apr 2022   Prob (F-statistic):           0.000123
Time:                               17:04:32   Log-Likelihood:                 47.149
No. Observations:                         42   AIC:                            -80.30
Df Residuals:                             35   BIC:                            -68.13
Df Model:                                  6                                         
Covariance Type:                   nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
Intercept                           8.1539      0.025    325.421      0.000       8.103       8.205
fin_incentive_ban[T.1.0]            0.0715      0.049      1.465      0.152      -0.028       0.171
house_prty_control[T.Split]      5.589e-16   1.57e-17     35.506      0.000    5.27e-16    5.91e-16
house_prty_control[T.Democrat]     -0.0982      0.052     -1.900      0.066      -0.203       0.007
senate_prty_control[T.Split]       -0.0912      0.096     -0.950      0.349      -0.286       0.104
senate_prty_control[T.Democrat]     0.0284      0.049      0.578      0.567      -0.071       0.128
house_ideology                     -0.2521      0.057     -4.450      0.000      -0.367      -0.137
senate_ideology                     0.0769      0.056      1.372      0.179      -0.037       0.191
==============================================================================
Omnibus:                       11.066   Durbin-Watson:                   1.901
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               10.628
Skew:                          -1.066   Prob(JB):                      0.00492
Kurtosis:                       4.237   Cond. No.                     9.35e+17
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.01e-35. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Notice that the point estimate decreases slightly after including controls. The model now tells us that adopting a ban on financial incentives for doctors is associated with an increase in a state’s per capita health care spending of about 7.4 percent.

Let’s make a table for these results with stargazer.

from stargazer.stargazer import Stargazer

results_table = Stargazer([fit_01, fit_02])
results_table.rename_covariates(
    {"fin_incentive_ban[T.1.0]": "Financial Incentive Ban",
    "house_ideology": "House Ideology",
    "house_prty_control[T.Democrat]": "Democrat Controlled House",
    "house_prty_control[T.Split]": "Split Party Controlled House",
    "senate_ideology": "Senate Ideology",
    "senate_prty_control[T.Democrat]": "Democrat Controlled Senate",
    "senate_prty_control[T.Split]": "Split Party Controlled Senate"}
)
results_table
Dependent variable: log_real_health_spend_cap

|                               | (1)              | (2)                 |
|-------------------------------|------------------|---------------------|
| Intercept                     | 8.103***         | 8.154***            |
|                               | (0.017)          | (0.025)             |
| Financial Incentive Ban       | 0.078            | 0.072               |
|                               | (0.059)          | (0.049)             |
| House Ideology                |                  | -0.252***           |
|                               |                  | (0.057)             |
| Democrat Controlled House     |                  | -0.098*             |
|                               |                  | (0.052)             |
| Split Party Controlled House  |                  | 0.000***            |
|                               |                  | (0.000)             |
| Senate Ideology               |                  | 0.077               |
|                               |                  | (0.056)             |
| Democrat Controlled Senate    |                  | 0.028               |
|                               |                  | (0.049)             |
| Split Party Controlled Senate |                  | -0.091              |
|                               |                  | (0.096)             |
| Observations                  | 50               | 42                  |
| R2                            | 0.035            | 0.524               |
| Adjusted R2                   | 0.015            | 0.443               |
| Residual Std. Error           | 0.113 (df=48)    | 0.086 (df=35)       |
| F Statistic                   | 1.737 (df=1; 48) | 6.429*** (df=6; 35) |

Note: *p<0.1; **p<0.05; ***p<0.01
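If you want to save this table for a report, stargazer can render it to HTML or LaTeX. A minimal sketch (the file name is just an example):

# write the table to an HTML file that can be opened in a browser or pasted into a report
with open("results_table.html", "w") as f:
    f.write(results_table.render_html())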

Differences-in-differences Model

With differences-in-differences estimation, our goal is to control for pre-existing differences between the treatment and control groups by looking only at how they change over time. A simple way to estimate this uses the following formula:

$$\text{DiD Estimate} = \text{E}[Y^{after}_{treated} - Y^{after}_{control}] - \text{E}[Y^{before}_{treated} - Y^{before}_{control}]$$

This gives us the difference between the treatment and control groups after the intervention ($\text{E}[Y^{after}_{treated} - Y^{after}_{control}]$) minus the difference between the treatment and control groups before the intervention ($\text{E}[Y^{before}_{treated} - Y^{before}_{control}]$). Now let’s calculate this with our data.
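As a quick illustration of how the formula works, here is a tiny example with made-up numbers (purely hypothetical, not from our data):

# hypothetical group means, just to illustrate the formula
treated_after, control_after = 10.0, 8.0
treated_before, control_before = 7.0, 6.0

# the treated group's change minus the control group's change
did = (treated_after - control_after) - (treated_before - control_before)
print(did)   # 1.0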

Before building the model, however, we are going to restrict our data to just six states. Three that adopted the policy in 1996 (California, Florida, and Maryland) and three that never adopted the policy (Virginia, Colorado, and Washington). This will make things easier, as I will explain later. Note that the | operator means “or.”

diff_data = policy_data \
    .loc[
        (policy_data.state == "California") | (policy_data.state == "Colorado") |
        (policy_data.state == "Florida") | (policy_data.state == "Maryland") |
        (policy_data.state == "Washington") | (policy_data.state == "Virginia")
    ]
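An equivalent, and arguably cleaner, way to write this filter uses the .isin() method:

# same filter using .isin() with a list of the six states we want to keep
keep_states = ["California", "Colorado", "Florida", "Maryland", "Virginia", "Washington"]
diff_data = policy_data.loc[policy_data.state.isin(keep_states)]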

Now let’s add a variable that identifies which states are in the treatment group, that is, which states are affected by the treatment.

diff_data = diff_data \
    .assign(treatment_group = np.where(
        (diff_data.state == "California") | (diff_data.state == "Florida") |
        (diff_data.state == "Maryland"), 1, 0
    ))

Now we have a variable named treatment_group that tells us which states are in the treatment group.

Let’s review the important elements of our dataset. We have a column called fin_incentive_ban that tells us the year when states in the treatment group received the treatment and we have a column called treatment_group that tells us which states are in the treatment group and which are in the control group.

I limited the data to these six states because California, Florida, and Maryland adopt the treatment in 1996, while Virginia, Colorado, and Washington never do. We could include all states, but that introduces complexity since not every state adopts the treatment in the same year. This leads to what is called a staggered difference-in-differences design. A plethora of papers on staggered difference-in-differences designs have been published in the last 5 years, so these methods are rapidly changing. See How much should we trust staggered difference-in-differences estimates? as a primer for how they work and their potential problems. Instead of exploring that, I’m just going to compare these six states.

Before we calculate the DiD estimate, and since we have multiple pre-treatment periods, let’s check the parallel trends assumption:

trends_plt = sns.lineplot(
    data = diff_data,
    x = "year",
    y = "log_real_health_spend_cap",
    hue = "fin_incentive_ban",
    style = "state",
    units = "state",
    estimator = None
)

trends_plt.set_xlabel("Year")
trends_plt.set_ylabel("Log Real Health Spending per Capita")
trends_plt.set_xticks(range(1992, 2009, 1))
trends_plt.set_xlim(1993, 2010)
trends_plt.axvline(x = 1996, color = "black", linestyle = "--")
trends_plt.annotate("Policy Adoption" , xy = (1996.5, 8.5))
Text(1996.5, 8.5, 'Policy Adoption')

[Figure: log real health spending per capita over time for the six states, with a vertical dashed line marking policy adoption in 1996]

Those trends really aren’t parallel. Why does it matter? Well, if health spending in California was already trending upward, relative to the control states, before the treatment began, then any differences after the treatment are probably due to whatever was causing those earlier trends. In fact, this plot shows that after California adopts the treatment, both groups of states follow similar trends. This suggests that the policy doesn’t really affect health spending.

Now we’re ready to calculate the DiD estimate:

diff_data \
    .groupby(["treatment_group", "fin_incentive_ban"]) \
    .aggregate({"log_real_health_spend_cap" : "mean"})

| treatment_group | fin_incentive_ban | log_real_health_spend_cap |
|-----------------|-------------------|---------------------------|
| 0               | 0.0               | 8.170823                  |
| 0               | 1.0               | NaN                       |
| 1               | 0.0               | 8.123392                  |
| 1               | 1.0               | 8.334788                  |

Row one shows the mean health spending for the control group before the treatment period, row three shows the mean health spending for the treatment group before the treatment period, and row four shows the mean health spending for the treatment group after the treatment takes place. Why did we get a NaN value in row two? Go back and look through the data for members of the control group. What does diff_data.loc[diff_data.fin_incentive_ban == 1] return for states in the control group?

There are no states in the control group that ever have a value of 1 in the fin_incentive_ban column. That means that we can’t calculate $\text{E}[Y^{after}_{control}]$, or more technically, $\text{E}[Y(T = 0 \mid D = 1)]$, where $D$ indicates the treatment period and $T$ indicates group assignment.
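If you want to verify this yourself, a quick check (a sketch, assuming the columns created above) returns zero rows:

# control-group rows where the ban is coded as in effect: there are none,
# which is why the (treatment_group = 0, fin_incentive_ban = 1.0) cell is NaN
diff_data.loc[
    (diff_data.treatment_group == 0) & (diff_data.fin_incentive_ban == 1)
].shape[0]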

How can we fix this? All we have to do is create a column that tells us when the treatment period starts (which is 1996 for our data):

diff_data = diff_data \
    .assign(treatment_period = np.where(diff_data.year >= 1996, 1, 0))

Now let’s look at California and Colorado, one treatment member and one control member, to see what is going on with these columns:

diff_data \
    .loc[:, ["year", "state", "fin_incentive_ban", "treatment_group", "treatment_period"]] \
    .sort_values(["state", "year"])

year state fin_incentive_ban treatment_group treatment_period
4645 1991 California 0.0 1 0
4696 1992 California 0.0 1 0
4747 1993 California 0.0 1 0
4798 1994 California 0.0 1 0
4849 1995 California 0.0 1 0
... ... ... ... ... ...
5402 2005 Washington 0.0 0 1
5453 2006 Washington 0.0 0 1
5504 2007 Washington 0.0 0 1
5555 2008 Washington 0.0 0 1
5606 2009 Washington 0.0 0 1

114 rows × 5 columns

For California, it doesn’t look like there is any difference between the fin_incentive_ban and treatment_period columns. But for Colorado, treatment_period gets a value of 1 in 1996 even though Colorado never receives the treatment.
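If you want to check this claim programmatically rather than by scrolling through the output, here is a quick sketch (the comments describe what the text above suggests the output should be):

# for California the ban indicator and the treatment period should line up exactly...
ca = diff_data.loc[diff_data.state == "California"]
print((ca.fin_incentive_ban == ca.treatment_period).all())   # True if the columns match

# ...but for Colorado treatment_period turns on in 1996 even though the ban never does
co = diff_data.loc[diff_data.state == "Colorado"]
print((co.fin_incentive_ban == co.treatment_period).all())   # False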

With the treatment_group and the treatment_period columns, we have all the ingredients that we need. Now let’s finish calculating the post-treatment differences and the DiD estimate.

diff_values = diff_data \
    .groupby(["treatment_group", "treatment_period"]) \
    .aggregate({"log_real_health_spend_cap": "mean"}) \
    .sort_values("treatment_period") \
    .reset_index()
diff_values

|   | treatment_group | treatment_period | log_real_health_spend_cap |
|---|-----------------|------------------|---------------------------|
| 0 | 0               | 0                | 7.980795                  |
| 1 | 1               | 0                | 8.123392                  |
| 2 | 0               | 1                | 8.238691                  |
| 3 | 1               | 1                | 8.334788                  |

# DiD estimate
(diff_values.iloc[3, 2] - diff_values.iloc[2, 2]) - (diff_values.iloc[1, 2] - diff_values.iloc[0, 2])
-0.046498870741971565

Looks like the treatment has a negative effect on the log of health spending per capita. Specifically, the estimate implies that adopting a ban on financial incentives for doctors decreases a state’s per capita health care spending by about 4.5 percent ($100 \times (e^{-0.0465} - 1) \approx -4.5$).
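A slightly more readable way to do the same calculation is to pivot the group means into a 2×2 table first. This is a sketch, assuming the same diff_values columns as above:

# pivot the means into a 2x2 table: rows = treatment_group, columns = treatment_period
did_table = diff_values.pivot(
    index = "treatment_group",
    columns = "treatment_period",
    values = "log_real_health_spend_cap"
)

# (treated - control) after, minus (treated - control) before
did_estimate = (did_table.loc[1, 1] - did_table.loc[0, 1]) - (did_table.loc[1, 0] - did_table.loc[0, 0])
print(did_estimate)   # about -0.046, matching the calculation above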

It’s great that we have the DiD estimate, but if we want to conduct a hypothesis test we’ll need its standard error. The easiest way to get it is to run a regression. The specification we need is this:

$$\log(\text{Health Spending})_{it} = \alpha + \beta_{1} \text{T}_{it} + \beta_{2} \text{G}_{it} + \beta_{3} \text{T}_{it} \times \text{G}_{it} + \epsilon_{it}$$

where $T$ is treatment_period, $G$ is treatment_group, and $\beta_{3}$ is the estimate for the interaction between treatment_period and treatment_group (this is the DiD estimate). Let’s run it!

did_model = smf.ols(
    "log_real_health_spend_cap ~ treatment_period * treatment_group",
    data = diff_data
)

did_fit = did_model.fit()

print(did_fit.summary())
                                OLS Regression Results                               
=====================================================================================
Dep. Variable:     log_real_health_spend_cap   R-squared:                       0.451
Model:                                   OLS   Adj. R-squared:                  0.436
Method:                        Least Squares   F-statistic:                     30.08
Date:                       Thu, 21 Apr 2022   Prob (F-statistic):           2.80e-14
Time:                               17:04:33   Log-Likelihood:                 71.446
No. Observations:                        114   AIC:                            -134.9
Df Residuals:                            110   BIC:                            -123.9
Df Model:                                  3                                         
Covariance Type:                   nonrobust                                         
====================================================================================================
                                       coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------
Intercept                            7.9808      0.034    234.830      0.000       7.913       8.048
treatment_period                     0.2579      0.040      6.514      0.000       0.179       0.336
treatment_group                      0.1426      0.048      2.967      0.004       0.047       0.238
treatment_period:treatment_group    -0.0465      0.056     -0.830      0.408      -0.157       0.064
==============================================================================
Omnibus:                        2.650   Durbin-Watson:                   0.434
Prob(Omnibus):                  0.266   Jarque-Bera (JB):                1.883
Skew:                          -0.118   Prob(JB):                        0.390
Kurtosis:                       2.416   Cond. No.                         9.61
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Check out the estimate for our interaction treatment_period:treatment_group: it is exactly what we calculated by hand. Unfortunately, however, the estimate is not statistically distinguishable from zero (p = 0.408), so the treatment doesn’t seem to have any effect. This is not a great difference-in-differences model because the parallel trends assumption doesn’t hold, but hopefully it showed you how diff-in-diff estimation works. Let’s add this model to our results table.

results_table = Stargazer([fit_01, fit_02, did_fit])
results_table.rename_covariates(
    {"fin_incentive_ban[T.1.0]": "Financial Incentive Ban",
    "house_ideology": "House Ideology",
    "house_prty_control[T.Democrat]": "Democrat Controlled House",
    "house_prty_control[T.Split]": "Split Party Controlled House",
    "senate_ideology": "Senate Ideology",
    "senate_prty_control[T.Democrat]": "Democrat Controlled Senate",
    "senate_prty_control[T.Split]": "Split Party Controlled Senate",
    "treatment_period:treatment_group": "Diff-in-Diff Estimate",
    "treatment_group": "Treatment Group",
    "treatment_period": "Treatment Period"}
)
results_table
Dependent variable: log_real_health_spend_cap

|                               | (1)              | (2)                 | (3)                    |
|-------------------------------|------------------|---------------------|------------------------|
| Intercept                     | 8.103***         | 8.154***            | 7.981***               |
|                               | (0.017)          | (0.025)             | (0.034)                |
| Financial Incentive Ban       | 0.078            | 0.072               |                        |
|                               | (0.059)          | (0.049)             |                        |
| House Ideology                |                  | -0.252***           |                        |
|                               |                  | (0.057)             |                        |
| Democrat Controlled House     |                  | -0.098*             |                        |
|                               |                  | (0.052)             |                        |
| Split Party Controlled House  |                  | 0.000***            |                        |
|                               |                  | (0.000)             |                        |
| Senate Ideology               |                  | 0.077               |                        |
|                               |                  | (0.056)             |                        |
| Democrat Controlled Senate    |                  | 0.028               |                        |
|                               |                  | (0.049)             |                        |
| Split Party Controlled Senate |                  | -0.091              |                        |
|                               |                  | (0.096)             |                        |
| Treatment Group               |                  |                     | 0.143***               |
|                               |                  |                     | (0.048)                |
| Treatment Period              |                  |                     | 0.258***               |
|                               |                  |                     | (0.040)                |
| Diff-in-Diff Estimate         |                  |                     | -0.046                 |
|                               |                  |                     | (0.056)                |
| Observations                  | 50               | 42                  | 114                    |
| R2                            | 0.035            | 0.524               | 0.451                  |
| Adjusted R2                   | 0.015            | 0.443               | 0.436                  |
| Residual Std. Error           | 0.113 (df=48)    | 0.086 (df=35)       | 0.132 (df=110)         |
| F Statistic                   | 1.737 (df=1; 48) | 6.429*** (df=6; 35) | 30.082*** (df=3; 110)  |

Note: *p<0.1; **p<0.05; ***p<0.01

Summary

Wow! That was a lot of information! Remember, the goal of this guide is to show you how to work through all the steps of a data project so that you can complete a high-quality project of your own. If you read this straight through, well done! Chances are, though, that this will work better as a reference to help you solve problems that you encounter in your own projects.
