MPP Bootcamp

September 20/21, 2021

Introduction

Hello!

5th year grad student in Political Science
MA in economics from CSULB
My research focuses on how money influences politics
I love data science!
I’ll be working with you all year

Boot Camp Outline

The goal of this boot camp is to introduce you to some of the basic principles of working with data and coding

How to download data
How to manipulate data
How to visualize data

I want to get you prepared to start your first quantitative course in the MPP program!

This is a “Hands-on” Boot Camp

I want this to be more of an “Intro to Data Analysis with Some Code” instead of an “Intro to Coding” boot camp

I use R, but your quant instructor uses Python …
Both languages are great for data science
I’ll try to cover the very basics of R, which you will see are very similar to Python

What’s the Big Deal About Quant Methods?

They let us produce good answers to important questions:

Easier questions:

Which states spend the most money on healthcare?
What percentage of Republicans voted for Biden?

Harder questions:

Does healthcare spending improve health outcomes?
Do job training programs reduce unemployment?

Working with Data is a Process

Examples of Clean Data

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

What does each row describe?
Can you tell what the values in each column represent?
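
The table above matches the first six rows of mtcars, a data set that ships with base R (that match is an assumption based on the values shown). A minimal sketch of how you might print those rows and check what each column holds:

```r
# mtcars ships with base R, so no packages are needed.
data(mtcars)

# First six rows: one row per car model, one column per measurement.
first_rows <- head(mtcars)
print(first_rows)

# Number of rows and columns in the full data set.
n_rows <- nrow(mtcars)
n_cols <- ncol(mtcars)

# Column names and types at a glance.
str(mtcars)
```

In an interactive session, typing ?mtcars opens the help page that explains each column.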

Examples of Clean Data

   firm year    inv  value capital
19    1 1953 1304.4 6241.7  1777.3
20    1 1954 1486.7 5593.6  2226.3
39    2 1953  641.0 2031.3   623.6
40    2 1954  459.3 2115.5   669.7
59    3 1953  179.5 2371.6   800.3
60    3 1954  189.6 2759.9   888.9

What does each row describe?
Can you tell what the values in each column represent?

Examples of Messy Data

##             x         y z
## [1,] 3.791744 0.4206294 0
## [2,] 6.087617 0.3604956 1
## [3,] 2.402151 0.7015900 0
## [4,] 5.493115 0.2953055 1
## [5,] 4.770244 0.7497821 0
## [6,] 4.116415 0.6081253 1

What does each row describe?
Can you tell what the values in each column represent?

Examples of Messy Data

   CPO1976   DCO9679    z t
1 3.791744 0.4206294 -999 4
2 6.087617 0.3604956    1 6
3 2.402151 0.7015900 -999 2
4 5.493115 0.2953055    1  
5 4.770244 0.7497821 -999  
6 4.116415 0.6081253    1 4

What does each row describe?
Can you tell what the values in each column represent?

Messy Data is Bad

Data should be meaningful (or it’s not useful)
We need to be able to understand what the data tells us when we look at it
- What do the rows describe?
- What does each value mean?
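
As a rough sketch of a first cleaning step, the second messy table can be rebuilt by hand and its placeholder values turned into proper missing values. Treating -999 and the blank cells as codes for "missing" is an assumption, since the slide does not say what they mean. The column names CPO1976 and DCO9679 are left alone because renaming them would require a codebook we do not have.

```r
# Rebuild the messy table from the slide (values copied by hand).
messy <- data.frame(
  CPO1976 = c(3.791744, 6.087617, 2.402151, 5.493115, 4.770244, 4.116415),
  DCO9679 = c(0.4206294, 0.3604956, 0.7015900, 0.2953055, 0.7497821, 0.6081253),
  z       = c(-999, 1, -999, 1, -999, 1),
  t       = c("4", "6", "2", "", "", "4"),
  stringsAsFactors = FALSE
)

clean <- messy

# Assumption: -999 is a missing-value code, not a real measurement.
clean$z[clean$z == -999] <- NA

# Assumption: blank cells in t are also missing; then store t as integers.
clean$t[clean$t == ""] <- NA
clean$t <- as.integer(clean$t)

# Count missing values per column.
missing_counts <- colSums(is.na(clean))
print(missing_counts)
```

Recoding the placeholders keeps a value like -999 from being averaged in as if it were a real number.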

Exploratory Analysis

Once we understand our data we can explore it
Let’s use an example
- Suppose we want to know what predicts a car’s city miles per gallon
- To explore this question, we’ll use data on vehicle mpg:

# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

Exploratory Analysis

How many different manufacturers do we have in our data?
- 15
What are the oldest and newest model years in our data?
- 1999, 2008
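
Here is one way these two answers could be computed. The sketch assumes the vehicle table is the mpg data set that comes with the ggplot2 package, which the printed rows appear to match.

```r
# Assumption: the slide's table is ggplot2's mpg data set.
mpg <- ggplot2::mpg

# How many different manufacturers appear in the data?
n_manufacturers <- length(unique(mpg$manufacturer))
print(n_manufacturers)

# Earliest and latest model year in the data.
year_range <- range(mpg$year)
print(year_range)
```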

Exploratory Analysis

What are the average city and highway MPG ratings?

      cty             hwy       
 Min.   : 9.00   Min.   :12.00  
 1st Qu.:14.00   1st Qu.:18.00  
 Median :17.00   Median :24.00  
 Mean   :16.86   Mean   :23.44  
 3rd Qu.:19.00   3rd Qu.:27.00  
 Max.   :35.00   Max.   :44.00
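
A summary table like the one above can be produced with a single call. The sketch again assumes the data is ggplot2's mpg data set.

```r
# Assumption: the slide's table is ggplot2's mpg data set.
mpg <- ggplot2::mpg

# Five-number summary plus the mean for city and highway MPG.
mpg_summary <- summary(mpg[, c("cty", "hwy")])
print(mpg_summary)

# The means on their own, if that is all we need.
mean_cty <- mean(mpg$cty)
mean_hwy <- mean(mpg$hwy)
```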

Modeling

In the modeling stage we estimate formal relationships between variables
We can also make predictions about future data
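
To make this concrete, here is a minimal sketch of a linear model for city MPG. Using engine displacement (displ) as the only predictor is an illustrative choice, not a recommended model, and the two "new" cars are hypothetical values made up for the prediction step. The data is assumed to be ggplot2's mpg data set.

```r
# Assumption: the data is ggplot2's mpg data set.
mpg <- ggplot2::mpg

# Estimate a straight-line relationship: city MPG as a function of engine size.
fit <- lm(cty ~ displ, data = mpg)

# Intercept and slope of the fitted line.
coefs <- coef(fit)
print(coefs)

# Predict city MPG for two hypothetical cars with 2.0 and 4.0 litre engines.
new_cars <- data.frame(displ = c(2.0, 4.0))
predicted_cty <- predict(fit, newdata = new_cars)
print(predicted_cty)
```

A negative slope here would mean that cars with larger engines tend to get fewer city miles per gallon in this data.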

Boot Camp Learning Objectives

If you make an honest effort to solve each practice problem, I promise that you will be able to do the following by the end of the workshop:

Understand the principles of tidy data
Load a data set in R and clean it for analysis
Summarize and visualize data