Hello!
5th year grad student in Political Science
MA in economics from CSULB
My research focuses on how money influences politics
I love data science!
I’ll be working with you all year
September 20/21, 2021
The goal of this boot camp is to introduce you to some of the basic principles of working with data and coding
How to download data
How to manipulate data
How to visualize data
I want to get you prepared to start your first quantitative course in the MPP program!
I want this to be more of an “Intro to Data Analysis with Some Code” instead of an “Intro to Coding” boot camp
I use R, but your quant instructor uses Python …
Both languages are great for data science
I’ll try to cover the very basics of R, which, as you will see, are very similar to Python’s
They let us produce good answers to important questions:
Easier questions:
Harder questions:
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
   firm year    inv  value capital
19    1 1953 1304.4 6241.7  1777.3
20    1 1954 1486.7 5593.6  2226.3
39    2 1953  641.0 2031.3   623.6
40    2 1954  459.3 2115.5   669.7
59    3 1953  179.5 2371.6   800.3
60    3 1954  189.6 2759.9   888.9
##             x         y z
## [1,] 3.791744 0.4206294 0
## [2,] 6.087617 0.3604956 1
## [3,] 2.402151 0.7015900 0
## [4,] 5.493115 0.2953055 1
## [5,] 4.770244 0.7497821 0
## [6,] 4.116415 0.6081253 1
   CPO1976   DCO9679    z t
1 3.791744 0.4206294 -999 4
2 6.087617 0.3604956    1 6
3 2.402151 0.7015900 -999 2
4 5.493115 0.2953055    1
5 4.770244 0.7497821 -999
6 4.116415 0.6081253    1 4
Data should be meaningful (or it’s not useful)
We need to be able to understand what the data tells us when we look at it
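As a rough illustration of making data more readable, the sketch below rebuilds a few rows of the cryptically named table and tidies them up. It rests on assumptions: the new column names (`measure_1`, `measure_2`) are hypothetical placeholders, since the real meaning would come from a codebook, and `-999` is assumed to be a placeholder code for a missing value rather than a real measurement.

```r
# Toy data mimicking the hard-to-read table: cryptic names and a -999 code
messy <- data.frame(
  CPO1976 = c(3.791744, 6.087617, 2.402151),
  DCO9679 = c(0.4206294, 0.3604956, 0.7015900),
  z       = c(-999, 1, -999)
)

clean <- messy

# Hypothetical descriptive names; real names would come from a codebook
names(clean)[1:2] <- c("measure_1", "measure_2")

# ASSUMPTION: -999 marks a missing value, so recode it to NA
clean$z[clean$z == -999] <- NA

clean
```

With `NA` in place of `-999`, R will no longer treat the placeholder as a real number when computing means or drawing plots.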
Once we understand our data we can explore it
Let’s use an example
Suppose we want to know what predicts a car’s city miles per gallon
To explore this question, we’ll use data on vehicle mpg:
# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
How many different manufacturers do we have in our data?
What are the oldest and newest cars in our data?
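Questions like these can be answered with a couple of lines of base R. The sketch below assumes the data shown is the `mpg` data set that ships with the ggplot2 package (the printed columns match it, but the slides do not say so explicitly).

```r
library(ggplot2)  # ASSUMPTION: the data on the slide is ggplot2's mpg data set

# How many different manufacturers are there?
n_manufacturers <- length(unique(mpg$manufacturer))
n_manufacturers

# Which model years are the earliest and latest?
year_range <- range(mpg$year)
year_range
```

`unique()` drops repeated values, so counting what is left gives the number of distinct manufacturers; `range()` returns the smallest and largest value in one step.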
      cty             hwy
 Min.   : 9.00   Min.   :12.00
 1st Qu.:14.00   1st Qu.:18.00
 Median :17.00   Median :24.00
 Mean   :16.86   Mean   :23.44
 3rd Qu.:19.00   3rd Qu.:27.00
 Max.   :35.00   Max.   :44.00
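A summary table like this one can be produced with `summary()`. This is a minimal sketch, again assuming the data is ggplot2's `mpg` data set; the slides may have generated the table differently.

```r
library(ggplot2)  # ASSUMPTION: the data on the slide is ggplot2's mpg data set

# Five-number summary plus the mean for city and highway mpg
mpg_summary <- summary(mpg[, c("cty", "hwy")])
mpg_summary
```

Selecting the two columns by name keeps the output focused on the variables of interest instead of summarizing all eleven.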
In the modeling stage we estimate formal relationships between variables
We can also make predictions about future data
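To make the modeling stage concrete, here is one possible sketch using a simple linear regression of city mpg on engine displacement. Both the choice of model and the choice of predictor are assumptions made for illustration, as is the use of ggplot2's `mpg` data set; the engine sizes in `new_cars` are hypothetical.

```r
library(ggplot2)  # ASSUMPTION: the data on the slide is ggplot2's mpg data set

# Estimate a formal relationship: city mpg as a linear function of engine size
fit <- lm(cty ~ displ, data = mpg)
coef(fit)  # intercept and slope

# Predict city mpg for hypothetical cars with 2.0 and 4.0 litre engines
new_cars <- data.frame(displ = c(2.0, 4.0))
predicted_cty <- predict(fit, newdata = new_cars)
predicted_cty
```

The slope on `displ` tells us how city mpg changes, on average, for each extra litre of engine displacement, and `predict()` applies the fitted line to cars that are not in the data.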
If you make an honest effort to solve each practice problem, I promise that you will be able to do the following by the end of the workshop:
Understand the principles of tidy data
Load a data set in R and clean it for analysis
Summarize and visualize data