September 20/21, 2021

Introduction

Hello!

  • 5th year grad student in Political Science

  • MA in economics from CSULB

  • My research focuses on how money influences politics

  • I love data science!

  • I’ll be working with you all year

Boot Camp Outline

The goal of this boot camp is to introduce you to some basic principles of working with data and coding:

  • How to download data

  • How to manipulate data

  • How to visualize data

I want to get you prepared to start your first quantitative course in the MPP program!

This is a “Hands-on” Boot Camp

I want this to be more of an “Intro to Data Analysis with Some Code” boot camp than an “Intro to Coding” boot camp

  • I use R, but your quant instructor uses Python …

  • Both languages are great for data science

  • I’ll cover the basics of R, which you will see closely resemble the basics of Python

What’s the Big Deal About Quant Methods?

  • They let us produce good answers to important questions:

    Easier questions:

  1. Which states spend the most money on healthcare?
  2. What percentage of Republicans voted for Biden?

    Harder questions:

  1. Does healthcare spending improve health outcomes?
  2. Do job training programs reduce unemployment?

Working with Data is a Process

Examples of Clean Data

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
  1. What does each row describe?
  2. Can you tell what the values in each column represent?

Examples of Clean Data

   firm year    inv  value capital
19    1 1953 1304.4 6241.7  1777.3
20    1 1954 1486.7 5593.6  2226.3
39    2 1953  641.0 2031.3   623.6
40    2 1954  459.3 2115.5   669.7
59    3 1953  179.5 2371.6   800.3
60    3 1954  189.6 2759.9   888.9
  1. What does each row describe?
  2. Can you tell what the values in each column represent?
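
One way to check what each row describes is to test whether a set of columns uniquely identifies the rows. The sketch below is in Python (the deck's output comes from R) and uses only the six rows shown on this slide. It assumes that `firm` and `year` together form the key; the variable names are illustrative.

```python
# The six rows shown on the slide: (firm, year, inv, value, capital).
rows = [
    (1, 1953, 1304.4, 6241.7, 1777.3),
    (1, 1954, 1486.7, 5593.6, 2226.3),
    (2, 1953, 641.0, 2031.3, 623.6),
    (2, 1954, 459.3, 2115.5, 669.7),
    (3, 1953, 179.5, 2371.6, 800.3),
    (3, 1954, 189.6, 2759.9, 888.9),
]

# Build the candidate key (firm, year) for every row.
keys = [(firm, year) for firm, year, *_ in rows]

# If no key repeats, each row describes exactly one firm in one year.
key_is_unique = len(keys) == len(set(keys))
print(key_is_unique)
```

If the check passes, "one row per firm per year" is a reasonable reading of the table.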

Examples of Messy Data

            x         y z
[1,] 3.791744 0.4206294 0
[2,] 6.087617 0.3604956 1
[3,] 2.402151 0.7015900 0
[4,] 5.493115 0.2953055 1
[5,] 4.770244 0.7497821 0
[6,] 4.116415 0.6081253 1
  1. What does each row describe?
  2. Can you tell what the values in each column represent?

Examples of Messy Data

   CPO1976   DCO9679    z t
1 3.791744 0.4206294 -999 4
2 6.087617 0.3604956    1 6
3 2.402151 0.7015900 -999 2
4 5.493115 0.2953055    1  
5 4.770244 0.7497821 -999  
6 4.116415 0.6081253    1 4
  1. What does each row describe?
  2. Can you tell what the values in each column represent?
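
Part of what makes this table messy is that missing values are hidden in two different ways. The Python sketch below (the deck itself uses R) recodes them to an explicit missing marker. It assumes that `-999` in `z` and the blanks in `t` both mean "missing"; the slide does not say so, which is exactly the problem. The helper `to_number` is hypothetical.

```python
# Raw values of the z and t columns, copied from the slide as text.
z_raw = ["-999", "1", "-999", "1", "-999", "1"]
t_raw = ["4", "6", "2", "", "", "4"]


def to_number(text, missing_codes=("", "-999")):
    """Return an int, or None when the text is an assumed missing-value code."""
    text = text.strip()
    if text in missing_codes:
        return None
    return int(text)


# Recode both columns so that missing values are explicit.
z = [to_number(value) for value in z_raw]
t = [to_number(value) for value in t_raw]
print(z)
print(t)
```

Without a codebook, treating `-999` as missing is a guess, so a real analysis would confirm it with the data provider first.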

Messy Data is Bad

  • Data should be meaningful (or it’s not useful)

  • We need to be able to understand what the data tells us when we look at it

    • What do the rows describe?
    • What does each value mean?

Exploratory Analysis

  • Once we understand our data we can explore it

  • Let’s use an example

    • Suppose we want to know what predicts a car’s city miles per gallon (mpg)

    • To explore this question, we’ll use data on vehicle mpg:

# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

Exploratory Analysis

  • How many different manufacturers do we have in our data?

    • 15
  • What are the oldest and newest model years in our data?

    • 1999, 2008
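
Questions like these come down to counting distinct values and taking a minimum and a maximum. Here is a minimal Python sketch (the deck's output comes from R) that uses only the six rows printed earlier, so the manufacturer count it produces is for that excerpt, not the full data set. Variable names are illustrative.

```python
# (manufacturer, year) for the six rows shown in the excerpt.
rows = [
    ("audi", 1999),
    ("audi", 1999),
    ("audi", 2008),
    ("audi", 2008),
    ("audi", 1999),
    ("audi", 1999),
]

# A set keeps one copy of each value, so its size is the distinct count.
manufacturers = {manufacturer for manufacturer, _ in rows}
n_manufacturers = len(manufacturers)

# The oldest and newest model years are the minimum and maximum.
years = [year for _, year in rows]
oldest, newest = min(years), max(years)

print(n_manufacturers, oldest, newest)
```

The excerpt contains only Audi rows, so the distinct count here is 1. The same steps on the full data give the 15 manufacturers reported above.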

Exploratory Analysis

  • What are the average city and highway MPG ratings?
      cty             hwy       
 Min.   : 9.00   Min.   :12.00  
 1st Qu.:14.00   1st Qu.:18.00  
 Median :17.00   Median :24.00  
 Mean   :16.86   Mean   :23.44  
 3rd Qu.:19.00   3rd Qu.:27.00  
 Max.   :35.00   Max.   :44.00  
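
A summary like the one above can be built from a handful of basic statistics. The Python sketch below (the deck's output comes from R) computes the minimum, median, mean, and maximum for the six `cty` and `hwy` values in the excerpt shown earlier. Its results describe those six rows only and will not match the full-data summary. The helper `summarize` is hypothetical.

```python
import statistics

# City and highway mpg for the six rows shown in the excerpt.
cty = [18, 21, 20, 21, 16, 18]
hwy = [29, 29, 31, 30, 26, 26]


def summarize(values):
    """Return the minimum, median, mean, and maximum of a list of numbers."""
    return {
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.fmean(values),
        "max": max(values),
    }


cty_summary = summarize(cty)
hwy_summary = summarize(hwy)
print(cty_summary)
print(hwy_summary)
```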

Modeling

  • In the modeling stage we estimate formal relationships between variables

  • We can also make predictions about future data
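
To make "estimating a relationship" concrete, here is a minimal Python sketch of a one-variable least-squares line (the deck itself uses R). It assumes we model city mpg as a function of engine displacement and fits only the six rows from the excerpt shown earlier, so the numbers are illustrative and are not the deck's model. The function `predict_cty` is hypothetical.

```python
# Engine displacement (litres) and city mpg for the six excerpt rows.
displ = [1.8, 1.8, 2.0, 2.0, 2.8, 2.8]
cty = [18, 21, 20, 21, 16, 18]

n = len(displ)
mean_x = sum(displ) / n
mean_y = sum(cty) / n

# Least-squares slope: co-variation of x and y divided by variation in x.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(displ, cty))
sxx = sum((x - mean_x) ** 2 for x in displ)
slope = sxy / sxx

# The fitted line passes through the point of means.
intercept = mean_y - slope * mean_x


def predict_cty(engine_displ):
    """Predict city mpg for a given engine displacement using the fitted line."""
    return intercept + slope * engine_displ


print(round(slope, 2), round(intercept, 2))
print(round(predict_cty(2.4), 1))
```

On these six rows the slope is negative: larger engines go with lower city mpg. Six observations from one manufacturer are far too few to draw conclusions from.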

Boot Camp Learning Objectives

If you make an honest effort to solve each practice problem, I promise that you will be able to do the following by the end of the workshop:

  1. Understand the principles of tidy data

  2. Load a data set in R and clean it for analysis

  3. Summarize and visualize data