Data Cleaning

We’ll be using RStudio for our workshop because it provides convenient support for both R and Python. It also provides an awesome visual markdown editor which lets you see your markdown syntax rendered in real time.

Markdown is arguably THE best tool for reporting results and we will be making use of it during this workshop series. In the final workshop, we’ll learn how to use markdown to write dynamic documents: documents that combine your written results with the code that produces all of your figures, tables, and models. Writing the code and prose together makes it easy to modify your report if the analysis changes, minimizes errors, and encourages reproducibility since readers can find your code in the body of the text. But all that is for another day. For now, let’s focus on data cleaning.

Session Learning Objectives

  1. Work with RMarkdown to write code and report results

  2. Learn the basics of R and refresh your Python skills

  3. Clean a dataset for analysis

    • Handle missing values
    • Rename variables
    • Create variables
    • Pivot data
    • Separate data
  4. Learn the “tidy data” philosophy

(R)Markdown Refresher

You create the biggest header using the # sign.

This is a smaller header

This is an even smaller header

… you get it

  • This makes a bullet point

  • What about a numbered list?

  1. Just use numbers… (the raw syntax for all of the above is shown below)
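
For reference, the raw markdown behind the examples above looks like this:

# The biggest header
## A smaller header
### An even smaller header

- This makes a bullet point

1. Just use numbers for a numbered list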

Viewing Markdown

How can we see what this will look like when it’s rendered? RStudio gives us two options. First, we can Knit the document by clicking the button on the toolbar that looks like this:

Knitting to an HTML file is easiest, but if you click the drop-down arrow to the right of Knit, you’ll see more output options. Knit works by calling Pandoc behind the scenes to render your markdown file into the output format you want.

The second way to see what your document will look like is to use RStudio’s visual markdown editor. You turn this on by clicking the button on the far right of the toolbar that looks like an “A”:

The last markdown tool you should know about for now is that RStudio can show you the document outline, which is helpful when writing a long document. To see it, click the button just to the left of the visual editor button:

Onwards!

To start a code block (where you write your code), you use three backticks ``` followed by the language you want to write in curly brackets {}, like this: ```{r}. To close the block, you put three backticks on their own line after your code.
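
A complete R chunk, then, looks like this:

```{r}
1 + 1
```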

Instead of adding a block manually, you can also use the little green insert button on the toolbar:

Learning R and Refreshing Python

R is just a really fancy calculator. Python is more general, but it can still perform the same calculations that R can. Let’s start with some basic math in Python. To start, insert a Python code chunk:

# let's start with addition
1 + 1
## 2
5 + 2
## 7

# notice the spaces around the + operator. This is considered good style. 
# It's a lot harder to read without spaces (it gets worse when we have 
# complex code).
1+1 # bad
## 2
5+2 # bad
## 7

# multiplication
4 * 8
## 32
5 * 19
## 95

# division
4 / 8
## 0.5
14 / 7
## 2.0

# power functions
4 ** 2
## 16

Now let’s reproduce this in R:

# addition
1 + 1
## [1] 2
5 + 2
## [1] 7
# multiplication
4 * 8
## [1] 32
5 * 19
## [1] 95
# division
4 / 8
## [1] 0.5
14 / 7
## [1] 2
# power functions
4 ^ 2
## [1] 16

Everything was the same except for the power function at the end: R uses ^ and Python uses **.

The answers to each problem are shown at the bottom of the code block, but what if we wanted to save an answer to retrieve it later? This is where the concept of assignment comes in. When we make an assignment, we create an object that holds whatever we tell it to. In Python it looks like this:

answer = 5 + 2

# again, note the spacing above. The spacing below is bad: 
answer=5+2 # bad

Now, if we want to see the answer to our operation, we just print the object we saved it to:

answer
## 7

Assignment works nearly the same way in R. The main difference is the assignment operator. Although = works for assignment in R, it’s considered bad practice; instead, we use <- when making an assignment.

answer <- 5 + 2
answer
## [1] 7

The second difference is that when making an assignment, R creates a copy of the object, while Python binds a new name to the same object.
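
Here is a quick sketch of what that means in practice:

# R copies on assignment: changing y does not affect x
x <- c(1, 2, 3)
y <- x       # y gets its own copy of the values
y[1] <- 99
x            # x is unchanged
## [1] 1 2 3
# (in Python, y = x would make both names point to the same list,
# so y[0] = 99 would change x too)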

If you look in the environment pane, you will see the object (answer) that we created. You might also notice that the top left corner of the environment pane says R. If you click that, you’ll see that you can also select Python to see all of the Python objects you’ve created.

Exercises

  1. Insert a Python code chunk and calculate 17 × 6. Assign your answer to an object named answer2.
  2. Now do the same in R.

Lists and Vectors

What if we wanted to combine several operations in one object? In Python, we accomplish this by creating a list:

a = [1, 2, 3]
a
## [1, 2, 3]

In R, we create the equivalent structure with the c function (short for “combine”) and parentheses ().

a <- c(1, 2, 3)
a
## [1] 1 2 3
# we can include text instead of numbers and operations. Put text in quotations
text <- c("This", "Shows", "That we can also have text")
print(text)
## [1] "This"                       "Shows"                     
## [3] "That we can also have text"

In R, we mainly refer to these as vectors. Like Python lists, vectors can contain text, text and numbers, or operations. One caveat: a vector holds a single type, so mixing text and numbers coerces everything to text, as b below shows.

text <- c("This", "Shows", "That we can also have text")
text
## [1] "This"                       "Shows"                     
## [3] "That we can also have text"
b <- c(2, 5, "Text")
b
## [1] "2"    "5"    "Text"
c <- c(3 + 3, 4 * 8, 17) # avoid naming objects `c` in real code; it shadows the built-in c()
c
## [1]  6 32 17

Types of Variables

When you work with data, chances are you’ll be working with many different types of data at the same time. Consider this pandas DataFrame:

import pandas as pd

data = pd.DataFrame(
  {"age": [33, 24, 42, 20, 19, 23],
   "education": ["High School", "Bachelors", "Bachelors", "High School", 
                 "Some College", "Grad School"],
   "surname": ["smith", "gonzales", "lee", "doe", "kim", "swanson"]}
)

data
##    age     education   surname
## 0   33   High School     smith
## 1   24     Bachelors  gonzales
## 2   42     Bachelors       lee
## 3   20   High School       doe
## 4   19  Some College       kim
## 5   23   Grad School   swanson

Here we have three different types of data. age is clearly numeric, and education and surname are strings. But is there a difference between education and surname? education is best considered a categorical variable because each level of education is a different category. surname, on the other hand, isn’t really categorical; it’s just text.

We can check the data types of the DataFrame in Python like this:

data.dtypes
## age           int64
## education    object
## surname      object
## dtype: object

For reasons that I won’t explain here, pandas uses the type “object” for strings, so this is correct. What about R? To recreate this data frame in R, we use the data.frame function:

r.data <- data.frame(
  age = c(33, 24, 42, 20, 19, 23),
  education = c("High School", "Bachelors", "Bachelors", "High School", 
                "Some College", "Grad School"),
  surname = c("smith", "gonzales", "lee", "doe", "kim", "swanson")
)

head(r.data)
##   age    education  surname
## 1  33  High School    smith
## 2  24    Bachelors gonzales
## 3  42    Bachelors      lee
## 4  20  High School      doe
## 5  19 Some College      kim
## 6  23  Grad School  swanson

The head() function shows the first 6 rows of data. Although this base R printout doesn’t label the column types, age holds numeric data (type double), while education and surname are character (string) columns.
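
To check the types directly, one option is to apply class() to each column with base R’s sapply():

# apply class() to every column of the data frame
sapply(r.data, class)
##       age   education     surname 
## "numeric" "character" "character"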

Earlier, I said that education should be a category, so why isn’t it? R and Python can’t tell the difference between plain strings and categories on their own, so we have to set the type manually.

In Python, we can convert education to a categorical type with the .astype("category") method:

data["education"].astype("category")
## 0     High School
## 1       Bachelors
## 2       Bachelors
## 3     High School
## 4    Some College
## 5     Grad School
## Name: education, dtype: category
## Categories (4, object): ['Bachelors', 'Grad School', 'High School', 'Some College']

Now in R:

r.data["education"] <- as.factor(r.data$education)
head(r.data)
##   age    education  surname
## 1  33  High School    smith
## 2  24    Bachelors gonzales
## 3  42    Bachelors      lee
## 4  20  High School      doe
## 5  19 Some College      kim
## 6  23  Grad School  swanson

In this R block, we used a new operator $. This operator allows us to access columns of a data frame. In an R block, try typing: r.data$ then hit the Tab key. All of the columns are listed out. You can use the arrow keys to navigate to the one you want and hit Tab again.
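
For example, we can print a single column like this:

# access the age column with $
r.data$age
## [1] 33 24 42 20 19 23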

Setting your data types correctly is an important aspect of data cleaning because it will make data visualization and analysis much easier down the line.

You can convert (also known as casting or coercing) a data frame column to other types of data as well, including numeric, character, integer, and datetime types.
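
For instance, here are a few of the base R coercion functions in action (with illustrative values):

as.integer("42")        # character to integer
## [1] 42
as.numeric("3.14")      # character to double
## [1] 3.14
as.character(7)         # number to character
## [1] "7"
as.Date("2013-01-01")   # character to date
## [1] "2013-01-01"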

Exercises

  1. Use R to coerce the age variable in r.data to an integer.
  2. Use Python to cast the age variable as an integer. (Hint: the pandas integer type is int64.)

Getting Help!

Like Python, R has very useful help options. If you place a ? before any function, it will pull up a help page:

?mean()

Exercise

  1. Create this vector in R: vector <- c(123, 45, NA, 78, 17), then find the mean.

Clean a Dataset for Analysis

Data cleaning (also called munging or wrangling) is the primary task of any data science project. Data is really great for improving decisions, but only if it is in a usable format and only if we know what the data is measuring. There are a lot of different tasks involved in data cleaning, and in this workshop we will focus on three of the most important:

  • Handling missing data
  • Renaming variables
  • Creating new variables

To see all of these tasks in action, let’s get started with an actual dataset. The dataset that we’ll be using for the remainder of this workshop is about flights to and from various destinations. Let’s load it in with Python:

flights_data = pd.read_csv("https://raw.githubusercontent.com/vaibhavwalvekar/NYC-Flights-2013-Dataset-Analysis/master/flights.csv")

flights_data.head()
##    Unnamed: 0  year  month  day  ...  air_time  distance  hour  minute
## 0           1  2013      1    1  ...     227.0      1400   5.0    17.0
## 1           2  2013      1    1  ...     227.0      1416   5.0    33.0
## 2           3  2013      1    1  ...     160.0      1089   5.0    42.0
## 3           4  2013      1    1  ...     183.0      1576   5.0    44.0
## 4           5  2013      1    1  ...     116.0       762   5.0    54.0
## 
## [5 rows x 17 columns]

Note that we can also load this data into R in a very similar way. First we’ll load the tidyverse package, then load the data:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
flights <- read_csv("https://raw.githubusercontent.com/vaibhavwalvekar/NYC-Flights-2013-Dataset-Analysis/master/flights.csv")
## New names:
## * `` -> ...1
## Rows: 336776 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): carrier, tailnum, origin, dest
## dbl (13): ...1, year, month, day, dep_time, dep_delay, arr_time, arr_delay, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(flights)
## # A tibble: 6 × 17
##    ...1  year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##   <dbl> <dbl> <dbl> <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <chr>   <chr>  
## 1     1  2013     1     1      517         2      830        11 UA      N14228 
## 2     2  2013     1     1      533         4      850        20 UA      N24211 
## 3     3  2013     1     1      542         2      923        33 AA      N619AA 
## 4     4  2013     1     1      544        -1     1004       -18 B6      N804JB 
## 5     5  2013     1     1      554        -6      812       -25 DL      N668DN 
## 6     6  2013     1     1      554        -4      740        12 UA      N39463 
## # … with 7 more variables: flight <dbl>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>

Of course, the read functions in Python and R also accept traditional file paths and file names instead of URLs like this:

read_csv("Documents/my_project/data.csv")

And, you can read in many other types of data with Python and R. Here is a table that shows the functions needed to read in other common file types:

Functions for Reading in Different Files

File Type              R Function              Python Function
CSV                    readr::read_csv         pandas.read_csv
Excel (.xls & .xlsx)   readxl::read_excel      pandas.read_excel
SPSS                   haven::read_spss        pandas.read_spss
Stata                  haven::read_dta         pandas.read_stata
SAS                    haven::read_sas         pandas.read_sas
Feather                feather::read_feather   pandas.read_feather

The beauty of using a tool like RStudio is that we don’t have to only use R or only use Python. RStudio lets us use both languages to work with the same data. This is made possible by the reticulate package in R. After we load it in, we can use R and Python to interact with the same data. This is how it works:

library(reticulate) # load the reticulate package

# to access data loaded with Python from R, just use the py$ prefix
head(py$flights_data)
##   Unnamed: 0 year month day dep_time dep_delay arr_time arr_delay carrier
## 1          1 2013     1   1      517         2      830        11      UA
## 2          2 2013     1   1      533         4      850        20      UA
## 3          3 2013     1   1      542         2      923        33      AA
## 4          4 2013     1   1      544        -1     1004       -18      B6
## 5          5 2013     1   1      554        -6      812       -25      DL
## 6          6 2013     1   1      554        -4      740        12      UA
##   tailnum flight origin dest air_time distance hour minute
## 1  N14228   1545    EWR  IAH      227     1400    5     17
## 2  N24211   1714    LGA  IAH      227     1416    5     33
## 3  N619AA   1141    JFK  MIA      160     1089    5     42
## 4  N804JB    725    JFK  BQN      183     1576    5     44
## 5  N668DN    461    LGA  ATL      116      762    5     54
## 6  N39463   1696    EWR  ORD      150      719    5     54

From now on we’ll only be using the flights_data that we imported with Python. We’ll also start to slowly transition to using R over Python.

Missing Data

It’s important to handle missing data correctly because missing values can influence analyses.

There are two types of missing data. The first is called “explicitly” missing data. Explicitly missing data values show up in R or Python as NA or NaN and these values tell you that there is no data for a given cell. Here is an example:

df <- tibble(
  state = c("CA", "CA", "AZ", "AZ", "TX", "TX"),
  year = c(2000, 2001, 2000, 2001, 2000, 2001),
  enrollments = c(1521, 1231, 3200, 2785, 6731, NA)
)
df
## # A tibble: 6 × 3
##   state  year enrollments
##   <chr> <dbl>       <dbl>
## 1 CA     2000        1521
## 2 CA     2001        1231
## 3 AZ     2000        3200
## 4 AZ     2001        2785
## 5 TX     2000        6731
## 6 TX     2001          NA

It’s clear that there is no data for Texas in 2001. But what if there simply were no row for Texas in 2001 instead of an NA value? For example:

df <- tibble(
  state = c("CA", "CA", "AZ", "AZ", "TX"),
  year = c(2000, 2001, 2000, 2001, 2000),
  enrollments = c(1521, 1231, 3200, 2785, 6731)
)
df
## # A tibble: 5 × 3
##   state  year enrollments
##   <chr> <dbl>       <dbl>
## 1 CA     2000        1521
## 2 CA     2001        1231
## 3 AZ     2000        3200
## 4 AZ     2001        2785
## 5 TX     2000        6731

This would be known as an “implicitly” missing value. We don’t have the data for Texas in 2001, which, by definition, means it’s missing, but the missingness isn’t explicit in the data. If we wanted to make it explicit, we could use the complete() function, which generates every combination of the supplied variables and fills the gaps with NA:

complete(df, state, year)
## # A tibble: 6 × 3
##   state  year enrollments
##   <chr> <dbl>       <dbl>
## 1 AZ     2000        3200
## 2 AZ     2001        2785
## 3 CA     2000        1521
## 4 CA     2001        1231
## 5 TX     2000        6731
## 6 TX     2001          NA

Also watch out for missing values coded as sentinel numbers like -99 or -999: if you leave them in, they will distort statistical summaries like the mean, median, and even correlations.
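
If you run into sentinel codes like these, one option is dplyr’s na_if() function, which converts a specific value into a proper NA. A minimal sketch with made-up enrollment data:

# hypothetical data where -99 stands in for a missing count
df_sentinel <- tibble(
  state = c("CA", "AZ", "TX"),
  enrollments = c(1521, -99, 6731)
)

# replace every -99 in enrollments with NA
df_sentinel %>% 
  mutate(enrollments = na_if(enrollments, -99))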

Both R and Python have default ways of handling missing values. In R, they are displayed as NA and in Python as NaN. Let’s see an example of this:

flights_data
##         Unnamed: 0  year  month  day  ...  air_time  distance  hour  minute
## 0                1  2013      1    1  ...     227.0      1400   5.0    17.0
## 1                2  2013      1    1  ...     227.0      1416   5.0    33.0
## 2                3  2013      1    1  ...     160.0      1089   5.0    42.0
## 3                4  2013      1    1  ...     183.0      1576   5.0    44.0
## 4                5  2013      1    1  ...     116.0       762   5.0    54.0
## ...            ...   ...    ...  ...  ...       ...       ...   ...     ...
## 336771      336772  2013      9   30  ...       NaN       213   NaN     NaN
## 336772      336773  2013      9   30  ...       NaN       198   NaN     NaN
## 336773      336774  2013      9   30  ...       NaN       764   NaN     NaN
## 336774      336775  2013      9   30  ...       NaN       419   NaN     NaN
## 336775      336776  2013      9   30  ...       NaN       431   NaN     NaN
## 
## [336776 rows x 17 columns]

Which columns have missing values?

In Python, we can answer this by pairing the .isnull() and .any() methods:

flights_data.isnull().any()
## Unnamed: 0    False
## year          False
## month         False
## day           False
## dep_time       True
## dep_delay      True
## arr_time       True
## arr_delay      True
## carrier       False
## tailnum        True
## flight        False
## origin        False
## dest          False
## air_time       True
## distance      False
## hour           True
## minute         True
## dtype: bool

Here we see that quite a few variables have missing values.

How many missing values are there?

flights_data.isnull().sum()
## Unnamed: 0       0
## year             0
## month            0
## day              0
## dep_time      8255
## dep_delay     8255
## arr_time      8713
## arr_delay     9430
## carrier          0
## tailnum       2512
## flight           0
## origin           0
## dest             0
## air_time      9430
## distance         0
## hour          8255
## minute        8255
## dtype: int64

We could replicate this with R code, but Python already provides good tools to answer these questions.
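
For the curious, here’s one way to get the same counts in R using base functions:

# count the missing values in each column
colSums(is.na(py$flights_data))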

Next, let’s look at the summary statistics for our data to check for any unusual values:

flights_data.describe()
##           Unnamed: 0      year  ...           hour         minute
## count  336776.000000  336776.0  ...  328521.000000  328521.000000
## mean   168388.500000    2013.0  ...      13.173544      31.755501
## std     97219.001466       0.0  ...       4.894426      18.230997
## min         1.000000    2013.0  ...       0.000000       0.000000
## 25%     84194.750000    2013.0  ...       9.000000      16.000000
## 50%    168388.500000    2013.0  ...      14.000000      31.000000
## 75%    252582.250000    2013.0  ...      17.000000      49.000000
## max    336776.000000    2013.0  ...      24.000000      59.000000
## 
## [8 rows x 13 columns]
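
If you want a similar overview in R, the summary() function prints quartiles, means, and NA counts for each column:

# summary statistics for every column
summary(py$flights_data)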

What should we do with missing values?

There really isn’t a single “right” way to handle missing values, so you need to think carefully about whatever decision you make. The solution we’ll use here is simply to drop all missing values. Why? Missing values can’t be put in a plot or used in a regression, so we don’t necessarily need them. That said, missingness can mean something more than just “no data.” For example, data might be systematically missing for a particular group of people. That would definitely affect our analyses and we would want to address it. That’s a very different situation than having a value missing for a random, non-systematic reason.

Now that we’ve seen a small comparison of Python with R, we’re going to start primarily using R to do the rest of our cleaning and exploration. We’ll be using R’s tidyverse specifically, which is a collection of packages that make R syntax easy to read, easy to type, and powerful.

So, let’s drop all missing values, understanding that this decision may not be the best for all situations. We’ll accomplish this with the drop_na() function provided by the tidyr R package (one of the packages included in the tidyverse). We’re also going to create a new object to save our changes to, so that we can start building the dataset we need for analysis and can still get back to our original data if needed.

flights_no_na <- drop_na(py$flights_data)

How many rows did we drop? Let’s check:

# before dropping NAs
nrow(py$flights_data)
## [1] 336776
# after dropping NAs
nrow(flights_no_na)
## [1] 327346
# total rows dropped
nrow(py$flights_data) - nrow(flights_no_na)
## [1] 9430

Let’s use Python to confirm that we dropped all missing values:

r.flights_no_na.isnull().any()
## Unnamed: 0    False
## year          False
## month         False
## day           False
## dep_time      False
## dep_delay     False
## arr_time      False
## arr_delay     False
## carrier       False
## tailnum       False
## flight        False
## origin        False
## dest          False
## air_time      False
## distance      False
## hour          False
## minute        False
## dtype: bool

Looks good!

Renaming Variables

Another common data cleaning task is renaming variables. Sometimes variable names are completely uninformative - something like CMP23A_001 - which doesn’t make the data any easier to understand. Let’s check out the names of the variables in our dataset:

colnames(flights_no_na)
##  [1] "Unnamed: 0" "year"       "month"      "day"        "dep_time"  
##  [6] "dep_delay"  "arr_time"   "arr_delay"  "carrier"    "tailnum"   
## [11] "flight"     "origin"     "dest"       "air_time"   "distance"  
## [16] "hour"       "minute"

These are actually pretty good, but we’ll make some changes anyway. Most of our variable names use a _ separator when they involve two words; tailnum is the exception. Let’s rename this variable to tail_num.

To do this, we’re going to make use of the “pipe” operator: %>%. This is a special operator used to chain multiple functions together, letting us execute several operations in sequence. I’ll write out the code to rename tailnum below, with comments explaining what each step does:

# first we make an assignment to save our changes
# since we want to use the same object name, we'll assign all of our changes to it
# this first line can be read aloud as: "create an object named 'flights_no_na'
# using the existing object 'flights_no_na' THEN ...
flights_no_na <- flights_no_na %>% 
  # rename NEW NAME = OLD NAME
  rename(tail_num = tailnum)

colnames(flights_no_na)
##  [1] "Unnamed: 0" "year"       "month"      "day"        "dep_time"  
##  [6] "dep_delay"  "arr_time"   "arr_delay"  "carrier"    "tail_num"  
## [11] "flight"     "origin"     "dest"       "air_time"   "distance"  
## [16] "hour"       "minute"

We’ll also use this opportunity to drop the variable Unnamed: 0 since it doesn’t appear to contain any useful information. We can drop a variable with the select() function. select() is used to pick out only the variables that we want like this:

flights_no_na %>% 
  select(origin, dest) %>% 
  head()
##   origin dest
## 1    EWR  IAH
## 2    LGA  IAH
## 3    JFK  MIA
## 4    JFK  BQN
## 5    LGA  ATL
## 6    EWR  ORD

Here I only selected the origin and dest variables but I made no assignment. That means that I didn’t make any changes to the flights_no_na data frame. See:

head(flights_no_na)
##   Unnamed: 0 year month day dep_time dep_delay arr_time arr_delay carrier
## 1          1 2013     1   1      517         2      830        11      UA
## 2          2 2013     1   1      533         4      850        20      UA
## 3          3 2013     1   1      542         2      923        33      AA
## 4          4 2013     1   1      544        -1     1004       -18      B6
## 5          5 2013     1   1      554        -6      812       -25      DL
## 6          6 2013     1   1      554        -4      740        12      UA
##   tail_num flight origin dest air_time distance hour minute
## 1   N14228   1545    EWR  IAH      227     1400    5     17
## 2   N24211   1714    LGA  IAH      227     1416    5     33
## 3   N619AA   1141    JFK  MIA      160     1089    5     42
## 4   N804JB    725    JFK  BQN      183     1576    5     44
## 5   N668DN    461    LGA  ATL      116      762    5     54
## 6   N39463   1696    EWR  ORD      150      719    5     54

All the variables are still there!

Now, to drop Unnamed: 0 we can use a minus sign - in the select function like this:

flights_no_na %>% 
  select(-`Unnamed: 0`) %>% 
  head()
##   year month day dep_time dep_delay arr_time arr_delay carrier tail_num flight
## 1 2013     1   1      517         2      830        11      UA   N14228   1545
## 2 2013     1   1      533         4      850        20      UA   N24211   1714
## 3 2013     1   1      542         2      923        33      AA   N619AA   1141
## 4 2013     1   1      544        -1     1004       -18      B6   N804JB    725
## 5 2013     1   1      554        -6      812       -25      DL   N668DN    461
## 6 2013     1   1      554        -4      740        12      UA   N39463   1696
##   origin dest air_time distance hour minute
## 1    EWR  IAH      227     1400    5     17
## 2    LGA  IAH      227     1416    5     33
## 3    JFK  MIA      160     1089    5     42
## 4    JFK  BQN      183     1576    5     44
## 5    LGA  ATL      116      762    5     54
## 6    EWR  ORD      150      719    5     54

A couple things to note here. First, I made no assignment, so I didn’t save the changes (i.e. Unnamed: 0 is still in our original data frame). Second, check out the backtick `` marks around Unnamed: 0. We need those because Unnamed: 0 contains a space (and a colon), which R doesn’t allow in standard names.

With all that in mind, let’s drop Unnamed: 0 once and for all by including an assignment in our code:

flights_no_na <- flights_no_na %>% 
  select(-`Unnamed: 0`)

Exercise

  1. Rename dest as destination.

Creating Variables

Creating variables is easy with the mutate() function. Say we want to create a new variable, tot_delay, that is the sum of the departure delay (dep_delay) and the arrival delay (arr_delay) of each flight:

flights_no_na <- flights_no_na %>% 
  mutate(tot_delay = dep_delay + arr_delay)

We can create more than one variable at a time like this:

flights_no_na <- flights_no_na %>% 
  mutate(tot_delay = dep_delay + arr_delay,
         tot_delay_alt = (dep_delay + arr_delay) / 60) # total delay in hours

mutate() can also be used to modify existing variables. For example, suppose we wanted to convert carrier to a categorical (factor) type:

flights_no_na <- flights_no_na %>% 
  mutate(carrier = as.factor(carrier))

head(flights_no_na)
##   year month day dep_time dep_delay arr_time arr_delay carrier tail_num flight
## 1 2013     1   1      517         2      830        11      UA   N14228   1545
## 2 2013     1   1      533         4      850        20      UA   N24211   1714
## 3 2013     1   1      542         2      923        33      AA   N619AA   1141
## 4 2013     1   1      544        -1     1004       -18      B6   N804JB    725
## 5 2013     1   1      554        -6      812       -25      DL   N668DN    461
## 6 2013     1   1      554        -4      740        12      UA   N39463   1696
##   origin dest air_time distance hour minute tot_delay
## 1    EWR  IAH      227     1400    5     17        13
## 2    LGA  IAH      227     1416    5     33        24
## 3    JFK  MIA      160     1089    5     42        35
## 4    JFK  BQN      183     1576    5     44       -19
## 5    LGA  ATL      116      762    5     54       -31
## 6    EWR  ORD      150      719    5     54         8

Exercise

  1. Use mutate() to compute the total air time for each flight (Hint: use dep_time and arr_time). Does this value equal the value in air_time? Why or why not?

The “tidy data” Philosophy

The flights data we’ve been working with is easy to handle because it’s pretty much clean as is. In the real world, you’ll almost never be so lucky. Data is easiest to work with when it meets three conditions:

Source: R for Data Science Ch. 12

  1. Each variable has its own column.
  2. Each observation has its own row.
  3. Each value has its own cell.

Let’s look at some example datasets to get a better sense of what I’m talking about.

table1
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Does each variable have its own column? Is each observation in a row? Does each value have its own cell?

What about these datasets?

table2
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table3
## # A tibble: 6 × 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
table4a
## # A tibble: 3 × 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

These are more typical of what you’ll be working with in your own projects. To “tidy” these up, we’ll need to learn a couple more tools:

  • pivoting

  • separating

Pivoting Data

Pivoting data means rearranging the rows or columns of your data so that each column is a variable and each row is an observation. There are two types of pivots: longer and wider.

Pivoting Longer

We need to pivot data to be longer when we have columns of data that represent values, not variables. table4a is a good example.

table4a
## # A tibble: 3 × 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

Here, 1999 and 2000 are obviously the names of values (years) so they should share a single column. We can use pivot_longer to accomplish this:

table4a %>% 
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "value")
## # A tibble: 6 × 3
##   country     year   value
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

Pivoting Wider

When we have multiple variables in a single column, we need to pivot our data wider. table2 is a good example of this:

table2
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

We can fix this with pivot_wider:

table2 %>% 
  pivot_wider(id_cols = c(country, year), names_from = type, values_from = count)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Exercises

  1. Use pivot_wider to pivot this data into a tidy format:

    df <- tibble(
      state = c("CA", "CA", "AZ", "AZ", "TX"),
      year = c(2000, 2000, 2000, 2000, 2000),
      category = c("new_enrollments", "prev_enrollments", "new_enrollments",
                   "prev_enrollments", "new_enrollments"),
      count = c(1521, 30040, 3200, 45002, 5671)
    )
  2. Why does prev_enrollments show a missing value for Texas? If we wanted to fill in missing values with a value of 0, how can we do that with pivot_wider?

  3. Use pivot_longer to tidy this data:

    df <- tibble(
      state = c("CA", "AZ"),
      `2000` = c(1342, 2310),
      `2001` = c(1204, 2013)
    )

Separating

Sometimes a single column will contain two values, which violates the rule that each value gets its own cell. table3 shows an example:

table3
## # A tibble: 6 × 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

Here, the rate variable is showing the number of cases divided by (/) the population. We can use separate() to make this a lot cleaner:

table3 %>% 
  separate(col = rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 × 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

This is good, but notice that our new variables were parsed as characters, which means they aren’t treated as numbers. To fix this, we can add the convert argument and set it equal to TRUE. This tells R to detect what type of data our new variables should be:

table3 %>% 
  separate(col = rate, into = c("cases", "population"), sep = "/", 
           convert = TRUE)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Now that they are numeric, we can add a mutate() to calculate the actual rate:

table3 %>% 
  separate(col = rate, into = c("cases", "population"), sep = "/", 
           convert = TRUE) %>% 
  mutate(rate = cases / population)
## # A tibble: 6 × 5
##   country      year  cases population      rate
##   <chr>       <int>  <int>      <int>     <dbl>
## 1 Afghanistan  1999    745   19987071 0.0000373
## 2 Afghanistan  2000   2666   20595360 0.000129 
## 3 Brazil       1999  37737  172006362 0.000219 
## 4 Brazil       2000  80488  174504898 0.000461 
## 5 China        1999 212258 1272915272 0.000167 
## 6 China        2000 213766 1280428583 0.000167

Exercises

  1. Use separate() to tidy this dataset:

    df <- tibble(
      state = c("CA", "CA", "AZ", "AZ", "TX", "TX"),
      year = c(2000, 2001, 2000, 2001, 2000, 2001),
      net_change = c("1521-30040", "2071-32040", "3200-45002", "2970-42412", 
                     "5671-67210", "5891-63610")
    )

Save Your Cleaned Data

After working through your data cleaning, it’s a good idea to save your cleaned data to a new file that you can import later for more exploration and analysis.

Before saving, there is just one more issue to resolve in our data. Somewhere in the translation between Python and R, the tail_num variable became a list column. Check it out:

class(flights_no_na$tail_num)
## [1] "list"

Let’s get that back to a regular character column.

flights_no_na <- 
  flights_no_na %>% 
  unnest(tail_num)

Ok, all is right in the world now. Let’s save the final dataset.

We can write our data to a CSV with the write_csv() function in R, specifying a file path to save it to:

write_csv(flights_no_na, "~/Documents/my_project/flights_cleaned.csv")

CSVs are a great general-purpose format, but there are others we could use as well. One format specific to R and Python is “feather”, designed by Wes McKinney (author of pandas) and Hadley Wickham (principal author of the tidyverse) to be read by both languages. Feather files are fast and efficient.

library(feather)
write_feather(flights_no_na, "~/Documents/my_project/flights_cleaned.feather")
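
Reading it back in later is just as easy; a quick sketch, assuming the same file path:

# read the feather file back into R
flights_cleaned <- read_feather("~/Documents/my_project/flights_cleaned.feather")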

Summary

In this lesson we learned how R and Python can be used together to work with data in RStudio. While we only scratched the surface of R and the tidyverse, we saw how it has powerful data cleaning functions that will help prepare data for further analysis.

Next time, we’ll learn how to visualize our data and identify patterns.
