Basic Data Cleaning with R

Overview

Now we’re going to learn how to prepare our data for analysis and how to visualize it!

Importing the Data

Start by importing the data set. Don’t forget to put the file name in quotation marks.

leg.data <- read.csv("leg_data.csv")
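
After an import, it can help to check that R read the columns the way we expected. The sketch below is self-contained: it writes a small made-up CSV to a temporary file and reads it back. The rows and the name toy are assumptions for illustration, not the contents of leg_data.csv.

```r
# Hypothetical toy file; these rows are invented for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,party,cd",
             "Smith,1,3",
             "Jones,2,1",
             "Lee,-99,2"), tmp)

toy <- read.csv(tmp)

str(toy)    # shows the type R inferred for each column
head(toy)   # shows the first few rows
```

The same two checks, str() and head(), should work on leg.data once it has been imported.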

Working with Categorical Data

Let’s give the 1’s and 2’s of party more informative labels. To do this in R, we’ll convert this variable to a categorical variable with the factor function.

leg.data$party <- factor(x = leg.data$party,
                         levels = c(1, 2), # these are the values in party
                         # Label the levels with these labels:
                         labels = c("Republican", "Democrat"))

summary(leg.data$party)

## Republican   Democrat       NA's 
##          3          3          1

Let’s look at the full data set to see what we did.

View(leg.data)

Notice that R automatically converted the value of -99 to “NA,” which stands for “Not Available.” This is how R tells us that the data is missing. This happened because we didn’t list -99 as one of the levels, and factor() turns any value that is not among the levels into NA.
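
Here is a minimal sketch of the same behavior on a small vector. The values in party_codes are made up for illustration and are not taken from leg.data.

```r
# Hypothetical codes; -99 stands in for a missing-value code.
party_codes <- c(1, 2, -99, 1)

party <- factor(party_codes,
                levels = c(1, 2),
                labels = c("Republican", "Democrat"))

is.na(party)                    # TRUE only where the code was not a listed level
table(party, useNA = "ifany")   # counts per label, including the NA
```

Counting the NA values this way is a quick check that only the codes we meant to treat as missing were dropped.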

When we ask R to convert one type of data to another and it encounters an impossible case, it creates an NA value and usually gives us a warning. For example:

x <- c("1", "b", "3")

class(x)

## [1] "character"

We just created a character vector. Let’s see what happens when we try to convert it to numeric data.

as.numeric(x)

## Warning: NAs introduced by coercion

## [1]  1 NA  3

R has no way of knowing which number we want for “b,” so it returns NA instead of guessing.
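
One way to find out which entries could not be converted is to compare the original vector with the converted one. This is a sketch, not the only approach; the names x_num and bad are chosen here for illustration.

```r
x <- c("1", "b", "3")

# Convert, silencing the coercion warning we already expect.
x_num <- suppressWarnings(as.numeric(x))

# Entries that became NA during conversion but were not missing before.
bad <- x[is.na(x_num) & !is.na(x)]
bad
```

Looking at bad before going further makes it easier to decide whether those entries are typos to fix or values that really are missing.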

Naming Data

It’s always a good idea to have informative variable names. This helps us keep track of what data we have. Let’s look at the names in our data.

names(leg.data)

## [1] "x"      "party"  "st"     "cd"     "XMP098" "M0974"

x, XMP098, and M0974 aren’t very descriptive. Let’s fix that.

names(leg.data)[1] <- "name"
names(leg.data)

## [1] "name"   "party"  "st"     "cd"     "XMP098" "M0974"

Ok, that fixed the first one. We can use the same process to fix the last two.
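
Renaming by position works, but it breaks if the column order changes. A possible alternative is to look the column up by its current name. The sketch below uses a made-up data frame; toy and the column name V2 are hypothetical.

```r
# Hypothetical data frame with one uninformative column name.
toy <- data.frame(x = c("Smith", "Jones"),
                  V2 = c(1, 2),
                  stringsAsFactors = FALSE)

# Find the column called "V2" and rename it, wherever it sits.
names(toy)[names(toy) == "V2"] <- "party"
names(toy)
```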

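Suppose we want to sort our data by birth year from the oldest date to the most recent. We can use the sort() function for this, but the column we name has to exist. Our data has no column called year yet (birth year is still stored as M0974), so the call below returns NULL.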

sort(leg.data$year)

## NULL
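
To see what sort() does when the column exists, here is a sketch on a made-up data frame. The names and years are invented for illustration. Note that sort() returns only the sorted values; to reorder whole rows, one common approach is order().

```r
# Hypothetical data; one year is missing.
toy <- data.frame(name = c("A", "B", "C", "D"),
                  year = c(1974, 1958, NA, 1969),
                  stringsAsFactors = FALSE)

sort(toy$year)                    # sorted values; NA is dropped by default
sort(toy$year, na.last = TRUE)    # keep the NA and put it at the end

# Reorder the rows of the data frame by year (order() puts NA last by default).
sorted <- toy[order(toy$year), ]
sorted
```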

Summary Statistics

We can calculate some quick summary statistics with the summary() function.

summary(leg.data)

##      name                  party        st                  cd       
##  Length:7           Republican:3   Length:7           Min.   :1.000  
##  Class :character   Democrat  :3   Class :character   1st Qu.:1.500  
##  Mode  :character   NA's      :1   Mode  :character   Median :2.000  
##                                                       Mean   :2.714  
##                                                       3rd Qu.:3.000  
##                                                       Max.   :7.000  
##                                                                      
##      XMP098            M0974     
##  Min.   :-6.0000   Min.   :1958  
##  1st Qu.: 1.0000   1st Qu.:1966  
##  Median : 1.0000   Median :1969  
##  Mean   : 0.4286   Mean   :1970  
##  3rd Qu.: 2.0000   3rd Qu.:1974  
##  Max.   : 2.0000   Max.   :1980  
##                    NA's   :1

These don’t look very pretty though. Let’s clean them up with the practice problems!
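
One thing to keep in mind when computing a single statistic on a column with missing values: functions such as mean() return NA unless told to skip the missing entries. A minimal sketch with made-up numbers (birth_year here is a hypothetical vector, not a column of leg.data):

```r
# Hypothetical values with one missing entry.
birth_year <- c(1958, 1966, NA, 1980)

with_na    <- mean(birth_year)                 # NA, because one value is missing
without_na <- mean(birth_year, na.rm = TRUE)   # average of the non-missing values

with_na
without_na
```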

Practice Problems

Give XMP098 (this is the variable for sex) and M0974 (this is the variable for birth year) more informative variable names.
What type of data should cd be? Can you convert it to that type?
Give the values “1” and “2” of XMP098 informative labels.
What is the average birth year?

Last updated on September 20, 2021