Basic Data Cleaning with R
Overview
Now we’re going to learn how to prepare our data for analysis and how to visualize it!
Importing the Data
Start by importing the data set. Don’t forget to put the file name in quotation marks ("").
leg.data <- read.csv("leg_data.csv")
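It is worth checking right after an import that the file loaded the way you expected. The sketch below is self-contained, so it first writes a small made-up CSV to a temporary file; the column names and values are hypothetical stand-ins, not the real contents of leg_data.csv.

```r
# Hypothetical stand-in data; the real leg_data.csv will differ.
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,party,cd",
             "Smith,1,3",
             "Jones,2,1",
             "Lee,-99,2"), tmp)

toy <- read.csv(tmp)

str(toy)   # one line per column, showing its type and first values
head(toy)  # the first few rows
dim(toy)   # number of rows, then number of columns
```

Running the same three checks on leg.data should tell you quickly whether every column was read in with a sensible type.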
Working with Categorical Data
Let’s give the 1’s and 2’s of party more informative labels. To do this in R, we’ll convert this variable to a categorical variable with the factor() function.
leg.data$party <- factor(x = leg.data$party,
                         levels = c(1, 2), # these are the values in party
                         # Label the levels with these labels:
                         labels = c("Republican", "Democrat"))
summary(leg.data$party)
## Republican Democrat NA's
## 3 3 1
Let’s look at the full data set to see what we did.
View(leg.data)
Notice that R automatically converted the value of -99 to “NA” which stands for “Not Available.” This is how R tells us that the data is missing. This happened because we didn’t list “-99” as one of the category options.
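The same behavior can be seen on a small vector. This is a minimal sketch with made-up values (party_raw is a hypothetical name), showing that any value not listed in levels becomes NA and how to count those cases.

```r
# Toy vector with a hypothetical missing-data code of -99
party_raw <- c(1, 2, -99, 1)

party <- factor(party_raw,
                levels = c(1, 2),
                labels = c("Republican", "Democrat"))

party                          # the -99 entry is shown as <NA>
is.na(party)                   # TRUE only for the third element
table(party, useNA = "ifany")  # counts per label, including missing values
```

Checking the NA count after a conversion like this is a simple way to catch values you forgot to list as levels.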
When we ask R to convert one type of data to another and it encounters a value it cannot convert, it gives us a warning and creates an NA value. For example:
x <- c("1", "b", "3")
class(x)
## [1] "character"
We just created a character vector. Let’s see what happens when we try to convert it to numeric data.
as.numeric(x)
## Warning: NAs introduced by coercion
## [1] 1 NA 3
R has no way of guessing which number you meant by "b", so it does not try.
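When a conversion produces NAs, it helps to find out which original entries caused them. One possible approach, sketched here with the same vector (x_num and failed are illustrative names):

```r
x <- c("1", "b", "3")

# suppressWarnings() hides the coercion warning we already expect
x_num <- suppressWarnings(as.numeric(x))

# Entries that were present as text but failed to convert
failed <- is.na(x_num) & !is.na(x)
x[failed]
## [1] "b"
```

Looking at the failed entries lets you decide whether they are typos to fix or genuinely missing values.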
Naming Data
It’s always a good idea to have informative variable names. This helps us keep track of what data we have. Let’s look at the names in our data.
names(leg.data)
## [1] "x" "party" "st" "cd" "XMP098" "M0974"
x, XMP098, and M0974 aren’t very descriptive. Let’s fix that.
names(leg.data)[1] <- "name"
names(leg.data)
## [1] "name" "party" "st" "cd" "XMP098" "M0974"
Ok, that fixed the first one. We can use the same process to fix the last two.
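Renaming by position works, but it breaks silently if the column order ever changes. A slightly safer variant matches on the old name instead. The sketch below uses a made-up data frame; toy, V1, Q17, and age are all hypothetical names.

```r
# Hypothetical toy data frame; column names are made up for illustration
toy <- data.frame(V1 = c("a", "b"), Q17 = c(34, 51))

# Rename by matching the old name instead of counting positions
names(toy)[names(toy) == "Q17"] <- "age"

names(toy)  # the columns are now "V1" and "age"
```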
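Suppose we want to sort our data by year from the oldest date to the most recent. We can use the sort() function for this.
sort(leg.data$year)
## NULL
This returns NULL because our data does not have a column called year yet. The birth years are still stored under the name M0974, so the call will only work once that variable has been renamed.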
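Note that sort() works on a single vector and returns only the sorted values; it does not rearrange the rows of a data frame. To reorder whole rows, one common option is order(). A minimal sketch on made-up data (the toy data frame and its values are hypothetical):

```r
# Hypothetical toy data; names and years are made up
toy <- data.frame(name = c("A", "B", "C"),
                  year = c(1974, 1958, 1966),
                  stringsAsFactors = FALSE)

sort(toy$year)  # returns only the sorted values
## [1] 1958 1966 1974

# order() gives the row positions that would sort the column,
# which we can use to reorder the whole data frame
toy_sorted <- toy[order(toy$year), ]
toy_sorted$name
## [1] "B" "C" "A"
```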
Summary Statistics
We can calculate some quick summary statistics with the summary() function.
summary(leg.data)
## name party st cd
## Length:7 Republican:3 Length:7 Min. :1.000
## Class :character Democrat :3 Class :character 1st Qu.:1.500
## Mode :character NA's :1 Mode :character Median :2.000
## Mean :2.714
## 3rd Qu.:3.000
## Max. :7.000
##
## XMP098 M0974
## Min. :-6.0000 Min. :1958
## 1st Qu.: 1.0000 1st Qu.:1966
## Median : 1.0000 Median :1969
## Mean : 0.4286 Mean :1970
## 3rd Qu.: 2.0000 3rd Qu.:1974
## Max. : 2.0000 Max. :1980
## NA's :1
These don’t look very pretty though. Let’s clean them up with the practice problems!
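One thing to keep in mind for the problems: the summary above reports NA's, and functions such as mean() return NA whenever their input contains a missing value unless told otherwise. A small sketch with made-up numbers (years is a hypothetical vector, not the real data):

```r
# Hypothetical birth years with one missing value
years <- c(1958, 1966, NA, 1980)

mean(years)                # NA, because one value is missing
mean(years, na.rm = TRUE)  # average of the three non-missing years
## [1] 1968
```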
Practice Problems
1. Give XMP098 (this is the variable for sex) and M0974 (this is the variable for birth year) more informative variable names.
2. What type of data should cd be? Can you convert it to that type?
3. Give the values “1” and “2” of XMP098 informative labels.
4. What is the average birth year?