18 Basic Base R and Tidyverse Data Manipulation Cheat Sheet
This document’s purpose is to serve as a simple reference guide comparing a handful of basic functions in base R and the dplyr package, which is part of the tidyverse. It is designed to help you begin to understand some basic differences between the two as you learn R and come across different styles on the web and elsewhere. It does not provide you with all “data manipulation” (or “wrangling” and “carpentry”) techniques in R or in the tidyverse. Several other packages exist that are part of the tidyverse, such as tidyr, stringr, and purrr, and will support you with key functions and procedures depending on the type of data with which you are working and your analytic needs. In fact, we will most certainly work with these packages during the quarter. An excellent place to start exploring the tidyverse is https://bookdown.org, which contains useful references for a variety of topics.
Finally, you can switch back and forth between base R and tidyverse in your analysis; they are not mutually exclusive. However, several advantages exist in staying consistent in your code.
18.1 Basic Base R and dplyr Functions
18.2 Installation
The first step is to make sure that you have installed and loaded the necessary packages.
# No installation required
install.packages("tidyverse")
# Alternatively, only install dplyr and readr
install.packages("dplyr")
install.packages("readr")
18.3 Loading
You can either install and load tidyverse, or you can load specific
packages within it, such as dplyr. Though it is not required for
this demonstration, we will use ::
to call dplyr before using one
of its functions, in part, because some packages have the same name for
different functions and we think this is a useful practice when using
multiple packages during an analysis, which will be the case most of
time.
# No loading required.
# You may load the whole tidyverse
library(tidyverse)
# Alternatively load packages from the tidyverse
library(dplyr)
library(readr)
The tidyverse often utilizes “piping” (%>%
) to execute an action.
You can think of the %>%
as saying “and then” followed by a function
or action. The use of this operator is not required, but we will use it
here.
For example, you can think of a data frame object as a noun in a
sentence, and the functions as verbs. Compare the following two morning
routines to see the advantages of using %>%
to write programming
instructions:
go_work(get_ready(eat(wake_up(you))))
%>% # 'and then...'
you wake_up() %>% # 'and then...'
eat() %>% # 'and then...'
get_ready() %>% # 'and then...'
go_work() # 'and then...'
Which “sentence” makes more sense in understanding the sequence of events? Some prefer the syntax on the right as it improves readability.
18.4 Import and View Data
Go ahead and load the data set for this walk through, which is a Twitter data set pulled from Twitter’s open API that focuses on the Popular Mobilization Forces (PMF) in Iraq.
Some people feel more comfortable using base R functions to import data
with read.csv()
.
<- read.csv("data/twitter_pmf.csv",
df header = TRUE)
Here are some simple functions to view your data.
# Get the dimensions of a data frame.
dim(df)
# Print column names in your data set.
colnames(df)
# Print a few rows of your data set.
head(df) # Top rows
tail(df) # Bottom rows
# Get an object summary
summary(df)
# Observe your data set in a separate window.
View(df)
To import csv files using the tidy framework, use the readr
package’s read_csv()
function.
<- readr::read_csv("data/twitter_pmf.csv",
tb col_names = TRUE)
Inspect you data with glimpse()
and others.
# View the data in the console.
%>%
tb ::glimpse()
dplyr
# Describe dimensions.
%>%
tb ::dim_desc()
dplyr
# Use base R descriptive functions.
%>%
tb head()
%>%
tb colnames()
%>%
tb summary()
Note that in this cheat sheet we are not assigning the output from each operation into a new object. If the printed output in the console is not enough to compare the differences between the base R and tidy grammar, you may want to assign the output to new objects and inspect it in the viewer like so:
<- head(df)
my_df_head
View(my_df_head)
%>%
tb head() %>%
View()
18.5 Extracting Variables/Columns
Extracting variables in base R uses the [
accessor in combination with
c()
to extract the desired columns.
c("screen_name",
df[, "retweet_count",
"location", "followers_count")]
You can extract by column index in base R. The numbers indicate the
column numbers (e.g., screen_name
is column 4 in our data).
# Combine indexes with c()
c(4, 14, 74, 78)] df[,
Use dplyr’s select()
function to extract a handful of columns of
interest. Note that variable names are unquoted and separated by commas.
%>%
tb ::select(screen_name, retweet_count,
dplyr location, followers_count)
You can use a column index as well. Note you can just declare the column
indexes inside the select()
function.
%>%
tb ::select(4, 14, 74, 78) dplyr
You could take a look at the top 5 rows of the selected variables by combining functions.
# Combine indexes with c()
head(df[, c(4, 14, 78)],
n = 5)
screen_name retweet_count followers_count
1 warmediateam 1 3073
2 warmediateam 2 3073
3 warmediateam 2 3073
4 warmediateam 3 3073
5 warmediateam 3 3073
%>%
tb ::select(4, 14, 78) %>%
dplyrhead(n = 5)
# A tibble: 5 x 3
screen_name retweet_count followers_count
<chr> <dbl> <dbl>
1 warmediateam 1 3073
2 warmediateam 2 3073
3 warmediateam 2 3073
4 warmediateam 3 3073
5 warmediateam 3 3073
18.6 Creating Filters/Extracting Rows
Here we will create a filter for all tweets that have been retweeted 10 or more times.
Filtering in base R requires the [
accessor.
# The $ accessor to gets the vector.
$retweet_count >= 10, ] df[df
Additionally, the subset()
function produces the same results.
subset(df, retweet_count >= 10)
Create filters using the filter()
function.
%>%
tb ::filter(retweet_count >= 10) dplyr
You may use multiple filtering criteria by chaining the arguments with
the |
(or) and &
(and) operators. For instance, add a second filter
to return only tweets from the account “warmediateam.”
subset(df,
>= 10 &
retweet_count == "warmediateam") screen_name
%>%
tb ::filter(retweet_count >= 10 &
dplyr== "warmediateam") screen_name
For a much clearer print out, combine the filtering techniques with the variable selection methods:
# Pass limited data frame to subset().
subset(df[, c(4, 14, 74, 78)],
>= 10 &
retweet_count == "warmediateam") screen_name
%>%
tb ::filter(retweet_count >= 10 &
dplyr== "warmediateam") %>%
screen_name ::select(4, 14, 74, 78) dplyr
You could take a look at the top rows of the filtered output combining functions.
head(
subset(df[, c(4, 14, 78)],
>= 10 &
retweet_count == "warmediateam"),
screen_name n = 5)
screen_name retweet_count followers_count
17 warmediateam 22 3073
31 warmediateam 13 3073
33 warmediateam 13 3073
36 warmediateam 10 3073
37 warmediateam 10 3073
%>%
tb ::filter(retweet_count >= 10 &
dplyr== "warmediateam") %>%
screen_name ::select(4, 14, 78) %>%
dplyrhead(n = 5)
# A tibble: 5 x 3
screen_name retweet_count followers_count
<chr> <dbl> <dbl>
1 warmediateam 22 3073
2 warmediateam 13 3073
3 warmediateam 13 3073
4 warmediateam 10 3073
5 warmediateam 10 3073
18.7 Arrange Rows
Here we will arrange that data frame alphabetically by screen_name
.
Arranging in base R requires the [
accessor and the order()
function.
# Use of the $ accessor to order a variable.
order(df$screen_name), ] df[
In dplyr use the arrange()
function to sort in ascending order.
%>%
tb ::arrange(screen_name) dplyr
You can clean up the printout by combining functions to select relevant variables and examine the top 5 rows.
head(
order(df$screen_name),
df[c("screen_name", "retweet_count")],
n = 5)
screen_name retweet_count
3729 5qxsN9lLdn1Nanl 0
3730 5qxsN9lLdn1Nanl 36
3731 5qxsN9lLdn1Nanl 0
3732 5qxsN9lLdn1Nanl 24
3733 5qxsN9lLdn1Nanl 0
%>%
tb ::arrange(screen_name) %>%
dplyr::select(screen_name, retweet_count) %>%
dplyrhead(n = 5)
# A tibble: 5 x 2
screen_name retweet_count
<chr> <dbl>
1 5qxsN9lLdn1Nanl 0
2 5qxsN9lLdn1Nanl 36
3 5qxsN9lLdn1Nanl 0
4 5qxsN9lLdn1Nanl 24
5 5qxsN9lLdn1Nanl 0
You may reverse the order of the variables arranged.
Set the decreasing
argument to TRUE
.
order(df$screen_name,
df[decreasing = TRUE), ]
Include the desc()
helper function.
%>%
tb ::arrange(desc(screen_name)) dplyr
Once again, for a much clearer print out, combine the arranging techniques with the variable selection methods.
head(
order(df$screen_name, decreasing = TRUE),
df[c(4, 14)],
n = 5)
screen_name retweet_count
4036 zaidaliallawi 0
4037 zaidaliallawi 0
4038 zaidaliallawi 0
4039 zaidaliallawi 0
4040 zaidaliallawi 0
%>%
tb ::arrange(desc(screen_name)) %>%
dplyr::select(4, 14) %>%
dplyrhead(n = 5)
# A tibble: 5 x 2
screen_name retweet_count
<chr> <dbl>
1 zaidaliallawi 0
2 zaidaliallawi 0
3 zaidaliallawi 0
4 zaidaliallawi 0
5 zaidaliallawi 0
18.8 Making a New Column/Variable
The new variable we will create here, retweet_success
, is for
demonstration purposes. The variable will be generated by dividing the
retweet_count
by the followers_count
.
In base R, creating a new column requires using the $
accessor.
# Assign the new variable into a data frame
$retweet_success <- df$retweet_count/
df$followers_count df
To create a column in dplyr use the mutate()
function.
%>%
tb ::mutate(retweet_success = retweet_count/
dplyr followers_count)
Once again, you may want to combine functions to get a clearer print out. Here, we will combine how to create and arrange a new variable, and select a handful of columns.
# Create new variable
$retweet_success <- df$retweet_count/
df$followers_count
df# Rearrange, select, and return top 5
head(
order(df$retweet_success, decreasing = TRUE),
df[c("screen_name", "retweet_success")],
n = 5)
screen_name retweet_success
3930 MzcCY48mFeyS1ly Inf
2241 San_Patricio_BN 3342.643
9591 M_ska98 1613.333
4981 b76142 560.000
4148 8KXS2Iq3JAPEDvi 245.500
%>%
tb # Create new variable
::mutate(retweet_success = retweet_count/
dplyr%>%
followers_count) # Rearrange, select, and return top 5
::arrange(desc(retweet_success)) %>%
dplyr::select(screen_name, retweet_success) %>%
dplyrhead(n = 5)
# A tibble: 5 x 2
screen_name retweet_success
<chr> <dbl>
1 MzcCY48mFeyS1ly Inf
2 San_Patricio_BN 3343.
3 M_ska98 1613.
4 b76142 560
5 8KXS2Iq3JAPEDvi 246.
18.9 Rename Variables
Let’s now explore how to rename variables. Begin by taking a look at the
column names, a straightforward way of doing so is using the base R’s
colnames()
function.
colnames(df)
%>% colnames() tb
Now we will select several columns, and then rename them.
names(df)[names(df) == "screen_name"] <- "Screen_Name"
names(df)[names(df) == "retweet_count"] <- "N_Retweets"
head(df[, c("Screen_Name","N_Retweets")],
n = 5)
Screen_Name N_Retweets
1 warmediateam 1
2 warmediateam 2
3 warmediateam 2
4 warmediateam 3
5 warmediateam 3
%>%
tb ::select(screen_name,
dplyr%>%
retweet_count) ::rename(Screen_Name = screen_name,
dplyrN_Retweets = retweet_count) %>%
head(n = 5)
# A tibble: 5 x 2
Screen_Name N_Retweets
<chr> <dbl>
1 warmediateam 1
2 warmediateam 2
3 warmediateam 2
4 warmediateam 3
5 warmediateam 3
18.10 Summarizing
Summary statistics are really useful in describing your data. This can be done both in base R and in dplyr. Here we will explore how to execute simple summaries first; then, we will move on to calculate group summaries.
18.10.1 Simple Summary
To summarize in base R, you will need generic functions to calculate
these statistics. For example, mean()
, median()
, sum()
, etc.
# Create a data frame with the summary.
data.frame(rt_avg = mean(df$retweet_count))
rt_avg
1 245.2969
To summarize data in dplyr, use the summarize()
function to
compute a requested summary (e.g., mean()
, median()
, n()
, etc.).
%>%
tb ::summarize(rt_avg = mean(retweet_count)) dplyr
# A tibble: 1 x 1
rt_avg
<dbl>
1 245.
18.10.2 Group Summary
Summarizing can be expanded by computing the statistics by groups. For
instance, here we will get the retweet count of each individual; thus,
the group would be each screen_name
.
Use the aggregate()
function to slip the data into subsets and then
proceed to compute summary statistics for each.
aggregate(x = df$retweet_count,
by = list(screen_name = df$screen_name),
FUN = mean)
Group a data frame with group_by()
and perform group operations by
adding the summarize()
function.
%>%
tb ::group_by(screen_name) %>%
dplyr::summarize(rt_avg = mean(retweet_count)) dplyr
Once again, you may want to combine some functions to make your output more legible.
<- aggregate(x = df$retweet_count,
out by = list(screen_name=df$screen_name),
FUN = mean)
head(out[order(out$x, decreasing = TRUE), ],
n = 5)
screen_name x
102 TheEisaAli 4909.646
43 EmmaDaly 4481.640
36 DCrising21 2916.490
97 San_Patricio_BN 1581.717
74 MikeyKayNYC 1506.260
%>%
tb ::group_by(screen_name) %>%
dplyr::summarize(rt_avg = mean(retweet_count)) %>%
dplyrarrange(desc(rt_avg)) %>%
head(n = 5)
# A tibble: 5 x 2
screen_name rt_avg
<chr> <dbl>
1 TheEisaAli 4910.
2 EmmaDaly 4482.
3 DCrising21 2916.
4 San_Patricio_BN 1582.
5 MikeyKayNYC 1506.
18.11 Combining Data
Up to this point, we have only covered single table functions. However, when data arrives in many pieces you may need to combine these to complete your analysis.
Here we will bring a second data set, which contains social network analysis metrics for each user account listed in the initial data set.
Once again, you can read the new data set with base R or readr.
<- read.csv("data/SNA_Stats.csv") sna_df
<- readr::read_csv("data/SNA_Stats.csv") sna_tb
Inspect both data sets:
dim(sna_df)
[1] 4146 7
colnames(sna_df)
[1] "screen_name" "componentnumber"
[3] "Eccentricity" "closnesscentrality"
[5] "harmonicclosnesscentrality" "betweenesscentrality"
[7] "modularity_class"
%>% dplyr::dim_desc() sna_tb
[1] "[4,146 x 7]"
%>% colnames() sna_tb
[1] "screen_name" "componentnumber"
[3] "Eccentricity" "closnesscentrality"
[5] "harmonicclosnesscentrality" "betweenesscentrality"
[7] "modularity_class"
In order to connect the tables, a pair of variables called keys are required. One way to identify keys, is to seek for identically named variables in both data sets. This may not always be viable if the keys are named differently in each data set.
Matching data frame names can be accomplished as follows:
intersect(names(df), names(sna_df))
names(tb) %>% dplyr::intersect(names(sna_df))
18.11.1 Left Join
What you see below is a join that retains all rows from the first table.
Use merge()
to join. Note that all.x = TRUE
tells R to keep all
observations from the first table.
merge(df, sna_df,
by = "screen_name",
all.x = TRUE)
The left_join()
function retains all rows from the tb
data frame,
while adding data from sna_tb
.
%>%
tb ::left_join(sna_tb,
dplyrby = "screen_name")
Take a look at the dimensions of the output.
dim(
merge(df, sna_df,
by = "screen_name",
all.x = TRUE))
[1] 10542 100
%>%
tb ::left_join(sna_tb,
dplyrby = "screen_name") %>%
::dim_desc() dplyr
[1] "[10,542 x 97]"
18.11.2 Right Join
A right join retains all rows from the second table.
Use merge()
to join. Note that all.y = TRUE
tells R to keep all
observations from the second table.
merge(df, sna_df,
by = "screen_name",
all.y = TRUE)
The right_join()
function retains all rows from the sna_tb
data
frame, while adding tb
.
%>%
tb ::right_join(sna_tb,
dplyrby = "screen_name")
Take a look at the dimensions of the output.
dim(
merge(df, sna_df,
by = "screen_name",
all.y = TRUE))
[1] 14065 100
%>%
tb ::right_join(sna_tb,
dplyrby = "screen_name") %>%
::dim_desc() dplyr
[1] "[14,065 x 97]"
18.11.3 Inner Join
An inner join retain rows with matches in both tables. SNA_Stats.csv
,
for example, may not include isolates so we would expect fewer results
after comparing dimensions.
Use the merge()
function to inner join.
merge(df, sna_df,
by = "screen_name")
inner_join()
merges based on screen_name
.
%>%
tb ::inner_join(sna_tb, by = "screen_name") dplyr
Take a look at the dimensions of the output.
dim(
merge(df, sna_df,
by = "screen_name"))
[1] 10027 100
%>%
tb ::inner_join(sna_tb, by = "screen_name") %>%
dplyr::dim_desc() dplyr
[1] "[10,027 x 97]"
18.11.4 Full Join
A full join retains all rows in both data sets, regardless of matches.
Use the merge()
function to full join. Note the all = TRUE
argument.
merge(df, sna_df,
by = "screen_name",
all = TRUE)
The full_join()
function will merge based on shared screen_name
.
%>%
tb ::full_join(sna_tb,
dplyrby = "screen_name")
Take a look at the dimensions of the output.
dim(
merge(df, sna_df,
by = "screen_name",
all = TRUE))
[1] 14580 100
%>%
tb ::full_join(sna_tb,
dplyrby = "screen_name") %>%
::dim_desc() dplyr
[1] "[14,580 x 97]"
18.12 Piping Multiple Variables
As you can see, the %>%
operator is a great way to execute multiple
actions in a few lines of code. This last example is meant to show how
much can be done by chaining multiple functions with %>%
. For example,
say we wanted to identify the most retweeted users in our original data
set. To do so, we will have to group rows by screen_name
using
group_by()
, then add the number of retweets per account, filter()
by
the average number of retweets (n = 245), arrange()
the total, and
finally rename the variables.
%>%
tb ::group_by(screen_name) %>%
dplyr::summarise(retweet_total = sum(retweet_count)) %>%
dplyr::filter(retweet_total >= 245) %>%
dplyr::arrange(desc(retweet_total)) %>%
dplyr::rename(Screen_Name = screen_name,
dplyrN_Retweets = retweet_total)
# A tibble: 89 x 2
Screen_Name N_Retweets
<chr> <dbl>
1 TheEisaAli 486055
2 EmmaDaly 448164
3 DCrising21 291649
4 MikeyKayNYC 150626
5 realsohelbahjat 145546
6 San_Patricio_BN 145518
7 m_al_asiri 81549
8 amnesty 80147
9 Protectthenhs 72937
10 YouTube 62145
# ... with 79 more rows
We can clean up the output to only present the top accounts by adding
dplyr’s top_n()
function.
%>%
tb ::group_by(screen_name) %>%
dplyr::summarise(retweet_total = sum(retweet_count)) %>%
dplyr::filter(retweet_total >= 245) %>%
dplyr::arrange(desc(retweet_total)) %>%
dplyr::rename(Screen_Name = screen_name,
dplyrN_Retweets = retweet_total) %>%
::top_n(15) dplyr
# A tibble: 15 x 2
Screen_Name N_Retweets
<chr> <dbl>
1 TheEisaAli 486055
2 EmmaDaly 448164
3 DCrising21 291649
4 MikeyKayNYC 150626
5 realsohelbahjat 145546
6 San_Patricio_BN 145518
7 m_al_asiri 81549
8 amnesty 80147
9 Protectthenhs 72937
10 YouTube 62145
11 nafarrao 44631
12 SulomeAnderson 43000
13 IraqiSecurity 38478
14 USEmbBaghdad 28770
15 The_H16 27462