18 Basic Base R and Tidyverse Data Manipulation Cheat Sheet

This document’s purpose is to serve as a simple reference guide comparing a handful of basic functions in base R and the dplyr package, which is part of the tidyverse. It is designed to help you begin to understand some basic differences between the two as you learn R and come across different styles on the web and elsewhere. It does not provide you with all “data manipulation” (or “wrangling” and “carpentry”) techniques in R or in the tidyverse. Several other packages exist that are part of the tidyverse, such as tidyr, stringr, and purrr, and will support you with key functions and procedures depending on the type of data with which you are working and your analytic needs. In fact, we will most certainly work with these packages during the quarter. An excellent place to start exploring the tidyverse is https://bookdown.org, which contains useful references for a variety of topics.

Finally, you can switch back and forth between base R and tidyverse in your analysis; they are not mutually exclusive. However, several advantages exist in staying consistent in your code.

18.1 Basic Base R and dplyr Functions

18.2 Installation

The first step is to make sure that you have installed and loaded the necessary packages.

# No installation required

install.packages("tidyverse")

# Alternatively, only install dplyr and readr
install.packages("dplyr")
install.packages("readr")

18.3 Loading

You can either install and load tidyverse, or you can load specific packages within it, such as dplyr. Though it is not required for this demonstration, we will use :: to call dplyr before using one of its functions, in part, because some packages have the same name for different functions and we think this is a useful practice when using multiple packages during an analysis, which will be the case most of time.

# No loading required.

# You may load the whole tidyverse
library(tidyverse)

# Alternatively load packages from the tidyverse
library(dplyr)
library(readr)

The tidyverse often utilizes “piping” (%>%) to execute an action. You can think of the %>% as saying “and then” followed by a function or action. The use of this operator is not required, but we will use it here.

For example, you can think of a data frame object as a noun in a sentence, and the functions as verbs. Compare the following two morning routines to see the advantages of using %>%to write programming instructions:

go_work(get_ready(eat(wake_up(you))))

you %>% # 'and then...'
  wake_up() %>% # 'and then...'
  eat() %>% # 'and then...'
  get_ready() %>% # 'and then...'
  go_work() # 'and then...'

Which “sentence” makes more sense in understanding the sequence of events? Some prefer the syntax on the right as it improves readability.

18.4 Import and View Data

Go ahead and load the data set for this walk through, which is a Twitter data set pulled from Twitter’s open API that focuses on the Popular Mobilization Forces (PMF) in Iraq.

Some people feel more comfortable using base R functions to import data with read.csv().

df <- read.csv("data/twitter_pmf.csv",
               header = TRUE)

Here are some simple functions to view your data.

# Get the dimensions of a data frame.
dim(df)  

# Print column names in your data set.
colnames(df)

# Print a few rows of your data set.
head(df) # Top rows
tail(df) # Bottom rows

# Get an object summary
summary(df)

# Observe your data set in a separate window.
View(df)

To import csv files using the tidy framework, use the readr package’s read_csv() function.

tb <- readr::read_csv("data/twitter_pmf.csv",
                      col_names = TRUE)

Inspect you data with glimpse() and others.

# View the data in the console.
tb %>% 
  dplyr::glimpse()

# Describe dimensions.
tb %>% 
  dplyr::dim_desc()

# Use base R descriptive functions.
tb %>%
  head()
tb %>% 
  colnames()
tb %>% 
  summary()

Note that in this cheat sheet we are not assigning the output from each operation into a new object. If the printed output in the console is not enough to compare the differences between the base R and tidy grammar, you may want to assign the output to new objects and inspect it in the viewer like so:

my_df_head <- head(df)

View(my_df_head)

tb %>%
  head() %>%
  View()

18.5 Extracting Variables/Columns

Extracting variables in base R uses the [ accessor in combination with c() to extract the desired columns.

df[, c("screen_name",
       "retweet_count",
       "location", "followers_count")]

You can extract by column index in base R. The numbers indicate the column numbers (e.g., screen_name is column 4 in our data).

# Combine indexes with c()
df[, c(4, 14, 74, 78)]

Use dplyr’s select() function to extract a handful of columns of interest. Note that variable names are unquoted and separated by commas.

tb %>%
  dplyr::select(screen_name,  retweet_count, 
         location, followers_count)

You can use a column index as well. Note you can just declare the column indexes inside the select() function.

tb %>%
  dplyr::select(4, 14, 74, 78)

You could take a look at the top 5 rows of the selected variables by combining functions.

# Combine indexes with c()
head(df[, c(4, 14, 78)],
     n = 5)

   screen_name retweet_count followers_count
1 warmediateam             1            3073
2 warmediateam             2            3073
3 warmediateam             2            3073
4 warmediateam             3            3073
5 warmediateam             3            3073

tb %>%
  dplyr::select(4, 14, 78) %>%
  head(n = 5)

# A tibble: 5 x 3
  screen_name  retweet_count followers_count
  <chr>                <dbl>           <dbl>
1 warmediateam             1            3073
2 warmediateam             2            3073
3 warmediateam             2            3073
4 warmediateam             3            3073
5 warmediateam             3            3073

18.6 Creating Filters/Extracting Rows

Here we will create a filter for all tweets that have been retweeted 10 or more times.

Filtering in base R requires the [ accessor.

# The $ accessor to gets the vector.
df[df$retweet_count >= 10, ]

Additionally, the subset() function produces the same results.

subset(df, retweet_count >= 10)

Create filters using the filter() function.

tb %>%
  dplyr::filter(retweet_count >= 10)

You may use multiple filtering criteria by chaining the arguments with the | (or) and & (and) operators. For instance, add a second filter to return only tweets from the account “warmediateam.”

subset(df, 
       retweet_count >= 10 &
       screen_name == "warmediateam")

tb %>%
  dplyr::filter(retweet_count >= 10 &
           screen_name == "warmediateam")

For a much clearer print out, combine the filtering techniques with the variable selection methods:

# Pass limited data frame to subset().
subset(df[, c(4, 14, 74, 78)], 
       retweet_count >= 10 &
       screen_name == "warmediateam")

tb %>%
  dplyr::filter(retweet_count >= 10 &
           screen_name == "warmediateam") %>%
  dplyr::select(4, 14, 74, 78)

You could take a look at the top rows of the filtered output combining functions.

head(
  subset(df[, c(4, 14, 78)], 
         retweet_count >= 10 &
        screen_name == "warmediateam"),
  n = 5)

    screen_name retweet_count followers_count
17 warmediateam            22            3073
31 warmediateam            13            3073
33 warmediateam            13            3073
36 warmediateam            10            3073
37 warmediateam            10            3073

tb %>%
  dplyr::filter(retweet_count >= 10 &
           screen_name == "warmediateam") %>%
  dplyr::select(4, 14, 78) %>%
  head(n = 5)

# A tibble: 5 x 3
  screen_name  retweet_count followers_count
  <chr>                <dbl>           <dbl>
1 warmediateam            22            3073
2 warmediateam            13            3073
3 warmediateam            13            3073
4 warmediateam            10            3073
5 warmediateam            10            3073

18.7 Arrange Rows

Here we will arrange that data frame alphabetically by screen_name.

Arranging in base R requires the [ accessor and the order() function.

# Use of the $ accessor to order a variable.
df[order(df$screen_name), ]

In dplyr use the arrange() function to sort in ascending order.

tb %>%
  dplyr::arrange(screen_name)

You can clean up the printout by combining functions to select relevant variables and examine the top 5 rows.

head(
  df[order(df$screen_name), 
     c("screen_name", "retweet_count")],
  n = 5)

         screen_name retweet_count
3729 5qxsN9lLdn1Nanl             0
3730 5qxsN9lLdn1Nanl            36
3731 5qxsN9lLdn1Nanl             0
3732 5qxsN9lLdn1Nanl            24
3733 5qxsN9lLdn1Nanl             0

tb %>%
  dplyr::arrange(screen_name) %>%
  dplyr::select(screen_name, retweet_count) %>%
  head(n = 5)

# A tibble: 5 x 2
  screen_name     retweet_count
  <chr>                   <dbl>
1 5qxsN9lLdn1Nanl             0
2 5qxsN9lLdn1Nanl            36
3 5qxsN9lLdn1Nanl             0
4 5qxsN9lLdn1Nanl            24
5 5qxsN9lLdn1Nanl             0

You may reverse the order of the variables arranged.

Set the decreasing argument to TRUE.

df[order(df$screen_name,
         decreasing = TRUE), ]

Include the desc() helper function.

tb %>%
  dplyr::arrange(desc(screen_name))

Once again, for a much clearer print out, combine the arranging techniques with the variable selection methods.

head(
  df[order(df$screen_name, decreasing = TRUE), 
     c(4, 14)],
  n = 5)

       screen_name retweet_count
4036 zaidaliallawi             0
4037 zaidaliallawi             0
4038 zaidaliallawi             0
4039 zaidaliallawi             0
4040 zaidaliallawi             0

tb %>%
  dplyr::arrange(desc(screen_name)) %>%
  dplyr::select(4, 14) %>%
  head(n = 5)

# A tibble: 5 x 2
  screen_name   retweet_count
  <chr>                 <dbl>
1 zaidaliallawi             0
2 zaidaliallawi             0
3 zaidaliallawi             0
4 zaidaliallawi             0
5 zaidaliallawi             0

18.8 Making a New Column/Variable

The new variable we will create here, retweet_success, is for demonstration purposes. The variable will be generated by dividing the retweet_count by the followers_count.

In base R, creating a new column requires using the $ accessor.

# Assign the new variable into a data frame
df$retweet_success <- df$retweet_count/
  df$followers_count

To create a column in dplyr use the mutate() function.

tb %>%
  dplyr::mutate(retweet_success = retweet_count/
                  followers_count)

Once again, you may want to combine functions to get a clearer print out. Here, we will combine how to create and arrange a new variable, and select a handful of columns.

# Create new variable
df$retweet_success <- df$retweet_count/
  df$followers_count
# Rearrange, select, and return top 5
head(
  df[order(df$retweet_success, decreasing = TRUE),
     c("screen_name", "retweet_success")],
  n = 5)

         screen_name retweet_success
3930 MzcCY48mFeyS1ly             Inf
2241 San_Patricio_BN        3342.643
9591         M_ska98        1613.333
4981          b76142         560.000
4148 8KXS2Iq3JAPEDvi         245.500

tb %>%
  # Create new variable
  dplyr::mutate(retweet_success = retweet_count/
                  followers_count) %>%
  # Rearrange, select, and return top 5
  dplyr::arrange(desc(retweet_success)) %>%
  dplyr::select(screen_name, retweet_success) %>%
  head(n = 5)

# A tibble: 5 x 2
  screen_name     retweet_success
  <chr>                     <dbl>
1 MzcCY48mFeyS1ly            Inf 
2 San_Patricio_BN           3343.
3 M_ska98                   1613.
4 b76142                     560 
5 8KXS2Iq3JAPEDvi            246.

18.9 Rename Variables

Let’s now explore how to rename variables. Begin by taking a look at the column names, a straightforward way of doing so is using the base R’s colnames() function.

colnames(df)

tb %>% colnames()

Now we will select several columns, and then rename them.

names(df)[names(df) == "screen_name"] <- "Screen_Name"

names(df)[names(df) == "retweet_count"] <- "N_Retweets"

head(df[, c("Screen_Name","N_Retweets")], 
     n = 5)

   Screen_Name N_Retweets
1 warmediateam          1
2 warmediateam          2
3 warmediateam          2
4 warmediateam          3
5 warmediateam          3

tb %>% 
  dplyr::select(screen_name, 
                retweet_count) %>%
  dplyr::rename(Screen_Name = screen_name,
                N_Retweets = retweet_count) %>%
  head(n = 5)

# A tibble: 5 x 2
  Screen_Name  N_Retweets
  <chr>             <dbl>
1 warmediateam          1
2 warmediateam          2
3 warmediateam          2
4 warmediateam          3
5 warmediateam          3

18.10 Summarizing

Summary statistics are really useful in describing your data. This can be done both in base R and in dplyr. Here we will explore how to execute simple summaries first; then, we will move on to calculate group summaries.

18.10.1 Simple Summary

To summarize in base R, you will need generic functions to calculate these statistics. For example, mean(), median(), sum(), etc.

# Create a data frame with the summary.
data.frame(rt_avg = mean(df$retweet_count))

    rt_avg
1 245.2969

To summarize data in dplyr, use the summarize() function to compute a requested summary (e.g., mean(), median(), n(), etc.).

tb %>%
  dplyr::summarize(rt_avg = mean(retweet_count))

# A tibble: 1 x 1
  rt_avg
   <dbl>
1   245.

18.10.2 Group Summary

Summarizing can be expanded by computing the statistics by groups. For instance, here we will get the retweet count of each individual; thus, the group would be each screen_name .

Use the aggregate() function to slip the data into subsets and then proceed to compute summary statistics for each.

aggregate(x = df$retweet_count,
          by = list(screen_name = df$screen_name),
          FUN = mean)

Group a data frame with group_by() and perform group operations by adding the summarize() function.

tb %>%
  dplyr::group_by(screen_name) %>%
  dplyr::summarize(rt_avg = mean(retweet_count))

Once again, you may want to combine some functions to make your output more legible.

out <- aggregate(x = df$retweet_count,
                 by = list(screen_name=df$screen_name),
                 FUN = mean)
head(out[order(out$x, decreasing = TRUE), ], 
     n = 5)

        screen_name        x
102      TheEisaAli 4909.646
43         EmmaDaly 4481.640
36       DCrising21 2916.490
97  San_Patricio_BN 1581.717
74      MikeyKayNYC 1506.260

tb %>%
  dplyr::group_by(screen_name) %>%
  dplyr::summarize(rt_avg = mean(retweet_count)) %>%
  arrange(desc(rt_avg)) %>%
  head(n = 5)

# A tibble: 5 x 2
  screen_name     rt_avg
  <chr>            <dbl>
1 TheEisaAli       4910.
2 EmmaDaly         4482.
3 DCrising21       2916.
4 San_Patricio_BN  1582.
5 MikeyKayNYC      1506.

18.11 Combining Data

Up to this point, we have only covered single table functions. However, when data arrives in many pieces you may need to combine these to complete your analysis.

Here we will bring a second data set, which contains social network analysis metrics for each user account listed in the initial data set.

Once again, you can read the new data set with base R or readr.

sna_df <- read.csv("data/SNA_Stats.csv")

sna_tb <- readr::read_csv("data/SNA_Stats.csv")

Inspect both data sets:

dim(sna_df)

[1] 4146    7

colnames(sna_df)

[1] "screen_name"                "componentnumber"           
[3] "Eccentricity"               "closnesscentrality"        
[5] "harmonicclosnesscentrality" "betweenesscentrality"      
[7] "modularity_class"

sna_tb %>% dplyr::dim_desc()

[1] "[4,146 x 7]"

sna_tb %>% colnames()

[1] "screen_name"                "componentnumber"           
[3] "Eccentricity"               "closnesscentrality"        
[5] "harmonicclosnesscentrality" "betweenesscentrality"      
[7] "modularity_class"

In order to connect the tables, a pair of variables called keys are required. One way to identify keys, is to seek for identically named variables in both data sets. This may not always be viable if the keys are named differently in each data set.

Matching data frame names can be accomplished as follows:

intersect(names(df), names(sna_df))

names(tb) %>% dplyr::intersect(names(sna_df))

18.11.1 Left Join

What you see below is a join that retains all rows from the first table.

Use merge() to join. Note that all.x = TRUE tells R to keep all observations from the first table.

merge(df, sna_df, 
      by = "screen_name",
      all.x = TRUE)

The left_join() function retains all rows from the tb data frame, while adding data from sna_tb.

tb %>%
  dplyr::left_join(sna_tb,
                   by = "screen_name")

Take a look at the dimensions of the output.

dim(
  merge(df, sna_df, 
        by = "screen_name",
        all.x = TRUE))

[1] 10542   100

tb %>%
  dplyr::left_join(sna_tb,
                   by = "screen_name") %>%
  dplyr::dim_desc()

[1] "[10,542 x 97]"

18.11.2 Right Join

A right join retains all rows from the second table.

Use merge() to join. Note that all.y = TRUE tells R to keep all observations from the second table.

merge(df, sna_df,
      by = "screen_name",
      all.y = TRUE)

The right_join() function retains all rows from the sna_tb data frame, while adding tb.

tb %>%
  dplyr::right_join(sna_tb, 
                    by = "screen_name")

Take a look at the dimensions of the output.

dim(
  merge(df, sna_df,
      by = "screen_name",
      all.y = TRUE))

[1] 14065   100

tb %>%
  dplyr::right_join(sna_tb, 
                    by = "screen_name") %>%
  dplyr::dim_desc()

[1] "[14,065 x 97]"

18.11.3 Inner Join

An inner join retain rows with matches in both tables. SNA_Stats.csv, for example, may not include isolates so we would expect fewer results after comparing dimensions.

Use the merge() function to inner join.

merge(df, sna_df,
      by = "screen_name")

inner_join() merges based on screen_name.

tb %>%
  dplyr::inner_join(sna_tb, by = "screen_name")

Take a look at the dimensions of the output.

dim(
  merge(df, sna_df,
      by = "screen_name"))

[1] 10027   100

tb %>%
  dplyr::inner_join(sna_tb, by = "screen_name") %>%
  dplyr::dim_desc()

[1] "[10,027 x 97]"

18.11.4 Full Join

A full join retains all rows in both data sets, regardless of matches.

Use the merge() function to full join. Note the all = TRUE argument.

merge(df, sna_df,
      by = "screen_name",
      all = TRUE)

The full_join() function will merge based on shared screen_name.

tb %>%
  dplyr::full_join(sna_tb,
                   by = "screen_name")

Take a look at the dimensions of the output.

dim(
  merge(df, sna_df,
      by = "screen_name",
      all = TRUE))

[1] 14580   100

tb %>%
  dplyr::full_join(sna_tb,
                   by = "screen_name") %>%
  dplyr::dim_desc()

[1] "[14,580 x 97]"

18.12 Piping Multiple Variables

As you can see, the %>% operator is a great way to execute multiple actions in a few lines of code. This last example is meant to show how much can be done by chaining multiple functions with %>%. For example, say we wanted to identify the most retweeted users in our original data set. To do so, we will have to group rows by screen_name using group_by(), then add the number of retweets per account, filter() by the average number of retweets (n = 245), arrange() the total, and finally rename the variables.

tb %>%
  dplyr::group_by(screen_name) %>%
  dplyr::summarise(retweet_total = sum(retweet_count)) %>%
  dplyr::filter(retweet_total >= 245) %>%
  dplyr::arrange(desc(retweet_total)) %>%
  dplyr::rename(Screen_Name = screen_name,
                N_Retweets = retweet_total)

# A tibble: 89 x 2
   Screen_Name     N_Retweets
   <chr>                <dbl>
 1 TheEisaAli          486055
 2 EmmaDaly            448164
 3 DCrising21          291649
 4 MikeyKayNYC         150626
 5 realsohelbahjat     145546
 6 San_Patricio_BN     145518
 7 m_al_asiri           81549
 8 amnesty              80147
 9 Protectthenhs        72937
10 YouTube              62145
# ... with 79 more rows

We can clean up the output to only present the top accounts by adding dplyr’s top_n() function.

tb %>%
  dplyr::group_by(screen_name) %>%
  dplyr::summarise(retweet_total = sum(retweet_count)) %>%
  dplyr::filter(retweet_total >= 245) %>%
  dplyr::arrange(desc(retweet_total)) %>%
  dplyr::rename(Screen_Name = screen_name,
                N_Retweets = retweet_total) %>%
  dplyr::top_n(15)

# A tibble: 15 x 2
   Screen_Name     N_Retweets
   <chr>                <dbl>
 1 TheEisaAli          486055
 2 EmmaDaly            448164
 3 DCrising21          291649
 4 MikeyKayNYC         150626
 5 realsohelbahjat     145546
 6 San_Patricio_BN     145518
 7 m_al_asiri           81549
 8 amnesty              80147
 9 Protectthenhs        72937
10 YouTube              62145
11 nafarrao             44631
12 SulomeAnderson       43000
13 IraqiSecurity        38478
14 USEmbBaghdad         28770
15 The_H16              27462