Data in Tables (reference card)

Reading data

CSV data

library(readr)

df <- read_csv("data.csv")

TSV data

library(readr)

df <- read_tsv("data.tsv")

Properly reading in null values

By default, read_csv and read_tsv treat empty cells and cells containing "NA" as NA. Use the na argument to change which values are read in as NA:

library(readr)

# Only read "-999" as NA
df <- read_tsv("data.tsv", na = c("-999"))

# Read in empty values, "NA", & "-999" as NA
df <- read_tsv("data.tsv", na = c("", "NA", "-999"))
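
To see the effect of the na argument without a file on disk, the sketch below parses a small inline table. The table contents and the column names site and temp are made up for illustration, and passing literal data wrapped in I() assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical inline data; I() tells readr to treat the string as file contents
raw <- I("site,temp\nA,12.5\nB,-999\nC,\n")

# Default: only empty cells and "NA" become NA, so -999 stays a number
df_default <- read_csv(raw, show_col_types = FALSE)

# Custom: empty cells, "NA", and "-999" all become NA
df_custom <- read_csv(raw, na = c("", "NA", "-999"), show_col_types = FALSE)

sum(is.na(df_default$temp))  # 1 (only the empty cell)
sum(is.na(df_custom$temp))   # 2 (the empty cell and -999)
```

Note that na replaces the default list rather than adding to it, which is why "" and "NA" are repeated in the custom call.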

Basic dplyr

library(dplyr)

Select columns (select)

df_with_selected_columns <- select(df, col_name_1, col_name_2)

Add a new column (mutate)

df_with_new_column <- mutate(df, col_name_3 = col_name_1 * col_name_2)

Sort rows (arrange)

Sort ascending by col_name_1 and then descending by col_name_2:

sorted_df <- arrange(df, col_name_1, desc(col_name_2))
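
A small worked example may make the two-level sort easier to picture. The data below is hypothetical; the second column only breaks ties within equal values of the first.

```r
library(dplyr)

# Hypothetical example data with ties in col_name_1
df <- tibble(col_name_1 = c(2, 1, 2, 1), col_name_2 = c(5, 8, 9, 3))

# Ascending by col_name_1, then descending by col_name_2 within each tie
sorted_df <- arrange(df, col_name_1, desc(col_name_2))

sorted_df$col_name_1  # 1 1 2 2
sorted_df$col_name_2  # 8 3 9 5
```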

Filter out rows not matching conditions (filter)

One condition

filtered_df <- filter(df, col_name_2 == "A")
filtered_df <- filter(df, col_name_1 > 5)

More than one condition (and)

Keep rows where col_name_2 is "A" and col_name_1 is greater than 5:

filtered_df <- filter(df, col_name_2 == "A", col_name_1 > 5)

More than one condition (or)

Keep only rows where col_name_2 is either "A" or "B":

filtered_df <- filter(df, col_name_2 == "A" | col_name_2 == "B")
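
When one column is compared against several values, %in% is a common shorthand for chaining conditions with |. A minimal sketch on a made-up table (the tibble contents are illustrative only):

```r
library(dplyr)

# Hypothetical example data
df <- tibble(col_name_1 = c(3, 7, 9, 2), col_name_2 = c("A", "B", "C", "A"))

# Keeps the same rows as col_name_2 == "A" | col_name_2 == "B"
filtered_df <- filter(df, col_name_2 %in% c("A", "B"))

nrow(filtered_df)  # 3
```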

Remove null values (NA)

To create a table without rows that have NA values in any column (drop_na comes from the tidyr package):

library(tidyr)

df_no_na <- drop_na(df)

To drop only the rows that have NA values in specific columns, also list the names of those columns:

df_no_na <- drop_na(df, col_name_to_drop_if_na)
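
The difference between the two forms shows up when NA values sit in different columns. The sketch below uses a hypothetical four-row table to compare them.

```r
library(dplyr)
library(tidyr)

# Hypothetical example data with an NA in each column
df <- tibble(
  col_name_1 = c(1, NA, 3, 4),
  col_name_2 = c("A", "B", NA, "D")
)

nrow(drop_na(df))              # 2 (rows with an NA in any column are removed)
nrow(drop_na(df, col_name_1))  # 3 (only rows with an NA in col_name_1 are removed)
```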

Combining data manipulations

Intermediate variables

Store each step in a new variable and then use that variable name in the next step.

selected_df <- select(df, col_name_1, col_name_2)
filtered_selected_df <- filter(selected_df, col_name_1 == "A")
no_na_filtered_selected_df <- drop_na(filtered_selected_df)

Pipes

Pipes (|>) pass the output of the command on the left of the pipe as the first argument to the function on the right of the pipe. So instead of storing results in intermediate variables, we can pass them through pipes. The native |> pipe requires R 4.1 or later.

no_na_filtered_selected_df <- df |>
  select(col_name_1, col_name_2) |>
  filter(col_name_1 == "A") |>
  drop_na()
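
As a rough check that a pipeline is just another way of writing nested function calls, the sketch below runs both forms on a hypothetical three-row table and compares the results.

```r
library(dplyr)
library(tidyr)

# Hypothetical example data
df <- tibble(
  col_name_1 = c("A", "A", "B"),
  col_name_2 = c(10, NA, 30),
  col_name_3 = c(TRUE, FALSE, TRUE)
)

# Nested calls: read from the inside out
nested <- drop_na(filter(select(df, col_name_1, col_name_2), col_name_1 == "A"))

# Piped calls: read from top to bottom
piped <- df |>
  select(col_name_1, col_name_2) |>
  filter(col_name_1 == "A") |>
  drop_na()

all.equal(nested, piped)  # TRUE
nrow(piped)               # 1
```

Both forms do the same work; the piped version lists the steps in the order they run, which tends to be easier to read as pipelines grow.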