Grouping & Joining Data (reference card)

Aggregating data

Basic grouping and summarizing

  • Aggregation combines rows into groups based on the values in one or more columns and then calculates summary statistics for each group
  • First group the data using group_by() and then use summarize() to calculate the summary statistics
library(dplyr)
library(readr)  # provides read_csv()

df <- read_csv("data.csv")

df |>
  group_by(col_name_1) |>                            # column that defines the groups
  summarize(mean = mean(col_name_2), count = n())    # summarize a different, numeric column
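
  • A minimal worked sketch of the same pattern on made-up data; the data frame scores and its columns team and score are hypothetical names used only for illustration

```r
library(dplyr)

# Hypothetical toy data: three scores spread over two teams
scores <- data.frame(
  team  = c("a", "a", "b"),
  score = c(10, 20, 30)
)

# One output row per team, with the mean score and the number of rows
team_summary <- scores |>
  group_by(team) |>
  summarize(mean = mean(score), count = n())

team_summary
# team "a": mean 15, count 2
# team "b": mean 30, count 1
```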

Grouping by multiple columns

  • You can group by multiple columns by passing multiple column names to group_by()
df |>
  group_by(col_name_1, col_name_2) |>
  summarize(mean = mean(col_name_3), count = n())

Ungrouping

  • After grouping by more than one column, summarize() by default removes only the last grouping level, so the result is still grouped by the remaining columns
  • You can remove the remaining grouping with ungroup()
df |>
  group_by(col_name_1, col_name_2) |>
  summarize(mean = mean(col_name_3), count = n()) |>
  ungroup()
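
  • A small sketch, on hypothetical data (sales, region, year, amount are made-up names), of how to check whether a summarized result is still grouped. It also shows summarize()'s .groups = "drop" argument, which should give the same ungrouped result as a trailing ungroup()

```r
library(dplyr)

# Hypothetical toy data
sales <- data.frame(
  region = c("north", "north", "south", "south"),
  year   = c(2024, 2025, 2024, 2024),
  amount = c(5, 7, 2, 4)
)

# Default behavior: only the last grouping column (year) is dropped
by_region_year <- sales |>
  group_by(region, year) |>
  summarize(mean = mean(amount), count = n())

is_grouped_df(by_region_year)  # TRUE
group_vars(by_region_year)     # "region"

# Alternative to ungroup(): ask summarize() to drop all grouping
dropped <- sales |>
  group_by(region, year) |>
  summarize(mean = mean(amount), count = n(), .groups = "drop")

is_grouped_df(dropped)         # FALSE
```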

Joining tables

Basic joining

  • To combine two or more tables, use joins
library(dplyr)
library(readr)  # provides read_csv()

df <- read_csv("data.csv")
df2 <- read_csv("data2.csv")
df3 <- read_csv("data3.csv")

inner_join(df, df2, by = "col_name")

Types of joins

  • inner_join() keeps only rows that have matching values in both tables
  • full_join() keeps all rows from both tables, filling unmatched columns with NA
  • left_join() keeps all rows from the left (first) table and right_join() keeps all rows from the right (second) table
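
  • To make the difference concrete, here is a sketch with two hypothetical tables (left_tbl and right_tbl are made-up names) that share the ids 2 and 3 only. It uses dplyr's full_join() for the join that keeps all rows from both tables

```r
library(dplyr)

# Hypothetical toy tables: ids 2 and 3 appear in both
left_tbl  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right_tbl <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

n_inner <- nrow(inner_join(left_tbl, right_tbl, by = "id"))  # 2: ids 2, 3
n_left  <- nrow(left_join(left_tbl, right_tbl, by = "id"))   # 3: ids 1, 2, 3
n_right <- nrow(right_join(left_tbl, right_tbl, by = "id"))  # 3: ids 2, 3, 4
n_full  <- nrow(full_join(left_tbl, right_tbl, by = "id"))   # 4: ids 1, 2, 3, 4
```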

Joining multiple tables

  • You can join more than two tables by first joining two tables and then joining the result with additional tables
df |>
  inner_join(df2, by = "col_name_1") |>
  inner_join(df3, by = "col_name_2")

Converting between data frames and vectors

Extracting vectors from data frames

  • There are three ways to extract a vector from a data frame
df$col_name

df[["col_name"]]

pull(df, col_name)
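
  • As a quick check, the sketch below (people, name, and age are hypothetical names) suggests that all three forms return the same vector; pull() comes from dplyr and fits at the end of a pipe

```r
library(dplyr)

# Hypothetical toy data
people <- data.frame(name = c("Ana", "Ben"), age = c(31, 45))

ages_dollar  <- people$age        # base R, column name typed directly
ages_bracket <- people[["age"]]   # base R, column name as a string
ages_pull    <- pull(people, age) # dplyr, works well after |>

identical(ages_dollar, ages_bracket)  # TRUE
identical(ages_dollar, ages_pull)     # TRUE
```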

Creating a data frame from vectors

  • You can create a data frame from vectors using data.frame()
vector_1 <- c(1, 2, 3)
vector_2 <- c(1.5, 0.8, 2.6)
vector_3 <- c("A", "B", "C")

df <- data.frame(
  col_name_1 = vector_1,
  col_name_2 = vector_2,
  col_name_3 = vector_3
)
  • To include a column filled with a single value, pass that value instead of a vector; data.frame() repeats it for every row
vector_1 <- c(1, 2, 3)
vector_2 <- c(1.5, 0.8, 2.6)
vector_3 <- c("A", "B", "C")

df <- data.frame(
  col_name_1 = vector_1,
  col_name_2 = vector_2,
  col_name_3 = vector_3,
  year = 2026
)