Grouping & Joining Data (reference card)
Aggregating data
Basic grouping and summarizing
- Aggregation combines rows into groups based on the values in one or more columns and then calculates summary statistics for each group
- First group the data using group_by() and then use summarize() to calculate the summary statistics
library(dplyr)
library(readr)  # read_csv() comes from readr, not dplyr
df <- read_csv("data.csv")
# Group by one column and summarize a different column
df |>
  group_by(col_name_1) |>
  summarize(mean = mean(col_name_2), count = n())
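As a concrete illustration of the group-then-summarize pattern, here is a minimal sketch on a small made-up table. The data frame `measurements` and its columns `site` and `weight` are hypothetical names chosen for the example, not part of any real dataset.

```r
library(dplyr)

# Hypothetical example data: one row per measurement
measurements <- data.frame(
  site   = c("A", "A", "B", "B", "B"),
  weight = c(10, 20, 5, 7, 9)
)

# One output row per site, with the mean weight and the number of rows
site_summary <- measurements |>
  group_by(site) |>
  summarize(mean_weight = mean(weight), count = n())

site_summary
# site A: mean_weight 15, count 2
# site B: mean_weight 7,  count 3
```

The result has one row per distinct value of the grouping column, which is why the five input rows collapse to two.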
Grouping by multiple columns
- You can group by multiple columns by passing multiple column names to group_by()
df |>
  group_by(col_name_1, col_name_2) |>
  summarize(mean = mean(col_name_3), count = n())
Ungrouping
- When you group by more than one column, summarize() by default removes only the last grouping level, so the result is still grouped by the remaining columns
- You can drop the remaining grouping by adding .groups = "drop" to summarize()
df |>
  group_by(col_name_1, col_name_2) |>
  summarize(mean = mean(col_name_3), count = n(), .groups = "drop")
Joining tables
Basic joining
- To combine two or more tables we use joins
library(dplyr)
library(readr)  # read_csv() comes from readr
df <- read_csv("data.csv")
df2 <- read_csv("data2.csv")
df3 <- read_csv("data3.csv")
# Keep rows whose col_name value appears in both tables
inner_join(df, df2, join_by(col_name))
Types of joins
- inner_join() keeps only rows that have matching values in both tables
- full_join() keeps all rows from both tables (a full outer join)
- left_join() and right_join() keep all rows from the left or right table, respectively
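To make the difference between the join types concrete, here is a small sketch on two made-up tables. The names `left_tbl`, `right_tbl`, `id`, `name`, and `score` are hypothetical. Note that in dplyr the join that keeps all rows from both tables is full_join().

```r
library(dplyr)

# Hypothetical tables sharing the key column `id`
left_tbl  <- data.frame(id = c(1, 2, 3), name  = c("a", "b", "c"))
right_tbl <- data.frame(id = c(2, 3, 4), score = c(20, 30, 40))

# Only ids present in both tables: 2 and 3
inner_res <- inner_join(left_tbl, right_tbl, join_by(id))

# Every row of left_tbl: ids 1, 2, 3 (score is NA for id 1)
left_res <- left_join(left_tbl, right_tbl, join_by(id))

# Every row of right_tbl: ids 2, 3, 4 (name is NA for id 4)
right_res <- right_join(left_tbl, right_tbl, join_by(id))

# Every row of both tables: ids 1, 2, 3, 4
full_res <- full_join(left_tbl, right_tbl, join_by(id))
```

Rows without a match in the other table are kept by the left, right, and full joins, with NA filled in for the missing columns.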
Joining multiple tables
- You can join more than two tables by first joining two tables and then joining the result with additional tables
df |>
  inner_join(df2, join_by(col_name_1)) |>
  inner_join(df3, join_by(col_name_2))
Converting between data frames and vectors
Extracting vectors from data frames
- There are three ways to extract a vector from a data frame
df$col_name         # base R
df[["col_name"]]    # base R
pull(df, col_name)  # dplyr
Creating a data frame from vectors
- You can create a data frame from vectors using data.frame()
vector_1 <- c(1, 2, 3)
vector_2 <- c(1.5, 0.8, 2.6)
vector_3 <- c("A", "B", "C")
df <- data.frame(
  col_name_1 = vector_1,
  col_name_2 = vector_2,
  col_name_3 = vector_3
)
- To include a column that is filled with a single value, pass the value instead of a vector
vector_1 <- c(1, 2, 3)
vector_2 <- c(1.5, 0.8, 2.6)
vector_3 <- c("A", "B", "C")
df <- data.frame(
  col_name_1 = vector_1,
  col_name_2 = vector_2,
  col_name_3 = vector_3,
  year = 2026  # a single value is repeated for every row
)
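The two directions of conversion can be checked together in a short round trip. The sketch below builds a data frame from vectors, including a single-value column, and then pulls one column back out in each of the three ways. The names `ids`, `label_values`, and `example_df` are hypothetical.

```r
library(dplyr)

ids          <- c(1, 2, 3)
label_values <- c("A", "B", "C")

# `year` is a single value, so data.frame() repeats it for every row
example_df <- data.frame(id = ids, label = label_values, year = 2026)

# All three extraction styles return the same vector
by_dollar  <- example_df$label
by_bracket <- example_df[["label"]]
by_pull    <- pull(example_df, label)

identical(by_dollar, by_bracket) && identical(by_bracket, by_pull)
```

The final line evaluates to TRUE when all three approaches agree, and `example_df$year` holds 2026 in each of the three rows.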