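library(dplyr)
library(readr)  # read_csv() comes from readr, not dplyr
df <- read_csv("data.csv")
# group by one column, then summarize a different column within each group
df |>
  group_by(col_name_1) |>
  summarize(mean = mean(col_name_2), count = n())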
Grouping & Joining Data (reference card)
Aggregating data
Basic grouping and summarizing
- Aggregation combines rows into groups based on the values in one or more columns and then calculates summary statistics for each group
- First group the data using group_by() and then use summarize() to calculate the summary statistics
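As a small worked illustration of grouping and then summarizing, here is a sketch on a hypothetical toy data frame (the column names group and value are made up for this example, not taken from any real dataset):

```r
library(dplyr)

# Hypothetical toy data: three measurements spread over two groups
df <- data.frame(
  group = c("a", "a", "b"),
  value = c(1, 3, 10)
)

# One output row per group, with the group's mean and row count
result <- df |>
  group_by(group) |>
  summarize(mean = mean(value), count = n())

print(result)
```

Group "a" has two rows (mean 2) and group "b" has one row (mean 10), so the result has two rows.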
Grouping by multiple columns
- You can group by multiple columns by passing multiple column names to group_by()
df |>
  group_by(col_name_1, col_name_2) |>
  summarize(mean = mean(col_name_3), count = n())
Ungrouping
- When you group by more than one column, summarize() by default drops only the last grouping column, so the result is still grouped by the remaining columns
- You can remove this remaining grouping using ungroup()
df |>
  group_by(col_name_1, col_name_2) |>
  summarize(mean = mean(col_name_3), count = n()) |>
  ungroup()
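To see the leftover grouping directly, the sketch below inspects the result with dplyr's group_vars() before and after ungroup(). The data frame and its column names (site, year, value) are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with two grouping columns
df <- data.frame(
  site  = c("x", "x", "y", "y"),
  year  = c(2020, 2021, 2020, 2020),
  value = c(1, 2, 3, 5)
)

grouped <- df |>
  group_by(site, year) |>
  summarize(mean = mean(value), count = n())

# Only the last grouping column (year) was dropped by summarize()
group_vars(grouped)           # "site"

# After ungroup() no grouping columns remain
group_vars(ungroup(grouped))  # character(0)
```

An alternative that should give the same ungrouped result is passing .groups = "drop" to summarize(), which also silences the message about the grouped output.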
Joining tables
Basic joining
- To combine two or more tables we use joins
library(dplyr)
library(readr)  # provides read_csv()
df <- read_csv("data.csv")
df2 <- read_csv("data2.csv")
df3 <- read_csv("data3.csv")
# keep rows whose col_name value appears in both df and df2
inner_join(df, df2, by = "col_name")
Types of joins
- inner_join() keeps only rows that have matching values in both tables
- full_join() (a full outer join) keeps all rows from both tables
- left_join() and right_join() keep all rows from the left or right table, respectively
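The difference between the join types is easiest to see by counting rows. This is a minimal sketch on two hypothetical toy tables (left, right, and the id column are illustrative names); in dplyr the join that keeps all rows from both tables is full_join().

```r
library(dplyr)

# Hypothetical toy tables that share only ids 2 and 3
left  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

n_inner <- nrow(inner_join(left, right, by = "id"))  # ids 2, 3
n_full  <- nrow(full_join(left, right, by = "id"))   # ids 1, 2, 3, 4
n_left  <- nrow(left_join(left, right, by = "id"))   # ids 1, 2, 3
n_right <- nrow(right_join(left, right, by = "id"))  # ids 2, 3, 4

c(inner = n_inner, full = n_full, left = n_left, right = n_right)
```

Rows without a match are kept by the full, left, and right joins with NA in the columns that come from the other table.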
Joining multiple tables
- You can join more than two tables by first joining two tables and then joining the result with additional tables
df |>
  inner_join(df2, by = "col_name_1") |>
  inner_join(df3, by = "col_name_2")
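A worked version of the chained join, assuming three hypothetical tables (orders, customers, products) invented for illustration. Each inner_join() narrows the result to rows that match in the next table.

```r
library(dplyr)

# Hypothetical toy tables
orders <- data.frame(
  order_id    = c(1, 2),
  customer_id = c(10, 20),
  product_id  = c(100, 200)
)
customers <- data.frame(customer_id = c(10, 20), name = c("Ann", "Bob"))
products  <- data.frame(product_id = c(100, 300), product = c("pen", "ink"))

# Join orders to customers first, then join that result to products
joined <- orders |>
  inner_join(customers, by = "customer_id") |>
  inner_join(products, by = "product_id")

print(joined)
```

Order 2 refers to product 200, which is missing from products, so only order 1 survives the second inner join.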
Converting between data frames and vectors
Extracting vectors from data frames
- There are three ways to extract a vector from a data frame
df$col_name
df[["col_name"]]
pull(df, col_name)
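A quick check, on a hypothetical one-column data frame, that the three forms return the same vector. For contrast, single brackets are included as well, since they return a data frame rather than a vector.

```r
library(dplyr)  # needed for pull()

# Hypothetical toy data frame
df <- data.frame(col_name = c(1.5, 0.8, 2.6))

v1 <- df$col_name
v2 <- df[["col_name"]]
v3 <- pull(df, col_name)

# All three are the same plain numeric vector
same <- identical(v1, v2) && identical(v2, v3)

# Single brackets keep the data frame structure instead
still_df <- is.data.frame(df["col_name"])
```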
Creating a data frame from vectors
- You can create a data frame from vectors using data.frame()
vector_1 <- c(1, 2, 3)
vector_2 <- c(1.5, 0.8, 2.6)
vector_3 <- c("A", "B", "C")
df <- data.frame(
  col_name_1 = vector_1,
  col_name_2 = vector_2,
  col_name_3 = vector_3
)
- To include a column that is filled with a single value, pass the value instead of a vector
vector_1 <- c(1, 2, 3)
vector_2 <- c(1.5, 0.8, 2.6)
vector_3 <- c("A", "B", "C")
df <- data.frame(
  col_name_1 = vector_1,
  col_name_2 = vector_2,
  col_name_3 = vector_3,
  year = 2026
)
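The single value works because data.frame() recycles it to the length of the other columns. The sketch below shows the recycled column, and also that a vector whose length does not divide the number of rows evenly is rejected (the names a, b, and bad are illustrative).

```r
# A single value is repeated for every row
df <- data.frame(
  col_name_1 = c(1, 2, 3),
  col_name_2 = c("A", "B", "C"),
  year = 2026
)
df$year  # 2026 2026 2026

# A length-2 vector cannot be recycled into 3 rows, so data.frame() errors;
# tryCatch() turns the error into a marker string for this demonstration
bad <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) "error"
)
```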