The group_by()
function in dplyr adds a “groups” attribute to a tibble
:
library(tidyverse)
msleep
## # A tibble: 83 x 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Chee… Acin… carni Carn… lc 12.1 NA NA 11.9
## 2 Owl … Aotus omni Prim… <NA> 17 1.8 NA 7
## 3 Moun… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
## 4 Grea… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
## 5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
## 6 Thre… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
## 7 Nort… Call… carni Carn… vu 8.7 1.4 0.383 15.3
## 8 Vesp… Calo… <NA> Rode… <NA> 7 NA NA 17
## 9 Dog Canis carni Carn… domesticated 10.1 2.9 0.333 13.9
## 10 Roe … Capr… herbi Arti… lc 3 NA NA 21
## # … with 73 more rows, and 2 more variables: brainwt <dbl>, bodywt <dbl>
grpd_msleep <-
msleep %>%
group_by(vore)
grpd_msleep
## # A tibble: 83 x 11
## # Groups: vore [5]
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Chee… Acin… carni Carn… lc 12.1 NA NA 11.9
## 2 Owl … Aotus omni Prim… <NA> 17 1.8 NA 7
## 3 Moun… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
## 4 Grea… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
## 5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
## 6 Thre… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
## 7 Nort… Call… carni Carn… vu 8.7 1.4 0.383 15.3
## 8 Vesp… Calo… <NA> Rode… <NA> 7 NA NA 17
## 9 Dog Canis carni Carn… domesticated 10.1 2.9 0.333 13.9
## 10 Roe … Capr… herbi Arti… lc 3 NA NA 21
## # … with 73 more rows, and 2 more variables: brainwt <dbl>, bodywt <dbl>
attributes(grpd_msleep)$groups
## # A tibble: 5 x 2
## vore .rows
## * <chr> <list<int>>
## 1 carni [19]
## 2 herbi [32]
## 3 insecti [5]
## 4 omni [20]
## 5 <NA> [7]
You can compute summary measures like mean()
and sd()
of values in other columns in the tibble
within groups. Another use-case would be to run statistical tests or models within each grouping. The group_modify()
function in dplyr makes it relatively easy to do that.
The following will use the Titanic
data set to demonstrate how to run a logistic regression model with group_modify()
.
First, the Titanic
data needs to be converted from a table
object to a tibble
.
Titanic %>%
## convert to tibble
as_tibble()
## # A tibble: 32 x 5
## Class Sex Age Survived n
## <chr> <chr> <chr> <chr> <dbl>
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
## 7 3rd Female Child No 17
## 8 Crew Female Child No 0
## 9 1st Male Adult No 118
## 10 2nd Male Adult No 154
## # … with 22 more rows
Next, we need to convert the data from a summary of counts per group to one row per observation. Here we’ll use uncount()
from the tidyr package, using the variable “n” as the weight (i.e. number of times to repeat each row).
Titanic %>%
## convert to tibble
as_tibble() %>%
## expand counts
uncount(n)
## # A tibble: 2,201 x 4
## Class Sex Age Survived
## <chr> <chr> <chr> <chr>
## 1 3rd Male Child No
## 2 3rd Male Child No
## 3 3rd Male Child No
## 4 3rd Male Child No
## 5 3rd Male Child No
## 6 3rd Male Child No
## 7 3rd Male Child No
## 8 3rd Male Child No
## 9 3rd Male Child No
## 10 3rd Male Child No
## # … with 2,191 more rows
For this example we will run logistic regression to ascertain odds of survival by sex, stratified by passenger class. In order to make sure we’re interpreting the output correctly, we coerce the “Survived” and “Sex” columns to factors and explicitly define the reference levels.
Note that this code uses as_factor()
and fct_relevel()
from forcats.
Titanic %>%
## convert to tibble
as_tibble() %>%
## expand counts
uncount(n) %>%
## convert 'Survived' column to factor and relevel
## same for `Sex`
mutate(Survived = as_factor(Survived),
Survived = fct_relevel(Survived, c("Yes","No")),
Sex = as_factor(Sex),
Sex = fct_relevel(Sex, c("Female", "Male")))
## # A tibble: 2,201 x 4
## Class Sex Age Survived
## <chr> <fct> <chr> <fct>
## 1 3rd Male Child No
## 2 3rd Male Child No
## 3 3rd Male Child No
## 4 3rd Male Child No
## 5 3rd Male Child No
## 6 3rd Male Child No
## 7 3rd Male Child No
## 8 3rd Male Child No
## 9 3rd Male Child No
## 10 3rd Male Child No
## # … with 2,191 more rows
Next we’ll group by the “Class” variable. The logistic regression to follow will be performed independently on each group of the data.
Titanic %>%
## convert to tibble
as_tibble() %>%
## expand counts
uncount(n) %>%
## convert 'Survived' column to factor and relevel
## same for `Sex`
mutate(Survived = as_factor(Survived),
Survived = fct_relevel(Survived, c("Yes","No")),
Sex = as_factor(Sex),
Sex = fct_relevel(Sex, c("Female", "Male"))) %>%
## group by 'Class'
group_by(Class)
## # A tibble: 2,201 x 4
## # Groups: Class [4]
## Class Sex Age Survived
## <chr> <fct> <chr> <fct>
## 1 3rd Male Child No
## 2 3rd Male Child No
## 3 3rd Male Child No
## 4 3rd Male Child No
## 5 3rd Male Child No
## 6 3rd Male Child No
## 7 3rd Male Child No
## 8 3rd Male Child No
## 9 3rd Male Child No
## 10 3rd Male Child No
## # … with 2,191 more rows
With the grouping in place we can run the logistic regression with glm(..., family = binomial)
. The “Survived” variable is the response and “Sex” is the predictor. Again, this will operate independently on each group (passenger class) in the data. The .f
argument to group_modify()
can accept either a function or formula notation. In this case we use the formula (with ~
) because that allows us to refer to the subset of the rows in each “Class” group.
There are four levels of the “Class” variable, and therefore the output of the glm()
will be four different model objects. When wrapped around this output, the tidy()
function from broom stacks the summary of coefficients from each of these models on top of one another in a tibble
.
Titanic %>%
## convert to tibble
as_tibble() %>%
## expand counts
uncount(n) %>%
## convert 'Survived' column to factor and relevel
## same for `Sex`
mutate(Survived = as_factor(Survived),
Survived = fct_relevel(Survived, c("Yes","No")),
Sex = as_factor(Sex),
Sex = fct_relevel(Sex, c("Female", "Male"))) %>%
## group by 'Class'
group_by(Class) %>%
## run logistic regression for survival status by sex within classes
group_modify(.f = ~ broom::tidy(glm(Survived ~ Sex, data = .x, family = binomial)))
## # A tibble: 8 x 6
## # Groups: Class [4]
## Class term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1st (Intercept) -3.56 0.507 -7.03 2.13e-12
## 2 1st SexMale 4.21 0.531 7.92 2.29e-15
## 3 2nd (Intercept) -1.97 0.296 -6.65 3.03e-11
## 4 2nd SexMale 3.79 0.366 10.3 4.88e-25
## 5 3rd (Intercept) 0.164 0.143 1.14 2.54e- 1
## 6 3rd SexMale 1.40 0.185 7.58 3.36e-14
## 7 Crew (Intercept) -1.90 0.619 -3.06 2.18e- 3
## 8 Crew SexMale 3.15 0.625 5.04 4.68e- 7
Finally we’ll exponentiate the beta coefficients to get the odds ratio. Then do a little bit of clean up to filter out the estimates for the intercept terms and only select the “Class” and odds ratio (“OR”) columns.
Titanic %>%
## convert to tibble
as_tibble() %>%
## expand counts
uncount(n) %>%
## convert 'Survived' column to factor and relevel
## same for `Sex`
mutate(Survived = as_factor(Survived),
Survived = fct_relevel(Survived, c("Yes","No")),
Sex = as_factor(Sex),
Sex = fct_relevel(Sex, c("Female", "Male"))) %>%
## group by 'Class'
group_by(Class) %>%
## run logistic regression for survival status by sex within classes
group_modify(~ broom::tidy(glm(Survived ~ Sex, data = .x, family = binomial))) %>%
## exponentiate beta coefficient to get odds ratio
mutate(OR = exp(estimate)) %>%
## clean up
filter(term != "(Intercept)") %>%
select(Class, OR)
## # A tibble: 4 x 2
## # Groups: Class [4]
## Class OR
## <chr> <dbl>
## 1 1st 67.1
## 2 2nd 44.1
## 3 3rd 4.07
## 4 Crew 23.3