Modeling in Groups Using the dplyr group_modify Function

Nov 27, 2020 7 min read notes

The group_by() function in dplyr adds a “groups” attribute to a tibble:

library(tidyverse)

msleep

## # A tibble: 83 x 11
##    name  genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##    <chr> <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
##  1 Chee… Acin… carni Carn… lc                  12.1      NA        NA      11.9
##  2 Owl … Aotus omni  Prim… <NA>                17         1.8      NA       7  
##  3 Moun… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
##  4 Grea… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
##  5 Cow   Bos   herbi Arti… domesticated         4         0.7       0.667  20  
##  6 Thre… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
##  7 Nort… Call… carni Carn… vu                   8.7       1.4       0.383  15.3
##  8 Vesp… Calo… <NA>  Rode… <NA>                 7        NA        NA      17  
##  9 Dog   Canis carni Carn… domesticated        10.1       2.9       0.333  13.9
## 10 Roe … Capr… herbi Arti… lc                   3        NA        NA      21  
## # … with 73 more rows, and 2 more variables: brainwt <dbl>, bodywt <dbl>

grpd_msleep <-
  msleep %>%
  group_by(vore)

grpd_msleep

## # A tibble: 83 x 11
## # Groups:   vore [5]
##    name  genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##    <chr> <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
##  1 Chee… Acin… carni Carn… lc                  12.1      NA        NA      11.9
##  2 Owl … Aotus omni  Prim… <NA>                17         1.8      NA       7  
##  3 Moun… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
##  4 Grea… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
##  5 Cow   Bos   herbi Arti… domesticated         4         0.7       0.667  20  
##  6 Thre… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
##  7 Nort… Call… carni Carn… vu                   8.7       1.4       0.383  15.3
##  8 Vesp… Calo… <NA>  Rode… <NA>                 7        NA        NA      17  
##  9 Dog   Canis carni Carn… domesticated        10.1       2.9       0.333  13.9
## 10 Roe … Capr… herbi Arti… lc                   3        NA        NA      21  
## # … with 73 more rows, and 2 more variables: brainwt <dbl>, bodywt <dbl>

attributes(grpd_msleep)$groups

## # A tibble: 5 x 2
##   vore          .rows
## * <chr>   <list<int>>
## 1 carni          [19]
## 2 herbi          [32]
## 3 insecti         [5]
## 4 omni           [20]
## 5 <NA>            [7]

You can compute summary measures like mean() and sd() of values in other columns in the tibble within groups. Another use-case would be to run statistical tests or models within each grouping. The group_modify() function in dplyr makes it relatively easy to do that.

The following will use the Titanic data set to demonstrate how to run a logistic regression model with group_modify().

First, the Titanic data needs to be converted from a table object to a tibble.

Titanic %>%
  ## convert to tibble
  as_tibble()

## # A tibble: 32 x 5
##    Class Sex    Age   Survived     n
##    <chr> <chr>  <chr> <chr>    <dbl>
##  1 1st   Male   Child No           0
##  2 2nd   Male   Child No           0
##  3 3rd   Male   Child No          35
##  4 Crew  Male   Child No           0
##  5 1st   Female Child No           0
##  6 2nd   Female Child No           0
##  7 3rd   Female Child No          17
##  8 Crew  Female Child No           0
##  9 1st   Male   Adult No         118
## 10 2nd   Male   Adult No         154
## # … with 22 more rows

Next, we need to convert the data from a summary of counts per group to one row per observation. Here we’ll use uncount() from the tidyr package, using the variable “n” as the weight (i.e. number of times to repeat each row).

Titanic %>%
  ## convert to tibble
  as_tibble() %>%
  ## expand counts
  uncount(n)

## # A tibble: 2,201 x 4
##    Class Sex   Age   Survived
##    <chr> <chr> <chr> <chr>   
##  1 3rd   Male  Child No      
##  2 3rd   Male  Child No      
##  3 3rd   Male  Child No      
##  4 3rd   Male  Child No      
##  5 3rd   Male  Child No      
##  6 3rd   Male  Child No      
##  7 3rd   Male  Child No      
##  8 3rd   Male  Child No      
##  9 3rd   Male  Child No      
## 10 3rd   Male  Child No      
## # … with 2,191 more rows

For this example we will run logistic regression to ascertain odds of survival by sex, stratified by passenger class. In order to make sure we’re interpreting the output correctly, we coerce the “Survived” and “Sex” columns to factors and explicitly define the reference levels.

Note that this code uses as_factor() and fct_relevel() from forcats.

Titanic %>%
  ## convert to tibble
  as_tibble() %>%
  ## expand counts
  uncount(n) %>%
  ## convert 'Survived' column to factor and relevel
  ## same for `Sex`
  mutate(Survived = as_factor(Survived),
         Survived = fct_relevel(Survived, c("Yes","No")),
         Sex = as_factor(Sex),
         Sex = fct_relevel(Sex, c("Female", "Male")))

## # A tibble: 2,201 x 4
##    Class Sex   Age   Survived
##    <chr> <fct> <chr> <fct>   
##  1 3rd   Male  Child No      
##  2 3rd   Male  Child No      
##  3 3rd   Male  Child No      
##  4 3rd   Male  Child No      
##  5 3rd   Male  Child No      
##  6 3rd   Male  Child No      
##  7 3rd   Male  Child No      
##  8 3rd   Male  Child No      
##  9 3rd   Male  Child No      
## 10 3rd   Male  Child No      
## # … with 2,191 more rows

Next we’ll group by the “Class” variable. The logistic regression to follow will be performed independently on each group of the data.

Titanic %>%
  ## convert to tibble
  as_tibble() %>%
  ## expand counts
  uncount(n) %>%
  ## convert 'Survived' column to factor and relevel
  ## same for `Sex`
  mutate(Survived = as_factor(Survived),
         Survived = fct_relevel(Survived, c("Yes","No")),
         Sex = as_factor(Sex),
         Sex = fct_relevel(Sex, c("Female", "Male"))) %>%
  ## group by 'Class'
  group_by(Class)

## # A tibble: 2,201 x 4
## # Groups:   Class [4]
##    Class Sex   Age   Survived
##    <chr> <fct> <chr> <fct>   
##  1 3rd   Male  Child No      
##  2 3rd   Male  Child No      
##  3 3rd   Male  Child No      
##  4 3rd   Male  Child No      
##  5 3rd   Male  Child No      
##  6 3rd   Male  Child No      
##  7 3rd   Male  Child No      
##  8 3rd   Male  Child No      
##  9 3rd   Male  Child No      
## 10 3rd   Male  Child No      
## # … with 2,191 more rows

With the grouping in place we can run the logistic regression with glm(..., family = binomial). The “Survived” variable is the response and “Sex” is the predictor. Again, this will operate independently on each group (passenger class) in the data. The .f argument to group_modify() can accept either a function or formula notation. In this case we use the formula (with ~) because that allows us to refer to the subset of the rows in each “Class” group.

There are four levels of the “Class” variable, and therefore the output of the glm() will be four different model objects. When wrapped around this output, the tidy() function from broom stacks the summary of coefficients from each of these models on top of one another in a tibble.

Titanic %>%
  ## convert to tibble
  as_tibble() %>%
  ## expand counts
  uncount(n) %>%
  ## convert 'Survived' column to factor and relevel
  ## same for `Sex`
  mutate(Survived = as_factor(Survived),
         Survived = fct_relevel(Survived, c("Yes","No")),
         Sex = as_factor(Sex),
         Sex = fct_relevel(Sex, c("Female", "Male"))) %>%
  ## group by 'Class'
  group_by(Class) %>%
  ## run logistic regression for survival status by sex within classes
  group_modify(.f = ~ broom::tidy(glm(Survived ~ Sex, data = .x, family = binomial)))

## # A tibble: 8 x 6
## # Groups:   Class [4]
##   Class term        estimate std.error statistic  p.value
##   <chr> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 1st   (Intercept)   -3.56      0.507     -7.03 2.13e-12
## 2 1st   SexMale        4.21      0.531      7.92 2.29e-15
## 3 2nd   (Intercept)   -1.97      0.296     -6.65 3.03e-11
## 4 2nd   SexMale        3.79      0.366     10.3  4.88e-25
## 5 3rd   (Intercept)    0.164     0.143      1.14 2.54e- 1
## 6 3rd   SexMale        1.40      0.185      7.58 3.36e-14
## 7 Crew  (Intercept)   -1.90      0.619     -3.06 2.18e- 3
## 8 Crew  SexMale        3.15      0.625      5.04 4.68e- 7

Finally we’ll exponentiate the beta coefficients to get the odds ratio. Then do a little bit of clean up to filter out the estimates for the intercept terms and only select the “Class” and odds ratio (“OR”) columns.

Titanic %>%
  ## convert to tibble
  as_tibble() %>%
  ## expand counts
  uncount(n) %>%
  ## convert 'Survived' column to factor and relevel
  ## same for `Sex`
  mutate(Survived = as_factor(Survived),
         Survived = fct_relevel(Survived, c("Yes","No")),
         Sex = as_factor(Sex),
         Sex = fct_relevel(Sex, c("Female", "Male"))) %>%
  ## group by 'Class'
  group_by(Class) %>%
  ## run logistic regression for survival status by sex within classes
  group_modify(~ broom::tidy(glm(Survived ~ Sex, data = .x, family = binomial))) %>%
  ## exponentiate beta coefficient to get odds ratio
  mutate(OR = exp(estimate)) %>%
  ## clean up
  filter(term != "(Intercept)") %>%
  select(Class, OR)

## # A tibble: 4 x 2
## # Groups:   Class [4]
##   Class    OR
##   <chr> <dbl>
## 1 1st   67.1 
## 2 2nd   44.1 
## 3 3rd    4.07
## 4 Crew  23.3

r dplyr

Modeling in Groups Using the dplyr group_modify Function

Related