How to efficiently convert a long data.frame to individual data.frames with 0/1 indicators

mydata <- data.frame(id = c(1,1,1,2,2,3,4),
                     hobby = c("music", "sports", "science", "science", "lifestyle", 
                               "party", "sports"),
                     x = c(10, 10, 10, 23, 23, 11, 0),
                     y = c(78, 78, 78, 55, 55, 22, 9))

> mydata
  id     hobby  x  y
1  1     music 10 78
2  1    sports 10 78
3  1   science 10 78
4  2   science 23 55
5  2 lifestyle 23 55
6  3     party 11 22
7  4    sports  0  9

I have a data.frame in long format with 5 unique hobbies: music, sports, science, lifestyle, and party. What's a quick way in R to obtain 5 data.frames, one per hobby, in which the hobby column is a 0/1 indicator for that hobby?

The reason is that I want to run the following regression model 5 separate times, once for each unique hobby:

glm(y ~ hobby + offset(x), family = "poisson", data = dat_music)
glm(y ~ hobby + offset(x), family = "poisson", data = dat_sports)
glm(y ~ hobby + offset(x), family = "poisson", data = dat_science)
glm(y ~ hobby + offset(x), family = "poisson", data = dat_lifestyle)
glm(y ~ hobby + offset(x), family = "poisson", data = dat_party)

For each hobby, I want to summarize the data where each row corresponds to a unique id.

For dat_music, I want to have:

  id hobby  x  y
1  1     1 10 78
2  2     0 23 55
3  3     0 11 22
4  4     0  0  9

For dat_sports, I want to have:

  id hobby  x  y
1  1     1 10 78
2  2     0 23 55
3  3     0 11 22
4  4     1  0  9

And so forth. Suppose that in reality I have 50k unique hobbies. What's an efficient way to do this in R?

User response:

Here's a purrr/dplyr version:

library(dplyr)
library(purrr)

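## Group once up front: one group per id (x and y are constant within id),
## so each summarize() call below returns one row per id.
mydata = mydata %>% group_by(id, x, y)

hobbies = unique(mydata$hobby)

## For each hobby, flag (0/1) whether it appears among that id's hobbies.
## set_names() names the result list by hobby; the \(x) lambda needs R >= 4.1.
results = map(
  .x = set_names(hobbies),
  .f = \(hobby_i) mydata %>%
    summarize(
      hobby = as.integer(hobby_i %in% hobby),
      .groups = "drop"
    )
)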
results
# $music
# # A tibble: 4 × 4
#      id     x     y hobby
#   <dbl> <dbl> <dbl> <int>
# 1     1    10    78     1
# 2     2    23    55     0
# 3     3    11    22     0
# 4     4     0     9     0
# 
# $sports
# # A tibble: 4 × 4
#      id     x     y hobby
#   <dbl> <dbl> <dbl> <int>
# 1     1    10    78     1
# 2     2    23    55     0
# 3     3    11    22     0
# 4     4     0     9     1
# 
# $science
# # A tibble: 4 × 4
#      id     x     y hobby
#   <dbl> <dbl> <dbl> <int>
# 1     1    10    78     1
# 2     2    23    55     1
# 3     3    11    22     0
# 4     4     0     9     0
# 
# $lifestyle
# # A tibble: 4 × 4
#      id     x     y hobby
#   <dbl> <dbl> <dbl> <int>
# 1     1    10    78     0
# 2     2    23    55     1
# 3     3    11    22     0
# 4     4     0     9     0
# 
# $party
# # A tibble: 4 × 4
#      id     x     y hobby
#   <dbl> <dbl> <dbl> <int>
# 1     1    10    78     0
# 2     2    23    55     0
# 3     3    11    22     1
# 4     4     0     9     0
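
If the end goal is only the model fits, one possible variation is to skip storing a data frame per hobby and build each 0/1 column on the fly. The sketch below uses base R only and is an assumption on my part, not something taken from the question: it computes an id-by-hobby membership matrix once with table(), then fits the question's model per hobby. The names ids, member, and fits are made up for illustration, and the formula assumes the operator between hobby and offset(x) was meant to be +. With 50k hobbies a dense membership matrix may get large, so a sparse representation would likely be a better fit there.

```r
# Example data from the question
mydata <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 4),
  hobby = c("music", "sports", "science", "science", "lifestyle",
            "party", "sports"),
  x = c(10, 10, 10, 23, 23, 11, 0),
  y = c(78, 78, 78, 55, 55, 22, 9)
)

# One row per id (assumes x and y are constant within id, as in the example)
ids <- unique(mydata[c("id", "x", "y")])

# Logical id-by-hobby membership matrix, computed once
member <- unclass(table(mydata$id, mydata$hobby)) > 0

hobbies <- unique(mydata$hobby)

# Fit one model per hobby, creating the 0/1 indicator just before the fit
fits <- lapply(setNames(hobbies, hobbies), function(h) {
  dat <- ids
  dat$hobby <- as.integer(member[as.character(dat$id), h])
  glm(y ~ hobby + offset(x), family = "poisson", data = dat)
})

# Pull the hobby coefficient out of each fit
hobby_coefs <- sapply(fits, function(f) coef(f)[["hobby"]])
```

fits is a list named by hobby, so fits$music or summary(fits$sports) gives the individual models. On data this small the fits may come with warnings, so treat the numbers as illustrative only.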