mydata <- data.frame(id = c(1,1,1,2,2,3,4),
                     hobby = c("music", "sports", "science", "science", "lifestyle",
                               "party", "sports"),
                     x = c(10, 10, 10, 23, 23, 11, 0),
                     y = c(78, 78, 78, 55, 55, 22, 9))
> mydata
id hobby x y
1 1 music 10 78
2 1 sports 10 78
3 1 science 10 78
4 2 science 23 55
5 2 lifestyle 23 55
6 3 party 11 22
7 4 sports 0 9
I have a data.frame in long format with 5 unique hobbies: music, sports, science, lifestyle, and party. What's a quick way in R to obtain 5 data.frames, one per hobby, in which the hobby column is a 0/1 indicator of whether that id has the hobby?
The reason is that I want to run the following regression model 5 separate times, once for each unique hobby:
glm(y ~ hobby + offset(x), family = "poisson", data = dat_music)
glm(y ~ hobby + offset(x), family = "poisson", data = dat_sports)
glm(y ~ hobby + offset(x), family = "poisson", data = dat_science)
glm(y ~ hobby + offset(x), family = "poisson", data = dat_lifestyle)
glm(y ~ hobby + offset(x), family = "poisson", data = dat_party)
For each hobby, I want to summarize the data so that each row corresponds to a unique id.
For dat_music, I want to have:
id hobby x y
1 1 1 10 78
2 2 0 23 55
3 3 0 11 22
4 4 0 0 9
For dat_sports, I want to have:
id hobby x y
1 1 1 10 78
2 2 0 23 55
3 3 0 11 22
4 4 1 0 9
And so forth. Suppose that in reality I have 50k unique hobbies. What's an efficient way to do this in R?
CodePudding user response:
Here's a purrr/dplyr version:
library(dplyr)
library(purrr)
## group the data in advance
mydata = mydata %>% group_by(id, x, y)
hobbies = unique(mydata$hobby)
results = map(
  .x = set_names(hobbies),
  .f = \(hobby_i) mydata %>%
    summarize(
      hobby = as.integer(hobby_i %in% hobby),
      .groups = "drop"
    )
)
results
# $music
# # A tibble: 4 × 4
# id x y hobby
# <dbl> <dbl> <dbl> <int>
# 1 1 10 78 1
# 2 2 23 55 0
# 3 3 11 22 0
# 4 4 0 9 0
#
# $sports
# # A tibble: 4 × 4
# id x y hobby
# <dbl> <dbl> <dbl> <int>
# 1 1 10 78 1
# 2 2 23 55 0
# 3 3 11 22 0
# 4 4 0 9 1
#
# $science
# # A tibble: 4 × 4
# id x y hobby
# <dbl> <dbl> <dbl> <int>
# 1 1 10 78 1
# 2 2 23 55 1
# 3 3 11 22 0
# 4 4 0 9 0
#
# $lifestyle
# # A tibble: 4 × 4
# id x y hobby
# <dbl> <dbl> <dbl> <int>
# 1 1 10 78 0
# 2 2 23 55 1
# 3 3 11 22 0
# 4 4 0 9 0
#
# $party
# # A tibble: 4 × 4
# id x y hobby
# <dbl> <dbl> <dbl> <int>
# 1 1 10 78 0
# 2 2 23 55 0
# 3 3 11 22 1
# 4 4 0 9 0
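
With 50k hobbies, keeping one data.frame per hobby may use more memory than necessary if only the model output is needed. A minimal base R sketch of one alternative, assuming x and y are constant within each id (as in the toy data): build the one-row-per-id table once, fill in the indicator for one hobby at a time, fit the model from the question, and keep only the coefficients. The helper names make_dat and fit_one are hypothetical, introduced here for illustration.

```r
# Toy data from the question
mydata <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3, 4),
  hobby = c("music", "sports", "science", "science", "lifestyle",
            "party", "sports"),
  x     = c(10, 10, 10, 23, 23, 11, 0),
  y     = c(78, 78, 78, 55, 55, 22, 9)
)

# One row per id; assumes x and y do not vary within an id
base <- unique(mydata[c("id", "x", "y")])

# For each hobby, the ids that report it (computed once)
ids_by_hobby <- split(mydata$id, mydata$hobby)

# Hypothetical helper: the one-row-per-id data for a single hobby,
# with hobby recoded as a 0/1 indicator
make_dat <- function(h) {
  d <- base
  d$hobby <- as.integer(d$id %in% ids_by_hobby[[h]])
  d
}

# Hypothetical helper: fit the question's model for one hobby and
# return only the two coefficients, so the data is not retained
fit_one <- function(h) {
  fit <- glm(y ~ hobby + offset(x), family = "poisson", data = make_dat(h))
  coef(fit)
}

# 2 x (number of hobbies) matrix: rows are coefficients, columns are hobbies
coefs <- vapply(names(ids_by_hobby), fit_one, numeric(2))
```

If the per-hobby data.frames themselves are needed, lapply(names(ids_by_hobby), make_dat) gives the same tables as results above, as plain data.frames with the columns in the order id, x, y, hobby.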