I have got a dataframe that contains authors and their genre as well as other variables.
The problem I am facing is that the genre variable is a single character string like "fantasy, fiction, romance". Thereby I can not group the dataframe by genre as it groups for each combination of genres.
I have tried an in my opinion way to complex approach, where I used stringr::str_split(",") to split the strings up. The results was a list, which I transformed into factor-type. This gave me a list of factor that I used to redefine my genre column. The result is still not usable for my analysis. So I am wondering which other approach makes the most sense.
This is the minimized version of my dataset.
df <- data.frame(author= c("w","x","y","z"),
genre= c("a,b,c","a","c,d","a,b,c,d"))
What I am looking for is grouped genre variable, in an output like this:
> df
genre n
1 a 3
2 b 2
3 c 3
4 d 2
What I get is:
df %>% group_by(genre) %>% summarise(n=n())
# A tibble: 4 x 2
genre n
<chr> <int>
1 a 1
2 a,b,c 1
3 a,b,c,d 1
4 c,d 1
As you can see, the strings are not split up.
What I tried was:
df <- data.frame(autor= c("w","x","y","z"),
genre= c("a,b,c","a","c,d","a,b,c,d"))
df$genre<-df$genre %>% map(str_split,",") %>% map(as.factor) %>% unlist()
which did not bring me any further, as I am still not able to group by genre, in a way that makes sense.
After group_by(genre) i get this output:
> df %>% group_by(genre) %>% summarise(n=n())
# A tibble: 4 x 2
genre n
<fct> <int>
1 "c(\"a\", \"b\", \"c\")" 1
2 "a" 1
3 "c(\"c\", \"d\")" 1
4 "c(\"a\", \"b\", \"c\", \"d\")" 1
The problem does not seem to be to complex, but unfortunately I can not quite figure it out.
I am really looking forward to hear about different approaches and any advice
CodePudding user response:
Using tidyr::separate_rows()
:
library(tidyr)
library(dplyr)
df %>%
separate_rows(genre) %>%
count(genre)
# A tibble: 4 × 2
genre n
<chr> <int>
1 a 3
2 b 2
3 c 3
4 d 2
CodePudding user response:
We can first strsplit
the variable, then count
it:
df %>%
mutate(genre = strsplit(genre, ",")) %>%
unnest(genre) %>%
count(genre)
# # A tibble: 4 x 2
# genre n
# <chr> <int>
# 1 a 3
# 2 b 2
# 3 c 3
# 4 d 2
If you prefer, you can replace strsplit
with stringr::str_split
for the same results.