Home > other >  Spliting a String inside a column, but keeping it in a format usable for further dplyr analysis / R
Spliting a String inside a column, but keeping it in a format usable for further dplyr analysis / R

Time:12-09

I have got a dataframe that contains authors and their genre as well as other variables.

The problem I am facing is that the genre variable is a single character string like "fantasy, fiction, romance". Thereby I can not group the dataframe by genre as it groups for each combination of genres.

I have tried an in my opinion way to complex approach, where I used stringr::str_split(",") to split the strings up. The results was a list, which I transformed into factor-type. This gave me a list of factor that I used to redefine my genre column. The result is still not usable for my analysis. So I am wondering which other approach makes the most sense.

This is the minimized version of my dataset.

df <- data.frame(author= c("w","x","y","z"),
                 genre= c("a,b,c","a","c,d","a,b,c,d"))

What I am looking for is grouped genre variable, in an output like this:

> df
  genre n
1     a 3
2     b 2
3     c 3
4     d 2

What I get is:

df %>% group_by(genre) %>% summarise(n=n())
# A tibble: 4 x 2
  genre       n
  <chr>   <int>
1 a           1
2 a,b,c       1
3 a,b,c,d     1
4 c,d         1

As you can see, the strings are not split up.

What I tried was:

df <- data.frame(autor= c("w","x","y","z"),
                 genre= c("a,b,c","a","c,d","a,b,c,d"))
df$genre<-df$genre %>% map(str_split,",") %>% map(as.factor) %>% unlist()

which did not bring me any further, as I am still not able to group by genre, in a way that makes sense.

After group_by(genre) i get this output:

> df %>% group_by(genre) %>% summarise(n=n())
# A tibble: 4 x 2
  genre                               n
  <fct>                           <int>
1 "c(\"a\", \"b\", \"c\")"            1
2 "a"                                 1
3 "c(\"c\", \"d\")"                   1
4 "c(\"a\", \"b\", \"c\", \"d\")"     1

The problem does not seem to be to complex, but unfortunately I can not quite figure it out.

I am really looking forward to hear about different approaches and any advice

CodePudding user response:

Using tidyr::separate_rows():

library(tidyr)
library(dplyr)

df %>%
  separate_rows(genre) %>%
  count(genre)
# A tibble: 4 × 2
  genre     n
  <chr> <int>
1 a         3
2 b         2
3 c         3
4 d         2

CodePudding user response:

We can first strsplit the variable, then count it:

df %>%
  mutate(genre = strsplit(genre, ",")) %>%
  unnest(genre) %>%
  count(genre)
# # A tibble: 4 x 2
#   genre     n
#   <chr> <int>
# 1 a         3
# 2 b         2
# 3 c         3
# 4 d         2

If you prefer, you can replace strsplit with stringr::str_split for the same results.

  • Related