Separate words in a column that are separated by space and comma into different columns in R-CodePudding

I am manipulating a very large movie data set. The data is as below (example)

Ex:

title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")

movies<-data.frame(title, genre)

If you observe in the genre column, certain genres are comma separated and few are space separated. And the word SciFi has two different appearances: SciFi and Sci-Fi. This is my situation in the entire data set that has around 5000 movies.

I am stuck with an appropriate approach for the following results:

How to separate the genre of each movie into individual genres. Ex: I want to separate genre for Interstellar as:

genre1 = Adventure

genre2= Drama

genre3=Sci-Fi

I've used the following command:

movie_genres<-separate(movies, genre, into=c(genre1, genre2, genre3)

The above command is separating the word Sci-Fi as two genres (Sci and Fi or only Sci).

How to delete the hyphen (-) in the word Sci-Fi across the entire genre so that separate function works well.

Is there a work around that adds comma between genres (in genre column) and separate them by comma alone?

CodePudding user response：

I usually start by "cleaning" the data. In this case, I'd make the formatting of your genre column consistent (genres column separated, no trailing spaces, ...) and then use separate.

library(stringr)
library(tidyr)
title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")

movies<-data.frame(title, genre)
movies$genre <- str_replace_all(movies$genre, ",\\s ", ",") 
movies$genre <- str_replace_all(movies$genre, "\\s $", "") 
movies$genre <- str_replace_all(movies$genre, "\\s ", ",") 
movies$genre <- str_replace_all(movies$genre, "Sci-Fi", "SciFi")
movies$genre
#> [1] "Adventure,Drama,SciFi"  "Adventure,Comedy,SciFi" "Adventure,SciFi"       
#> [4] "Adventure,Drama,SciFi"
separate(movies, genre, into = c("genre1", "genre2", "genre3"))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [3].
#>                   title    genre1 genre2 genre3
#> 1          Interstellar Adventure  Drama  SciFi
#> 2    Back to the Future Adventure Comedy  SciFi
#> 3 2001: A Space Odyssey Adventure  SciFi   <NA>
#> 4           The Martian Adventure  Drama  SciFi

^{Created on 2023-01-31 by the reprex package (v2.0.1)}

CodePudding user response：

What about long format or list columns? Both would let you filter over genres while dealing with multiple misaligned genre columns isn't much fun. For example something like this:

library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
movies<-data.frame(title, genre)

# long format
movies_long <- movies %>% 
  mutate(id = row_number(), .before = 1, genre = str_remove_all(genre, "-") %>% str_squish()) %>%  
  separate_rows(genre, sep = ",? ")

movies_long
#> # A tibble: 11 × 3
#>       id title                 genre    
#>    <int> <chr>                 <chr>    
#>  1     1 Interstellar          Adventure
#>  2     1 Interstellar          Drama    
#>  3     1 Interstellar          SciFi    
#>  4     2 Back to the Future    Adventure
#>  5     2 Back to the Future    Comedy   
#>  6     2 Back to the Future    SciFi    
#>  7     3 2001: A Space Odyssey Adventure
#>  8     3 2001: A Space Odyssey SciFi    
#>  9     4 The Martian           Adventure
#> 10     4 The Martian           Drama    
#> 11     4 The Martian           SciFi

# filter by genre
movies_long %>% filter(genre == "Adventure")
#> # A tibble: 4 × 3
#>      id title                 genre    
#>   <int> <chr>                 <chr>    
#> 1     1 Interstellar          Adventure
#> 2     2 Back to the Future    Adventure
#> 3     3 2001: A Space Odyssey Adventure
#> 4     4 The Martian           Adventure

# list columns, genre column will be filled with lists of genres
movies_lst <- movies %>% 
  mutate(genre = str_remove_all(genre, "-") %>% str_squish() %>% str_split(",? ")) %>% 
  as_tibble()

movies_lst
#> # A tibble: 4 × 2
#>   title                 genre    
#>   <chr>                 <list>   
#> 1 Interstellar          <chr [3]>
#> 2 Back to the Future    <chr [3]>
#> 3 2001: A Space Odyssey <chr [2]>
#> 4 The Martian           <chr [3]>

# can be filtered with e.g. map_lgl and for output can be concatenated to a single string
movies_lst %>% filter(
  map_lgl(genre, ~ all(c("Drama", "SciFi") %in% .x))) %>% 
  mutate(genre = map_chr(genre, ~paste(.x , collapse = ", ")))
#> # A tibble: 2 × 2
#>   title        genre                  
#>   <chr>        <chr>                  
#> 1 Interstellar Adventure, Drama, SciFi
#> 2 The Martian  Adventure, Drama, SciFi

^{Created on 2023-01-31 with reprex v2.0.2}

CodePudding user response：

This approach lists genres separately in their columns and expands automatically when new genres occur.

library(dplyr)
library(tidyr)

movies %>% 
  mutate(genre = strsplit(genre, ", | ")) %>% 
  rowwise() %>% 
  mutate(genre = list(sub("-", "", genre))) %>% 
  unnest(genre) %>% 
  group_by(genre) %>% 
  mutate(grp = cur_group_id()) %>% 
  arrange(grp) %>% 
  pivot_wider(names_from=grp, names_prefix="genre_", values_from=genre)
# A tibble: 4 × 5
  title                 genre_1   genre_2 genre_3 genre_4
  <chr>                 <chr>     <chr>   <chr>   <chr>  
1 Interstellar          Adventure NA      Drama   SciFi  
2 Back to the Future    Adventure Comedy  NA      SciFi  
3 2001: A Space Odyssey Adventure NA      NA      SciFi  
4 The Martian           Adventure NA      Drama   SciFi