I am manipulating a very large movie data set. The data is as below (example)
Ex:
title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
movies<-data.frame(title, genre)
If you observe in the genre column, certain genres are comma separated and few are space separated. And the word SciFi has two different appearances: SciFi and Sci-Fi. This is my situation in the entire data set that has around 5000 movies.
I am stuck with an appropriate approach for the following results:
- How to separate the genre of each movie into individual genres. Ex: I want to separate genre for Interstellar as:
genre1 = Adventure
genre2= Drama
genre3=Sci-Fi
I've used the following command:
movie_genres<-separate(movies, genre, into=c(genre1, genre2, genre3)
The above command is separating the word Sci-Fi as two genres (Sci and Fi or only Sci).
- How to delete the hyphen (-) in the word Sci-Fi across the entire genre so that separate function works well.
OR
- Is there a work around that adds comma between genres (in genre column) and separate them by comma alone?
CodePudding user response:
I usually start by "cleaning" the data. In this case, I'd make the formatting of your genre column consistent (genres column separated, no trailing spaces, ...) and then use separate.
library(stringr)
library(tidyr)
title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
movies<-data.frame(title, genre)
movies$genre <- str_replace_all(movies$genre, ",\\s ", ",")
movies$genre <- str_replace_all(movies$genre, "\\s $", "")
movies$genre <- str_replace_all(movies$genre, "\\s ", ",")
movies$genre <- str_replace_all(movies$genre, "Sci-Fi", "SciFi")
movies$genre
#> [1] "Adventure,Drama,SciFi" "Adventure,Comedy,SciFi" "Adventure,SciFi"
#> [4] "Adventure,Drama,SciFi"
separate(movies, genre, into = c("genre1", "genre2", "genre3"))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [3].
#> title genre1 genre2 genre3
#> 1 Interstellar Adventure Drama SciFi
#> 2 Back to the Future Adventure Comedy SciFi
#> 3 2001: A Space Odyssey Adventure SciFi <NA>
#> 4 The Martian Adventure Drama SciFi
Created on 2023-01-31 by the reprex package (v2.0.1)
CodePudding user response:
What about long format or list columns? Both would let you filter over genres while dealing with multiple misaligned genre columns isn't much fun. For example something like this:
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
movies<-data.frame(title, genre)
# long format
movies_long <- movies %>%
mutate(id = row_number(), .before = 1, genre = str_remove_all(genre, "-") %>% str_squish()) %>%
separate_rows(genre, sep = ",? ")
movies_long
#> # A tibble: 11 × 3
#> id title genre
#> <int> <chr> <chr>
#> 1 1 Interstellar Adventure
#> 2 1 Interstellar Drama
#> 3 1 Interstellar SciFi
#> 4 2 Back to the Future Adventure
#> 5 2 Back to the Future Comedy
#> 6 2 Back to the Future SciFi
#> 7 3 2001: A Space Odyssey Adventure
#> 8 3 2001: A Space Odyssey SciFi
#> 9 4 The Martian Adventure
#> 10 4 The Martian Drama
#> 11 4 The Martian SciFi
# filter by genre
movies_long %>% filter(genre == "Adventure")
#> # A tibble: 4 × 3
#> id title genre
#> <int> <chr> <chr>
#> 1 1 Interstellar Adventure
#> 2 2 Back to the Future Adventure
#> 3 3 2001: A Space Odyssey Adventure
#> 4 4 The Martian Adventure
# list columns, genre column will be filled with lists of genres
movies_lst <- movies %>%
mutate(genre = str_remove_all(genre, "-") %>% str_squish() %>% str_split(",? ")) %>%
as_tibble()
movies_lst
#> # A tibble: 4 × 2
#> title genre
#> <chr> <list>
#> 1 Interstellar <chr [3]>
#> 2 Back to the Future <chr [3]>
#> 3 2001: A Space Odyssey <chr [2]>
#> 4 The Martian <chr [3]>
# can be filtered with e.g. map_lgl and for output can be concatenated to a single string
movies_lst %>% filter(
map_lgl(genre, ~ all(c("Drama", "SciFi") %in% .x))) %>%
mutate(genre = map_chr(genre, ~paste(.x , collapse = ", ")))
#> # A tibble: 2 × 2
#> title genre
#> <chr> <chr>
#> 1 Interstellar Adventure, Drama, SciFi
#> 2 The Martian Adventure, Drama, SciFi
Created on 2023-01-31 with reprex v2.0.2
CodePudding user response:
This approach lists genres separately in their columns and expands automatically when new genres occur.
library(dplyr)
library(tidyr)
movies %>%
mutate(genre = strsplit(genre, ", | ")) %>%
rowwise() %>%
mutate(genre = list(sub("-", "", genre))) %>%
unnest(genre) %>%
group_by(genre) %>%
mutate(grp = cur_group_id()) %>%
arrange(grp) %>%
pivot_wider(names_from=grp, names_prefix="genre_", values_from=genre)
# A tibble: 4 × 5
title genre_1 genre_2 genre_3 genre_4
<chr> <chr> <chr> <chr> <chr>
1 Interstellar Adventure NA Drama SciFi
2 Back to the Future Adventure Comedy NA SciFi
3 2001: A Space Odyssey Adventure NA NA SciFi
4 The Martian Adventure NA Drama SciFi