I have a dataset in which one column holds multiple names separated by commas. I want to create new variables indicating whether a given observation contains a given name. The problem is that the names have leading and trailing whitespace and extra spaces in the middle, so I want to remove those first with str_trim and str_squish. Here's an example:
library(tidyverse)
favs <- c("The Departed, The Green Mile,IT ,Spirit,The Irishman",
" Titanic, The Shawshank Redemption ",
" The Godfather ,Pulp Fiction")
mov_data <- data.frame(favs)
mov_data <- mov_data %>% mutate(movies = str_split(favs, ",")) # Creates a column of vectors. This is how I want the end result to look after fixing the names.
mov_data %>% unnest(cols = movies) %>%
mutate(movies = str_squish(str_trim(movies))) # fixes the names
The code above fixes the names, but when I try to nest them back into a vector using nest(), the names get nested into a tibble and not into the vectors they originally came from. How do I nest the rows back into a column of vectors?
CodePudding user response:
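The easiest option is probably to do it all in one go, reusing your approach. Note that str_squish() already strips leading and trailing whitespace, so the str_trim() call is redundant but harmless. Grouping by favs treats each distinct string as its own group, so this assumes the favs values are unique: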
library(dplyr)
library(stringr)
mov_data |>
group_by(favs) |>
mutate(movies = list(str_squish(str_trim(str_split(favs, ",", simplify = TRUE))))) |>
ungroup()
Output:
# A tibble: 3 × 2
favs movies
<chr> <list>
1 "The Departed, The Green Mile,IT ,Spirit,The Irishman" <chr [5]>
2 " Titanic, The Shawshank Redemption " <chr [2]>
3 " The Godfather ,Pulp Fiction" <chr [2]>
Or, to answer your question directly, I see three ways:
1)
The most direct way would be to avoid the unnesting using a map:
library(purrr)
library(dplyr)
library(stringr)
mov_data |>
mutate(movies = map(movies, ~ str_squish(str_trim(.))))
Output:
# A tibble: 3 × 2
favs movies
<chr> <list>
1 "The Departed, The Green Mile,IT ,Spirit,The Irishman" <chr [5]>
2 " Titanic, The Shawshank Redemption " <chr [2]>
3 " The Godfather ,Pulp Fiction" <chr [2]>
2)
Another way is to use summarise() after unnesting (note that the rows come back sorted by favs):
library(dplyr)
library(stringr)
library(tidyr) # for unnest()
mov_data |>
unnest(cols = c(movies)) |>
mutate(movies = str_squish(str_trim(movies))) |>
group_by(favs) |>
summarise(movies = list(movies)) |>
ungroup()
Output:
# A tibble: 3 × 2
favs movies
<chr> <list>
1 " The Godfather ,Pulp Fiction" <chr [2]>
2 " Titanic, The Shawshank Redemption " <chr [2]>
3 "The Departed, The Green Mile,IT ,Spirit,The Irishman" <chr [5]>
3)
The third way is to nest() it again, which you've already explored; it gives you a tibble, which will be fine in most circumstances.
library(dplyr)
library(stringr)
library(tidyr) # for nest() and unnest()
mov_data |>
unnest(cols = c(movies)) |>
mutate(movies = str_squish(str_trim(movies))) |>
group_by(favs) |>
nest() |>
ungroup() # drop the grouping; a trailing unnest() here would undo the nest()
Output:
# A tibble: 3 × 2
favs data
<chr> <list>
1 "The Departed, The Green Mile,IT ,Spirit,The Irishman" <tibble [5 × 1]>
2 " Titanic, The Shawshank Redemption " <tibble [2 × 1]>
3 " The Godfather ,Pulp Fiction" <tibble [2 × 1]>
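Once movies is a list column of cleaned character vectors, the step the question is ultimately after (a new variable per name of interest) could be sketched roughly as follows. This is base R only, the list column is typed in by hand so the snippet runs on its own, and has_movie(), likes_titanic and likes_it are hypothetical names chosen for illustration, not anything from the question:

```r
# Cleaned list column, as any of the approaches above should produce it
# (entered by hand here so the sketch is self-contained).
mov_data <- data.frame(id = 1:3)
mov_data$movies <- list(
  c("The Departed", "The Green Mile", "IT", "Spirit", "The Irishman"),
  c("Titanic", "The Shawshank Redemption"),
  c("The Godfather", "Pulp Fiction")
)

# has_movie() is a hypothetical helper: for each row's character vector,
# check whether `title` is one of its elements (exact match).
has_movie <- function(movie_list, title) {
  vapply(movie_list, function(x) title %in% x, logical(1))
}

# One logical indicator column per title of interest.
mov_data$likes_titanic <- has_movie(mov_data$movies, "Titanic")
mov_data$likes_it      <- has_movie(mov_data$movies, "IT")
```

Because %in% compares whole strings, the match on "IT" only works after the whitespace cleanup; the raw value "IT " from the original column would not match.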
CodePudding user response:
Here's another option if you want output shaped like your input, but with the extra whitespace corrected. It uses the textclean package.
library(textclean)
library(stringr) # for str_split()
favs_list <- str_split(favs, ',')
corrected_terms <- lapply(favs_list, function(x){
paste(sapply(x, function(y) trimws(textclean::replace_white(y)), USE.NAMES = FALSE), collapse = ', ')
})
unlist(corrected_terms)
Output:
[1] "The Departed, The Green Mile, IT, Spirit, The Irishman" "Titanic, The Shawshank Redemption"
[3] "The Godfather, Pulp Fiction"
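If adding textclean as a dependency is not desirable, the same idea can probably be approximated in base R. The sketch below assumes that collapsing every run of whitespace with gsub() is an acceptable stand-in for replace_white(); clean_names() is a hypothetical helper, and the data is a variant of the question's vector with a few extra internal spaces added for illustration:

```r
# Variant of the question's data; extra internal spaces added for illustration.
favs <- c("The Departed, The  Green   Mile,IT ,Spirit,The Irishman",
          " Titanic, The Shawshank    Redemption ",
          " The Godfather ,Pulp Fiction")

# clean_names() is a hypothetical helper: trim both ends, then collapse
# every run of whitespace to a single space.
clean_names <- function(x) gsub("\\s+", " ", trimws(x))

# Split each string on commas, clean the pieces, and glue them back together.
corrected <- vapply(
  strsplit(favs, ",", fixed = TRUE),
  function(x) paste(clean_names(x), collapse = ", "),
  character(1)
)
corrected
```

This returns one cleaned string per input row, in the same order as favs.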
CodePudding user response:
If the goal is a second column, movies, that holds a list of trimmed character strings, then we can use separate_rows to convert to long form, perform whatever trimming you want, and then summarize the result back into one vector per row.
library(dplyr)
library(tidyr)
mov_data %>%
mutate(no = row_number(), movies = favs) %>%
separate_rows(movies, sep = ",") %>%
mutate(movies = trimws(movies)) %>%
group_by(no, favs) %>%
summarize(movies = list(movies), .groups = "drop")
giving:
# A tibble: 3 × 3
no favs movies
<int> <chr> <list>
1 1 "The Departed, The Green Mile,IT ,Spirit,The Irishman" <chr [5]>
2 2 " Titanic, The Shawshank Redemption " <chr [2]>
3 3 " The Godfather ,Pulp Fiction" <chr [2]>
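One thing to keep in mind with the pipeline above: trimws() only strips whitespace at the two ends of a string. If the internal runs of spaces mentioned in the question also need collapsing, an extra step is needed. A minimal base R illustration, using a made-up string rather than the question's data:

```r
x <- "  The   Green    Mile "   # made-up example string

ends_only <- trimws(x)                      # strips leading/trailing whitespace only
squished  <- gsub("\\s+", " ", trimws(x))   # also collapses internal runs to one space

ends_only
squished
```

Swapping trimws(movies) for stringr::str_squish(movies) inside the mutate() call should have the same effect as the second line, since str_squish() both trims the ends and collapses internal whitespace.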