Nesting back into a vector in dplyr

I have a dataset with a column that holds multiple names separated by commas. I want to create new variables based on whether an observation contains a certain name. The problem is that the names have leading and trailing whitespace and extra spaces in the middle, so I want to remove those first with str_trim and str_squish. Here's an example:

library(tidyverse)
favs <- c("The Departed, The Green Mile,IT ,Spirit,The  Irishman",
            " Titanic, The Shawshank  Redemption ",
            "  The Godfather  ,Pulp  Fiction")
mov_data <- data.frame(favs)
# Creates a list-column of character vectors. This is how I want the end result to look after fixing the names.
mov_data <- mov_data %>% mutate(movies = str_split(favs, ","))
mov_data %>% unnest(cols = movies) %>% 
  mutate(movies = str_squish(str_trim(movies))) # fixes the names

The code above fixes the names, but when I try to nest them back into vectors using nest(), the names get nested into a tibble rather than into the vectors they originally came from. How do I nest the rows back into a column of vectors?

CodePudding user response：

The easiest approach is probably to do it all in one go (using the same functions as in your approach):

library(dplyr)
library(stringr)

mov_data |> 
  group_by(favs) |> 
  mutate(movies = list(str_squish(str_trim(str_split(favs, ",", simplify = TRUE))))) |>
  ungroup()

Output:

# A tibble: 3 × 2
  favs                                                    movies   
  <chr>                                                   <list>   
1 "The Departed, The Green Mile,IT ,Spirit,The  Irishman" <chr [5]>
2 " Titanic, The Shawshank  Redemption "                  <chr [2]>
3 "  The Godfather  ,Pulp  Fiction"                       <chr [2]>
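
As a side note, str_squish() on its own already strips leading and trailing whitespace in addition to collapsing interior runs, so the str_trim() call is optional. Grouping by favs also assumes every favs string is unique. A minimal sketch that sidesteps both points by mapping over the list that str_split() returns (the name `result` is only for illustration):

```r
library(dplyr)
library(stringr)
library(purrr)

favs <- c("The Departed, The Green Mile,IT ,Spirit,The  Irishman",
          " Titanic, The Shawshank  Redemption ",
          "  The Godfather  ,Pulp  Fiction")
mov_data <- data.frame(favs)

# str_split() returns a list with one character vector per row;
# str_squish() trims both ends and collapses interior whitespace.
result <- mov_data |>
  mutate(movies = map(str_split(favs, ","), str_squish))

result$movies[[1]]
# [1] "The Departed"   "The Green Mile" "IT"             "Spirit"         "The Irishman"
```

No grouping is needed here because the split is already done row by row.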

Or, to answer your question directly, I see three ways:

The most direct way is to skip the unnesting altogether and clean each list element in place with map:

library(purrr)
library(dplyr)
library(stringr)

mov_data |>
  mutate(movies = map(movies, ~ str_squish(str_trim(.))))

Output:

# A tibble: 3 × 2
  favs                                                    movies   
  <chr>                                                   <list>   
1 "The Departed, The Green Mile,IT ,Spirit,The  Irishman" <chr [5]>
2 " Titanic, The Shawshank  Redemption "                  <chr [2]>
3 "  The Godfather  ,Pulp  Fiction"                       <chr [2]>
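
Once movies is a list-column of clean character vectors, the indicator variables mentioned in the question can be built with map_lgl. This is only a sketch: the column names `likes_titanic` and `likes_it` are hypothetical, and it assumes exact matching on the cleaned names is what you want.

```r
library(dplyr)
library(stringr)
library(purrr)

favs <- c("The Departed, The Green Mile,IT ,Spirit,The  Irishman",
          " Titanic, The Shawshank  Redemption ",
          "  The Godfather  ,Pulp  Fiction")
mov_data <- data.frame(favs) |>
  mutate(movies = map(str_split(favs, ","), str_squish))

# One logical column per name of interest (hypothetical column names).
# %in% compares whole strings, so "IT" does not match "Titanic".
flags <- mov_data |>
  mutate(likes_titanic = map_lgl(movies, ~ "Titanic" %in% .x),
         likes_it      = map_lgl(movies, ~ "IT" %in% .x))

flags[, c("likes_titanic", "likes_it")]
#   likes_titanic likes_it
# 1         FALSE     TRUE
# 2          TRUE    FALSE
# 3         FALSE    FALSE
```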

Another way is to use summarize:

library(dplyr)
library(tidyr)   # needed for unnest()
library(stringr)

mov_data |>
  unnest(cols = c(movies)) |>
  mutate(movies = str_squish(str_trim(movies))) |>
  group_by(favs) |>
  summarise(movies = list(movies)) |>
  ungroup()

Output:

# A tibble: 3 × 2
  favs                                                    movies   
  <chr>                                                   <list>   
1 "  The Godfather  ,Pulp  Fiction"                       <chr [2]>
2 " Titanic, The Shawshank  Redemption "                  <chr [2]>
3 "The Departed, The Green Mile,IT ,Spirit,The  Irishman" <chr [5]>

The third way is to nest it, which you have already explored. It gives you a column of tibbles rather than vectors, which will be fine in most circumstances.

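library(dplyr)
library(tidyr)   # needed for unnest() and nest()
library(stringr)

mov_data |>
  unnest(cols = c(movies)) |>
  mutate(movies = str_squish(str_trim(movies))) |>
  group_by(favs) |>
  nest() |>
  ungroup()  # ungroup rather than unnest, so the nested tibbles are kept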

Output:

# A tibble: 3 × 2
  favs                                                    data            
  <chr>                                                   <list>          
1 "The Departed, The Green Mile,IT ,Spirit,The  Irishman" <tibble [5 × 1]>
2 " Titanic, The Shawshank  Redemption "                  <tibble [2 × 1]>
3 "  The Godfather  ,Pulp  Fiction"                       <tibble [2 × 1]>
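
If you do want vectors rather than tibbles after unnesting, two options come to mind. The sketch below assumes tidyr 1.0.0 or later: chop() collapses the rows straight back into a list-column of vectors, and alternatively the movies column can be pulled out of each nested tibble with map. The names `chopped` and `pulled` are just for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)

favs <- c("The Departed, The Green Mile,IT ,Spirit,The  Irishman",
          " Titanic, The Shawshank  Redemption ",
          "  The Godfather  ,Pulp  Fiction")
mov_data <- data.frame(favs) |>
  mutate(movies = str_split(favs, ","))

# Option 1: chop() groups by the remaining columns (here favs) and
# collects movies into one character vector per group.
chopped <- mov_data |>
  unnest(cols = movies) |>
  mutate(movies = str_squish(movies)) |>
  chop(movies)

# Option 2: nest() first, then extract the movies column from each tibble.
pulled <- mov_data |>
  unnest(cols = movies) |>
  mutate(movies = str_squish(movies)) |>
  nest(data = movies) |>
  mutate(movies = map(data, "movies")) |>
  select(-data)
```

Like the summarise approach, both treat identical favs strings as a single group, so add a row id first if duplicates are possible.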

CodePudding user response：

Here's another option if you want the output to look like your input, but with the extra whitespace corrected. This uses the textclean package.


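library(stringr)    # for str_split()
library(textclean)

# One character vector of raw names per element of favs
favs_list <- str_split(favs,',')

# Collapse whitespace runs, trim both ends, then re-join with ", "
corrected_terms <- lapply(favs_list, function(x){
  paste(sapply(x, function(y) trimws(textclean::replace_white(y)), USE.NAMES = FALSE), collapse = ', ')
})

unlist(corrected_terms)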

Output:

[1] "The Departed, The Green Mile, IT, Spirit, The Irishman" "Titanic, The Shawshank Redemption"                     
[3] "The Godfather, Pulp Fiction"
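
If you would rather not add a dependency, the same result can be sketched in base R alone. Here `clean_one` is a hypothetical helper name, and the regex `\\s+` stands in for what replace_white does above.

```r
favs <- c("The Departed, The Green Mile,IT ,Spirit,The  Irishman",
          " Titanic, The Shawshank  Redemption ",
          "  The Godfather  ,Pulp  Fiction")

# Hypothetical helper: split one string on commas, clean each name, re-join.
clean_one <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]
  parts <- gsub("\\s+", " ", trimws(parts))  # trim ends, collapse interior runs
  paste(parts, collapse = ", ")
}

corrected <- vapply(favs, clean_one, character(1), USE.NAMES = FALSE)
corrected
```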

CodePudding user response：

If the purpose of this is to create a second column called movies that is a list of trimmed character strings, then we can use separate_rows to convert to long form, perform any trimming you want, and then summarize the result into one vector per original row.

library(dplyr)
library(tidyr)

mov_data %>%
  mutate(no = row_number(), movies = favs) %>%
  separate_rows(movies, sep = ",") %>%
  mutate(movies = trimws(movies)) %>%  # trims both ends only; stringr::str_squish() would also collapse interior spaces
  group_by(no, favs) %>%
  summarize(movies = list(movies), .groups = "drop")

giving:

# A tibble: 3 × 3
     no favs                                                    movies   
  <int> <chr>                                                   <list>   
1     1 "The Departed, The Green Mile,IT ,Spirit,The  Irishman" <chr [5]>
2     2 " Titanic, The Shawshank  Redemption "                  <chr [2]>
3     3 "  The Godfather  ,Pulp  Fiction"                       <chr [2]>
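
A variant of the same idea, offered as a sketch: it assumes tidyr 1.3.0 or later, where separate_longer_delim is available as the newer counterpart of separate_rows, and it swaps trimws for str_squish so the doubled spaces inside names are collapsed too.

```r
library(dplyr)
library(tidyr)
library(stringr)

favs <- c("The Departed, The Green Mile,IT ,Spirit,The  Irishman",
          " Titanic, The Shawshank  Redemption ",
          "  The Godfather  ,Pulp  Fiction")
mov_data <- data.frame(favs)

result <- mov_data |>
  mutate(no = row_number(), movies = favs) |>    # row id keeps duplicates apart
  separate_longer_delim(movies, delim = ",") |>  # one row per name
  mutate(movies = str_squish(movies)) |>         # trim and collapse whitespace
  group_by(no, favs) |>
  summarise(movies = list(movies), .groups = "drop")

result$movies[[3]]
# [1] "The Godfather" "Pulp Fiction"
```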