Separate column on a space after keyword

I have a dataframe column that has a string, which may include several spaces. I want to use separate from tidyr (or something similar) on the space after the first time a keyword (i.e., fruit_key in the sample data) appears, so that I separate the one column into two columns.

Sample Data

df <- structure(list(fruit = c("Apple Orange Pineapple", "Plum Good Watermelon", 
"Plum Good Kiwi", "Plum Good Plum Good", "Cantaloupe Melon", "Blueberry Blackberry Cobbler", 
"Peach Pie Apple Pie")), class = "data.frame", row.names = c(NA, 
-7L))

fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

Expected Output

                         fruit   Delicious                Tasty
1       Apple Orange Pineapple       Apple     Orange Pineapple
2         Plum Good Watermelon   Plum Good           Watermelon
3               Plum Good Kiwi   Plum Good                 Kiwi
4          Plum Good Plum Good   Plum Good            Plum Good
5             Cantaloupe Melon  Cantaloupe                Melon
6 Blueberry Blackberry Cobbler   Blueberry   Blackberry Cobbler
7          Peach Pie Apple Pie   Peach Pie            Apple Pie

I can get the part after the keyword into the correct column (i.e., Tasty) with separate, but cannot get the keyword itself to return in the other column (i.e., Delicious). I tried altering the regular expression in several ways, but could never get the correct output.

library(tidyr)

separate(df, fruit,
 c("Delicious", "Tasty"),
 sep = paste(fruit_key, collapse = "|"),
 extra = "merge",
 remove = FALSE
)

#                         fruit Delicious               Tasty
#1       Apple Orange Pineapple              Orange Pineapple
#2         Plum Good Watermelon                    Watermelon
#3               Plum Good Kiwi                          Kiwi
#4          Plum Good Plum Good                     Plum Good
#5             Cantaloupe Melon                         Melon
#6 Blueberry Blackberry Cobbler            Blackberry Cobbler
#7          Peach Pie Apple Pie                     Apple Pie

I know that I could use str_extract and str_remove (like below), but want to use something like separate to do it in one function/step.

library(tidyverse)

key_pattern <- paste(fruit_key, collapse = "|")

df %>%
  mutate(Delicious = str_extract(fruit, key_pattern),
         # str_remove drops only the first match; trim the space it leaves behind
         Tasty = str_trim(str_remove(fruit, key_pattern)))

CodePudding user response：

Here's a tidy solution with tidyr's function extract:

library(tidyr)
df %>%
  extract(fruit,
          into = c("Delicious", "Tasty"),
          regex = paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)"),
          remove = FALSE)
                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie

In extract's regex argument, we collapse fruit_key into an alternation pattern and wrap it in parentheses so that it is treated as a capturing group. The second capturing group is simply whatever follows the whitespace.
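
To see what extract actually receives, it can help to print the assembled pattern. This is a minimal sketch, assuming the fruit_key vector from the question; str_match from stringr is used here only as a quick way to look at the two capturing groups and is not part of the answer above.

```r
library(stringr)

fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# Group 1: any one keyword; then one whitespace character; group 2: the rest
pattern <- paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)")
cat(pattern)
# (Apple|Plum Good|Cantaloupe|Blueberry|Peach Pie)\s(.*)

# Column 1 is the full match; columns 2 and 3 are the two capturing groups
m <- str_match("Plum Good Plum Good", pattern)
m[, 2]  # "Plum Good"
m[, 3]  # "Plum Good"
```

If a keyword contained regex metacharacters (a dot or parentheses, say), it would presumably need escaping before being pasted into the pattern; the sample keys contain none.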

CodePudding user response：

If we need to use separate with sep, create a regex lookbehind, "(?<=<fruit_key>) ", i.e. split at the space that follows the fruit_key word. As sep is not vectorized, collapse the per-keyword patterns into a single string with | (str_c).

library(dplyr)
library(tidyr)
library(stringr)
df %>%
  separate(fruit, into = c("Delicious", "Tasty"),
           sep = str_c(sprintf("(?<=%s) ", fruit_key), collapse = "|"),
           extra = "merge", remove = FALSE)

-output

                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
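
The sep pattern used with separate can be inspected by printing it. This is a minimal sketch, assuming the question's fruit_key and using base R's paste in place of str_c (a substitution made only to avoid extra packages). gregexpr with perl = TRUE is a rough stand-in for showing where the pattern matches, not a description of what separate does internally.

```r
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# One lookbehind per keyword: match a space only when a keyword precedes it
sep_pattern <- paste(sprintf("(?<=%s) ", fruit_key), collapse = "|")
cat(sep_pattern)

# Character positions of every matching space in one sample string
hits <- as.integer(gregexpr(sep_pattern, "Peach Pie Apple Pie", perl = TRUE)[[1]])
hits
# 10 16
```

The sample string has two candidate split points: after "Peach Pie" and after "Apple". With extra = "merge" and two names in into, separate splits only at the first one, which is why the row comes out as "Peach Pie" and "Apple Pie".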