I have a data frame column that contains strings, which may include several spaces. I want to use separate from tidyr (or something similar) to split each string at the space after the first occurrence of a keyword (i.e., a value in fruit_key in the sample data), so that the one column becomes two columns.
Sample Data
df <- structure(list(fruit = c("Apple Orange Pineapple", "Plum Good Watermelon",
"Plum Good Kiwi", "Plum Good Plum Good", "Cantaloupe Melon", "Blueberry Blackberry Cobbler",
"Peach Pie Apple Pie")), class = "data.frame", row.names = c(NA,
-7L))
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")
Expected Output
                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
With separate, I can get the part after the keyword into the correct column (i.e., Tasty), but I cannot get the keyword itself returned in the other column (i.e., Delicious). I tried altering the regular expression in several ways, but could never get the correct output.
library(tidyr)

separate(df, fruit,
         c("Delicious", "Tasty"),
         sep = paste(fruit_key, collapse = "|"),
         extra = "merge",
         remove = FALSE
)
#                         fruit Delicious               Tasty
#1       Apple Orange Pineapple              Orange Pineapple
#2         Plum Good Watermelon                    Watermelon
#3               Plum Good Kiwi                          Kiwi
#4          Plum Good Plum Good                     Plum Good
#5             Cantaloupe Melon                         Melon
#6 Blueberry Blackberry Cobbler            Blackberry Cobbler
#7          Peach Pie Apple Pie                     Apple Pie
I know that I could use str_extract and str_remove (as below), but I want to use something like separate to do it in one function/step.
library(tidyverse)

df %>%
  mutate(Delicious = str_extract(fruit, paste(fruit_key, collapse = "|")),
         Tasty = str_remove(fruit, paste(fruit_key, collapse = "|")))
CodePudding user response:
Here's a tidy solution with tidyr's function extract:
library(tidyr)

df %>%
  extract(fruit,
          into = c("Delicious", "Tasty"),
          regex = paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)"),
          remove = FALSE)
                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
In extract's regex argument, we collapse fruit_key into an alternation pattern, which we wrap in parentheses so that it is recognized as a capturing group. The second capturing group is simply whatever follows the whitespace.
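To make the pattern concrete, here is a minimal base R sketch that builds the same regex and inspects its two capturing groups for one row of the sample data. The variable names pattern, x, and m are illustrative only, and regexec is used here just to show the groups; it is not what extract is claimed to call internally.

```r
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# Build the same pattern that is passed to extract()'s regex argument
pattern <- paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)")
cat(pattern, "\n")
# (Apple|Plum Good|Cantaloupe|Blueberry|Peach Pie)\s(.*)

# regexec() returns the full match followed by each capturing group
x <- "Peach Pie Apple Pie"
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

m[2]  # first group, the keyword: "Peach Pie"
m[3]  # second group, the remainder: "Apple Pie"
```

Because the keyword group comes first in the pattern, the match starts at the first keyword found in the string, and everything after the following whitespace lands in the second group.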
CodePudding user response:
If we need to use separate with sep, then create a regex lookbehind, "(?<=<fruit_key>) ", i.e. split at the space that follows the fruit_key word. Because sep is not vectorized, collapse the patterns into a single string with | (using str_c).
library(dplyr)
library(tidyr)
library(stringr)

df %>%
  separate(fruit, into = c("Delicious", "Tasty"),
           sep = str_c(sprintf("(?<=%s) ", fruit_key), collapse = "|"),
           extra = "merge", remove = FALSE)
-output
                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
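As a rough illustration of why the lookbehind keeps the keyword in the first column, the base R sketch below finds the first space that is preceded by a keyword and cuts the string there. It only mimics the idea on a single string; the names sep_pattern, first_space, keyword, and rest are hypothetical and not part of tidyr.

```r
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# One lookbehind per keyword, joined with "|"; only the space itself is matched
sep_pattern <- paste(sprintf("(?<=%s) ", fruit_key), collapse = "|")

x <- "Peach Pie Apple Pie"

# Position of the first space that directly follows a keyword
first_space <- as.integer(regexpr(sep_pattern, x, perl = TRUE))

# The keyword sits to the left of that space, the remainder to the right
keyword <- substr(x, 1, first_space - 1)
rest <- substr(x, first_space + 1, nchar(x))

keyword  # "Peach Pie"
rest     # "Apple Pie"
```

Since the lookbehind is zero-width, the separator that gets consumed is just the single space, which is why the keyword is not dropped the way it was when the keyword itself was used as sep.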