I'm still new to coding and often can't work out the right function for a task by myself.
I have a question about tracking the original string after using str_extract_all for a specific pattern.
Here is an example data frame called "fruit".
index | Fruit |
---|---|
1 | apple |
2 | banana |
3 | strawberry |
4 | pineapple |
5 | bell pepper |
I used str_extract_all(fruit, "(.)\\1") to extract doubled consonants, and got "pp", "rr", "pp", "ll", "pp".
I also retrieved the original strings that contain a match with str_subset(fruit, "(.)\\1"). Here's what I get.
index | fruit |
---|---|
1 | apple |
2 | strawberry |
3 | pineapple |
4 | bell pepper |
However, I want to know which string each extracted result comes from. str_subset returns every matching string only once, so it cannot distinguish several results that come from the same string. The following data frame is what I expect to get.
index | fruit | pattern |
---|---|---|
1 | apple | pp |
2 | strawberry | rr |
3 | pineapple | pp |
4 | bell pepper | ll |
4 | bell pepper | pp |
I'm not sure whether I've explained my question clearly. Any feedback and ideas would be appreciated.
CodePudding user response:
Your code already does what you want. You just need to create an extra column to store the output of str_extract_all, like the following:
library(tidyverse)
fruit %>% mutate(pattern = str_extract_all(Fruit, "(.)\\1")) %>% unnest(pattern)
# A tibble: 5 × 3
index Fruit pattern
<int> <chr> <chr>
1 1 apple pp
2 3 strawberry rr
3 4 pineapple pp
4 5 bell pepper ll
5 5 bell pepper pp
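For anyone who wants to run this end to end, here is a minimal, self-contained sketch. The tibble below is a reconstruction of the example data from the question (the original object was not posted, so its exact construction is an assumption), and `result` is just an illustrative name.

```r
library(tidyverse)

# Assumed reconstruction of the question's example data
fruit <- tibble(
  index = 1:5,
  Fruit = c("apple", "banana", "strawberry", "pineapple", "bell pepper")
)

result <- fruit %>%
  # str_extract_all() returns a list: one character vector of matches per row
  mutate(pattern = str_extract_all(Fruit, "(.)\\1")) %>%
  # unnest() expands the list-column to one row per match;
  # rows with zero matches (here "banana") are dropped by default
  unnest(pattern)

print(result)
```

Because the list-column keeps every match next to its source row, "bell pepper" ends up on two rows, one for each doubled letter.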
CodePudding user response:
You can simply remove all rows where the pattern column contains NA after extracting:
library(stringr)
df$pattern <- stringr::str_extract(df$xfruit, "(.)\\1")
df <- df[!is.na(df$pattern),]
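str_extract(df$xfruit, "(.)\\1") extracts the first repeated character pair of each string into the pattern column and gives NA where there is no match. Then df[!is.na(df$pattern),] removes the rows where pattern is NA. Note that str_extract returns only the first match per string, so "bell pepper" yields just "ll"; str_extract_all is needed if every match should be kept.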
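A minimal sketch comparing the two stringr functions on a string with two doubled letters; the variable names are only illustrative:

```r
library(stringr)

x <- "bell pepper"

# str_extract() returns a single string per input: the first match only
first_only <- str_extract(x, "(.)\\1")

# str_extract_all() returns a list; element 1 holds every match for x
all_matches <- str_extract_all(x, "(.)\\1")[[1]]

print(first_only)
print(all_matches)
```

So the NA-filtering approach is enough when one match per string is acceptable, while the list-column approach is the one that yields one row per match.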