Find the original strings from the results of str_extract_all

Time:03-24

I'm still new to coding and often cannot work out better functions or solutions for some tasks by myself. I have a question about tracking the original string after using str_extract_all with a specific pattern.

Here is an example dataset called "fruit".

index  Fruit
1      apple
2      banana
3      strawberry
4      pineapple
5      bell pepper

I used str_extract_all(fruit, "(.)\\1") to extract doubled characters (all consonants here), and got "pp", "rr", "pp", "ll", "pp".

I also tracked the original strings (of those extracted results) with str_subset(fruit, "(.)\\1"). Here's what I got.

index  fruit
1      apple
2      strawberry
3      pineapple
4      bell pepper

However, I want to know which string "each" extracted result comes from. str_subset cannot show this when several results come from the same string. The following data frame is what I expect to get.

index  fruit        pattern
1      apple        pp
2      strawberry   rr
3      pineapple    pp
4      bell pepper  ll
4      bell pepper  pp

I'm not sure if I explained my question clearly. Any feedback and ideas will be appreciated.

CodePudding user response:

Your code already does what you want. You just need to create an extra column to store the output of str_extract_all, like the following:

library(tidyverse)

fruit %>%
  mutate(pattern = str_extract_all(Fruit, "(.)\\1")) %>%
  unnest(pattern)

# A tibble: 5 × 3
  index Fruit       pattern
  <int> <chr>       <chr>  
1     1 apple       pp     
2     3 strawberry  rr     
3     4 pineapple   pp     
4     5 bell pepper ll     
5     5 bell pepper pp   
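
For a self-contained version of the approach above, here is a minimal sketch. The construction of `fruit` is an assumption based on the table in the question, not the asker's actual object. The idea is that str_extract_all returns a list with one character vector per input string, and unnest expands that list column into one row per match, dropping rows with no match by default.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Assumed reconstruction of the example data from the question
fruit <- tibble(
  index = 1:5,
  Fruit = c("apple", "banana", "strawberry", "pineapple", "bell pepper")
)

result <- fruit %>%
  # list column: one character vector of matches per row
  mutate(pattern = str_extract_all(Fruit, "(.)\\1")) %>%
  # one row per match; rows with zero matches (banana) are dropped
  unnest(pattern)

print(result)
```

Because the original index column is carried along, each match stays tied to the row it came from, which is why "bell pepper" appears twice.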

CodePudding user response:

You may simply remove all rows where the pattern column contains NA after extracting:

library(stringr)
df$pattern <- stringr::str_extract(df$xfruit, "(.)\\1")
df <- df[!is.na(df$pattern),]

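The str_extract(df$xfruit, "(.)\\1") call extracts the first repeated character pair of each string into the pattern column. Then, df[!is.na(df$pattern),] removes the rows where the pattern is NA. Note that str_extract returns only the first match per string, so "bell pepper" yields just "ll" with this approach.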
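
To see how str_extract and str_extract_all differ on strings with more than one match, here is a minimal sketch using only stringr and base R. The vector `x` and the names `first_only`, `all_matches`, and `out` are hypothetical, chosen for illustration from the question's table.

```r
library(stringr)

# Assumed example vector taken from the question's table
x <- c("apple", "banana", "strawberry", "pineapple", "bell pepper")

# Character vector: first match per string, NA where nothing matches
first_only <- str_extract(x, "(.)\\1")

# List: every match per string, character(0) where nothing matches
all_matches <- str_extract_all(x, "(.)\\1")

# Repeat each source string once per match so it lines up
# with the flattened vector of matches
n_matches <- lengths(all_matches)
out <- data.frame(
  index   = rep(seq_along(x), n_matches),
  fruit   = rep(x, n_matches),
  pattern = unlist(all_matches)
)

print(first_only)
print(out)
```

Repeating by lengths(all_matches) gives strings without a match a count of zero, so they drop out without a separate NA filter.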