Home > Enterprise >  Stringr regex to work even when string is in a different order
Stringr regex to work even when string is in a different order

Time:09-23

I tried looking at a few of the other available Q/A's, but none of the ones I looked at seemed to work for me and/or there wasn't enough detail for me to understand how to implement the solutions. I'm not completely accustomed to working with regex, so I'm having a hard time coming up with a pattern. I have multiple text strings, some of which could be combined within the data and these combined strings could be placed in either order. It's a long data set and I will have to repeat a similar process for multiple columns being created within the data frame with the contents based partly on the TRUE/FALSE from the str_detect function, so efficiency is rather important.

Also important to note, as I have seen it mentioned in other answers, I know nothing about Python/Perl. I'm working in RStudio.

First, a similar and simpler data frame to my data.

SN <- c(1001, 1002, 1003, 1004)
fwd_fender <- c(1, 0, 1, 1)
note <- c("FWD FNDR DMG", 
          "MID BODY CHASS DMG", 
          "FWD FNDR EXCESS WEAR, MID BODY CHASS DMG", 
          "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH")
df <- data.frame(SN, fwd_fender, note)

which produced this data frame:

SN fwd_fender                                       note
1 1001          1                               FWD FNDR DMG
2 1002          0                         MID BODY CHASS DMG
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG
4 1004          1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH

What I am basically needing to do is to create another column that we'll call fwd_fender_mech_dmg where I can recode the observations based on the note column. In this example, both "FWD FNDR DMG" and "FWD FNDR EXCESS WEAR" count as "fwd_fender_mechanical_dmg". The other notes do not. So I want to end up producing a data frame that looks like the following:

SN fwd_fender                                       note fwd_fender_mech_dmg
1 1001          1                               FWD FNDR DMG                   1
2 1002          0                         MID BODY CHASS DMG                   0
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG                   1
4 1004          1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH                   0

I have lots of columns with lots of different variables, so I'm trying to use a regex as much as possible (ideally) in order to make the coding more efficient, but I'm not getting it to work quite right.

So here is a basic test sequence and pattern.

yes <- c("FWD FNDR DMG", "FWD FENDER EXCESS WEAR")
no <- c("MID BODY CHASS DMG", "FWD FNDR PAINT SCRATCH")
maybe <- c("FWD FNDR EXCESS WEAR, MID BODY CHASS DMG", "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH") 
s <- c(yes, no, maybe)
pattern <- "FE?ND.*(WEAR|DMG)"
str_detect(s,pattern,negate = FALSE)

which results in the following:

[1]  TRUE  TRUE FALSE FALSE  TRUE FALSE

This is an expected results But notice, if I switch the order of the last two entries in maybe, the code producing incorrect results.

yes <- c("FWD FNDR DMG", "FWD FENDER EXCESS WEAR")
no <- c("MID BODY CHASS DMG", "FWD FNDR PAINT SCRATCH")
maybe <- c("FWD FNDR EXCESS WEAR, MID BODY CHASS DMG", "FWD FNDR PAINT SCRATCH, MID BODY CHASS DMG") #Last Entry Reversed
s <- c(yes, no, maybe)
pattern <- "FE?ND.*(WEAR|DMG)"
str_detect(s,pattern,negate = FALSE)

which produces this result:

[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE

So, any ideas on how I can make this work?

Thanks!

CodePudding user response:

I'd recommend using the case_when function from the tidyverse:

df %>% 
  mutate(fwd_fender_mech_dmg = case_when(grepl("FWD FNDR DMG", note) ~ 1,
                                         grepl("FWD FNDR EXCESS WEAR", note) ~ 1,
                                         TRUE ~ 0))

Output:

    SN fwd_fender                                       note fwd_fender_mech_dmg
1 1001          1                               FWD FNDR DMG                   1
2 1002          0                         MID BODY CHASS DMG                   0
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG                   1
4 1004          1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH                   0

CodePudding user response:

The problem with your code is that you are considering 'xxx , yyy' as two character elements (like in a character vector), but it is actually one two-word character scalar.

If we want your current regex to work, we can first str_split the strings by the comma, then call str_detect on all substrings, and, finally, reduce the output to a single logical per row.

library(stringr)
library(purrr)
library(dplyr)

df %>% mutate(fwd_fender_mech_dmg= str_split(note, ',') %>%
                      map(~str_detect(.x, "FE?ND.*(WEAR|DMG)")%>%
                      reduce(`|`)))
    SN fwd_fender                                       note fwd_fender_mech_dmg
1 1001          1                               FWD FNDR DMG                   1
2 1002          0                         MID BODY CHASS DMG                   0
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG                   1
4 1004          1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH                   0

this is consistent with any order of the coma-separated substrings:

df2<-df%>%mutate(note=replace(note, 4, maybe[2]))

df2 %>% mutate(fwd_fender_mech_dmg = as.integer(str_split(note, ',') %>%
                      map(~str_detect(.x, "FE?ND.*(WEAR|DMG)")%>%
                      reduce(`|`))))

    SN fwd_fender                                       note fwd_fender_mech_dmg
1 1001          1                               FWD FNDR DMG                   1
2 1002          0                         MID BODY CHASS DMG                   0
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG                   1
4 1004          1 FWD FNDR PAINT SCRATCH, MID BODY CHASS DMG                   0

Some advice

It seems that your data is not "tidy". The "note" variable occasionally has two data values collapsed as a single char element.

It may make your life much easier in downstream analyses if you separate the data so that there will always be one observation per value.

For that, you may want to do like this:

library(dpplyr)
library(stringr)
library(tidyr)

df %>% tidyr::separate_rows(note, sep='\\s*,\\s*') %>% #this separates the rows
        mutate(fwd_fender_mech_dmg =  str_detect(note, "FE?ND.*(WEAR|DMG)"))

# A tibble: 6 x 4
     SN fwd_fender note                   fwd_fender_mech_dmg
  <dbl>      <dbl> <chr>                                <int>
1  1001          1 FWD FNDR DMG                             1
2  1002          0 MID BODY CHASS DMG                       0
3  1003          1 FWD FNDR EXCESS WEAR                     1
4  1003          1 MID BODY CHASS DMG                       0
5  1004          1 MID BODY CHASS DMG                       0
6  1004          1 FWD FNDR PAINT SCRATCH                   0
  • Related