String operations with stringr not working depending on vectorized/unvectorized call

Time:05-18

I'm struggling to understand why my code below works only when using rowwise() in combination with ifelse(). Or more precisely, I think I get why it works in that scenario, but not why it doesn't simply work with if_else().

What I'm doing is checking whether a row contains the word "infile" or "outfile" and whether it has a relative path (".."). If it contains "infile"/"outfile" but no relative path, then it has an absolute path ("C:"). In that case, I want to replace the user name with something else (here: "test").

Any ideas?

Data:

df <- structure(list(value = c("infile 'C:\\Users\\USER\\folder\\Data.sav'", 
"infile '..\\folder\\Data.sav'", "outfile '..\\folder\\Data.sav'", 
"test", "")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-5L))

user_name <- "test"

Code that works:

df |> 
  rowwise() |> 
  mutate(value = ifelse(str_detect(value, "infile|outfile") & !str_detect(value, "\\'\\.\\.\\\\"),
                        str_replace(value,
                                    str_sub(value,
                                            str_locate_all(value, "\\\\")[[1]][2] + 1,
                                            str_locate_all(value, "\\\\")[[1]][3] - 1),
                                    user_name),
                        value)) |> 
  ungroup()

with output:

# A tibble: 5 × 1
  value                                       
  <chr>                                       
1 "infile 'C:\\Users\\test\\folder\\Data.sav'"
2 "infile '..\\folder\\Data.sav'"             
3 "outfile '..\\folder\\Data.sav'"            
4 "test"                                      
5 ""   

Code that doesn't work (if_else() without rowwise()):

df |> 
  mutate(value = if_else(str_detect(value, "infile|outfile") & !str_detect(value, "\\'\\.\\.\\\\"),
                        str_replace(value,
                                    str_sub(value,
                                            str_locate_all(value, "\\\\")[[1]][2] + 1,
                                            str_locate_all(value, "\\\\")[[1]][3] - 1),
                                    user_name),
                        value))

This seems to run, but it gives warning messages:

Warning messages:
1: Problem while computing `value = if_else(...)`.
ℹ empty search patterns are not supported 
2: Problem while computing `value = if_else(...)`.
ℹ empty search patterns are not supported 

Code that doesn't work (if_else() with rowwise()):

df |> 
  rowwise() |>
  mutate(value = if_else(str_detect(value, "infile|outfile") & !str_detect(value, "\\'\\.\\.\\\\"),
                        str_replace(value,
                                    str_sub(value,
                                            str_locate_all(value, "\\\\")[[1]][2] + 1,
                                            str_locate_all(value, "\\\\")[[1]][3] - 1),
                                    user_name),
                        value)) |> 
  ungroup()

Error in `mutate()`:
! Problem while computing `value = if_else(...)`.
ℹ The error occurred in row 2.
Caused by error:
! Empty `pattern` not supported

CodePudding user response:

Here is one way (where my substitution of USER is very simple; not sure if it should be more generic):

df %>% 
    tidyr::separate(value, into = c('Type', 'Path'), sep = ' ') %>% 
    dplyr::mutate(
        Value = dplyr::if_else(
            (Type %in% c('infile', 'outfile')) & !startsWith(Path, "'.."),
            stringr::str_replace(Path, 'USER', user_name),
            Path
        )
    )

I split the value column to make the check easier.

If you need to replace the user name with the variable, you can do it like this (here with a back-reference in the regular expression):

df %>% 
    tidyr::separate(value, into = c('Type', 'Path'), sep = ' ') %>% 
    dplyr::mutate(
        Value = dplyr::if_else(
            (Type %in% c('infile', 'outfile')) & !startsWith(Path, "'.."),
            # Path still starts with the opening quote, so the pattern includes it
            sub("^('C:\\\\Users\\\\)([[:alnum:]]+)\\\\", paste0('\\1', user_name, '\\\\'), Path),
            Path
        )
    )
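
The back-reference substitution can be tried on a plain character vector before wiring it into mutate(). This is only a minimal sketch: it assumes the path still carries its opening quote and that the user name consists of alphanumeric characters only; the vector name paths is made up for illustration.

```r
user_name <- "test"

# Hypothetical input: one absolute and one relative path, quotes included
paths <- c("'C:\\Users\\USER\\folder\\Data.sav'",
           "'..\\folder\\Data.sav'")

# Group 1 keeps the prefix up to "Users\", group 2 matches the old user name.
# Strings that do not match the pattern are returned unchanged by sub().
replaced <- sub("^('C:\\\\Users\\\\)([[:alnum:]]+)\\\\",
                paste0("\\1", user_name, "\\\\"),
                paths)
print(replaced)
```

Because sub() leaves non-matching strings untouched, the relative path passes through as is.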

CodePudding user response:

Basically, the issue is that without rowwise(), str_locate_all() receives all 5 strings in df$value at once, and the [[1]] keeps only the match positions of the first string, so every row gets the same start and end indices. To debug, I'd suggest breaking the calculation out a bit:

df %>% rowwise() %>%
       mutate(n=length(value), slen=str_length(value),
              l1=str_locate_all(value,"\\\\")[[1]][2]+1,
              l2=str_locate_all(value,"\\\\")[[1]][3]-1, 
              ssub=str_sub(value, l1, l2), 
              detect=str_detect(value, "infile|outfile")& !str_detect(value,"\\'\\.\\.\\\\"), 
              vout=if_else(detect, ssub, user_name))
# A tibble: 5 × 8
# Rowwise: 
  value                                            n  slen    l1    l2 ssub   detect vout 
  <chr>                                        <int> <int> <dbl> <dbl> <chr>  <lgl>  <chr>
1 "infile 'C:\\Users\\USER\\folder\\Data.sav'"     1    38    18    21 "USER" TRUE   USER 
2 "infile '..\\folder\\Data.sav'"                  1    27    19    10 ""     FALSE  test 
3 "outfile '..\\folder\\Data.sav'"                 1    28    20    11 ""     FALSE  test 
4 "test"                                           1     4    NA    NA  NA    FALSE  test 
5 ""                                               1     0    NA    NA  NA    FALSE  test 

Without rowwise(), mutate() passes all the strings in the value column at once, and the same cut locations are applied to every single row:

df %>% 
       mutate(n=length(value), slen=str_length(value),
              l1=str_locate_all(value,"\\\\")[[1]][2]+1,
              l2=str_locate_all(value,"\\\\")[[1]][3]-1, 
              ssub=str_sub(value, l1, l2), 
              detect=str_detect(value, "infile|outfile")& !str_detect(value,"\\'\\.\\.\\\\"), 
              vout=if_else(detect, ssub, user_name))
# A tibble: 5 × 8
  value                                            n  slen    l1    l2 ssub    detect vout 
  <chr>                                        <int> <int> <dbl> <dbl> <chr>   <lgl>  <chr>
1 "infile 'C:\\Users\\USER\\folder\\Data.sav'"     5    38    18    21 "USER"  TRUE   USER 
2 "infile '..\\folder\\Data.sav'"                  5    27    18    21 "\\Dat" FALSE  test 
3 "outfile '..\\folder\\Data.sav'"                 5    28    18    21 "r\\Da" FALSE  test 
4 "test"                                           5     4    18    21 ""      FALSE  test 
5 ""                                               5     0    18    21 ""      FALSE  test 
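
The identical 18 and 21 in every row can be reproduced outside of mutate(). The following is a rough sketch, assuming stringr is installed; the names loc, first_only and per_string are made up for illustration. It shows that str_locate_all() returns one match matrix per input string and that [[1]] looks only at the first of them.

```r
library(stringr)

value <- c("infile 'C:\\Users\\USER\\folder\\Data.sav'",
           "infile '..\\folder\\Data.sav'")

# One matrix of backslash positions per input string
loc <- str_locate_all(value, "\\\\")

# [[1]] uses the first string only, whatever the length of `value`
first_only <- loc[[1]][2] + 1

# Per-string version: take the position from each matrix separately
per_string <- vapply(loc, function(m) m[2] + 1, numeric(1))

print(first_only)
print(per_string)
```

With rowwise(), value has length 1 in each step, so [[1]] happens to be the right element; that is why the per-row numbers differ in the first table.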

Once the cut locations are calculated incorrectly, str_sub() returns arbitrary or empty substrings that are then used as the pattern in str_replace(), so I think you are just lucky that an error was thrown at all.
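
As for why ifelse() works under rowwise() while if_else() does not: as far as I can tell, base ifelse() with a single FALSE test never evaluates its yes argument, whereas dplyr's if_else() evaluates both branches before choosing. So in row 2 the str_replace() call with an empty pattern is skipped by ifelse() but still run by if_else(). A small sketch to check this, assuming dplyr is installed; the helper noisy() and the counter calls are made up for illustration:

```r
library(dplyr)

calls <- 0

# Hypothetical helper: counts how often a branch is actually evaluated
noisy <- function(x) {
  calls <<- calls + 1
  x
}

# base::ifelse() with a single FALSE test should skip the `yes` branch
base_result <- ifelse(FALSE, noisy("yes"), "no")
calls_after_ifelse <- calls

# dplyr::if_else() should evaluate both branches before picking one
dplyr_result <- if_else(FALSE, noisy("yes"), "no")
calls_after_if_else <- calls

print(c(calls_after_ifelse, calls_after_if_else))
```

If the counter stays at 0 after ifelse() and goes to 1 after if_else(), that would explain the "Empty `pattern` not supported" error in row 2.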
