R: Dropping columns with names containing a substring anywhere except the start using regular expres

Time:07-15

I am trying to use dplyr to drop columns from a data.frame where the column name contains a substring anywhere except the start of the name (i.e. any index other than the first).

After looking around (pun intended), it appears that this is usually accomplished by including a lookbehind assertion in the regular expression I pass to dplyr::matches() within the dplyr::select() call. I'm not familiar with how these work, but my attempt at implementing this below throws an error.

Am I incorrectly implementing a lookbehind or is this a limitation of the regular expressions I can pass to matches()? I welcome a working solution.

library(dplyr)

# Example data
df <- data.frame(bar = rnorm(1),
                 foo1 = rnorm(1),
                 bar_foo1 = rnorm(1),
                 bar_foo1_bat = rnorm(1))

# Desired output
df %>% select(bar, foo1)
#>        bar       foo1
#> 1 1.057651 -0.1526598

# Successfully drops columns with "foo1" anywhere
df %>% select(-matches(".*foo1.*"))
#>        bar
#> 1 1.057651

# Both fail to drop columns with "foo1" anywhere *except the start of the string*

df %>% select(-matches("(?<!^).*foo1.*"))
#> Warning in grep(needle, haystack, ...): TRE pattern compilation error 'Invalid
#> regexp'
#> Error in `select()`:
#> ! invalid regular expression '(?<!^).*foo1.*', reason 'Invalid regexp'

df %>% select(-matches("(?<!^)foo1.*"))
#> Warning in grep(needle, haystack, ...): TRE pattern compilation error 'Invalid
#> regexp'
#> Error in `select()`:
#> ! invalid regular expression '(?<!^)foo1.*', reason 'Invalid regexp'

Created on 2022-07-14 by the reprex package (v2.0.1)

CodePudding user response:

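At least one character has to come before foo1, so anchor the pattern at the start of the name and require one or more characters there, e.g. ^.{1,} (equivalent to ^.+).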

df %>% dplyr::select(-matches("^.{1,}foo1"))
#         bar       foo1
# 1 -1.077056 -0.5649875
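
As for why the original attempt failed: the lookbehind itself is not the problem. R's default regex engine (TRE) does not support lookaround assertions, which is where the "Invalid regexp" error comes from. In recent versions of dplyr/tidyselect, matches() accepts a perl argument that switches to PCRE. A minimal sketch, assuming that argument is available in your installed version and reusing the question's example data:

```r
library(dplyr)

# Same column layout as the example data in the question
df <- data.frame(bar = rnorm(1),
                 foo1 = rnorm(1),
                 bar_foo1 = rnorm(1),
                 bar_foo1_bat = rnorm(1))

# perl = TRUE makes matches() use PCRE, which understands lookbehinds.
# (?<!^)foo1 matches "foo1" only when it does not sit at the start of the name.
res_lookbehind <- df %>% select(-matches("(?<!^)foo1", perl = TRUE))

# Expected to leave only the columns bar and foo1
names(res_lookbehind)
```

The surrounding .* in the question's patterns is not needed, because matches() already looks for the pattern anywhere in the name.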

CodePudding user response:

df %>% select(-matches('^.+foo1'))
#>        bar       foo1
#> 1 1.806521 -0.9380235
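
To check a pattern like this before handing it to select(), it can be tried directly on the column names with base R's grepl(). This is only a sketch: the names vector is copied from the question's example data, and note that grepl() is case-sensitive by default while matches() ignores case by default.

```r
# Column names from the question's example data
nms <- c("bar", "foo1", "bar_foo1", "bar_foo1_bat")

# "^.+foo1": at least one character must come before "foo1"
hits <- grepl("^.+foo1", nms)

# Names the pattern does NOT match, i.e. the columns that would be kept
kept <- nms[!hits]
kept
```

Here kept should contain only bar and foo1, since foo1 is the one name where the substring sits at the very start.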