Home > Mobile >  Extracting rows based on more than two partial strings that must all be part of the string
Extracting rows based on more than two partial strings that must all be part of the string

Time:10-22

I want to extract rows that must contain two or more partial strings. For example, suppose I have the following data.table

df <- data.table(player=c('A', 'B', 'C', 'D', 'E', 'F', 'G'),
                 position=c('S Guard', 'P Guaaard', 'P Guard', 'S Forward',
                            'P Forward', 'Center', 'Center'),
                 points=c(28, 17, 19, 14, 23, 26, 5))

I know I can extract rows with either of the partial strings Gua or rd by using

df[grep("Gua|rd", df$position), ]

What if I want the the string to contain both Gua and rd? Tried

df[grep("Gua&rd", df$position), ]

both with and without space but doesn't work.

EDIT WITH MORE INFO

I'm actually dealing with relatively large datasets and I am using strings in one to extract data from another, something like this:

df1 <- data.frame(player=c('A', 'B', 'C', 'D', 'E', 'F', 'G'),
                 position1=c('S Guard', 'P Guard', 'P Guard', 'S Forward',
                            'P Forward', 'Center', 'Center'),
                 position2=c('S Grd', 'P Guard', 'P rdGua', 'S Forward',
                             'P Forward', 'Center', 'Center'))

df2 <- data.table(player=c('A', 'B', 'C', 'D', 'E', 'F', 'G'),
                 points=c(28, 17, 19, 14, 23, 26, 5))

df2[df1[,3] %like% "Gua|rd",c(1,2)]

The order therefore shouldn't matter i.e Guard and rdGua should return true if using & in place of |.

CodePudding user response:

If you replace | with .* you get the desired result. In stead of grep (or grepl), you can also use like from .

.* indicates that every character is allowed in between "Gua" and "rd".

Two options:

# option 1
df[grep("Gua.*rd", position)]

# option 2
df[like(position, "Gua.*rd")]

both give:

   player  position points
1:      A   S Guard     28
2:      B P Guaaard     17
3:      C   P Guard     19

Because you are using the -package, you don't need the df$ part in between the square brackets.

CodePudding user response:

This is probably an overkill but if you don't know the order and want to avoid having multiple grep() calls you could do:

# Helper function
grepAnd <- function(vect, strings) {
  grep(
    do.call(sprintf, c(strrep('(?=.*%s)', length(strings)), as.list(strings))), 
    vect, 
    perl = TRUE
  )
}

inp <- c('Gua', 'rd')
df[grepAnd(position, inp)]

#    player  position points
# 1:      A   S Guard     28
# 2:      B P Guaaard     17
# 3:      C   P Guard     19

# Works even though the arguments are reversed
inp <- rev(inp)
df[grepAnd(position, inp)]
#    player  position points
# 1:      A   S Guard     28
# 2:      B P Guaaard     17
# 3:      C   P Guard     19


# Can search for more than two patterns
inp <- c(inp, 'P')
df[grepAnd(position, inp)]
#    player  position points
# 1:      B P Guaaard     17
# 2:      C   P Guard     19
  • Related