data.table text filtering R

Time:12-08

I am trying to filter a text column of a data.table and am looking for an equivalent of dplyr::filter (I am using data.table for efficiency reasons).

However, my data.table filter only returns rows where the string matches exactly. In contrast, dplyr::filter returns every row where the pattern is found, even when it is only part of the string.

See below for an example.

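library(data.table)

df <- data.frame(first  = c("value_1 and value_2", "value_2", "value_1", "value_1"),
                 second = c(1, 2, 3, 4))

# data.table: %in% keeps only rows whose string equals "value_1" exactly
dt.output <- setDT(df)[first %in% c("value_1")]
# dplyr: grepl() keeps every row whose string contains "value_1"
filter.output <- dplyr::filter(df, grepl("value_1", first))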

dt.output returns only the rows whose value is exactly value_1 (rows 3 and 4). filter.output returns every row that contains value_1 (rows 1, 3, and 4).

Is it possible to use data.table to filter text while returning the same results as dplyr::filter?

CodePudding user response:

This is not a difference between dplyr::filter and data.table. %in% tests for exact (whole-string) matches, while grepl returns TRUE for substring matches as well. If we use grepl inside the data.table, it returns the same rows:

library(data.table)
setDT(df)[grepl("value_1", first)]
                  first second
1: value_1 and value_2      1
2:             value_1      3
3:             value_1      4

Alternatively, use data.table's %like% operator:

setDT(df)[first %like% "value_1"]
                 first second
1: value_1 and value_2      1
2:             value_1      3
3:             value_1      4
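
To make the difference explicit, here is a minimal sketch that runs the exact-match and substring-match filters side by side on the question's data. The object names (dt, exact_rows, substring_rows, fixed_rows) are only illustrative. The fixed = TRUE variant is an optional extra, not part of the answer above: it tells grepl to treat the pattern as literal text rather than a regular expression, which may be safer if the search string could contain characters such as "." or "(".

```r
library(data.table)

# Same example data as in the question
dt <- data.table(
  first  = c("value_1 and value_2", "value_2", "value_1", "value_1"),
  second = c(1, 2, 3, 4)
)

# %in% tests whole-string equality
exact_rows <- dt[first %in% "value_1"]

# grepl() tests whether the pattern occurs anywhere in the string
substring_rows <- dt[grepl("value_1", first)]

# fixed = TRUE matches the pattern as literal text, not as a regex
fixed_rows <- dt[grepl("value_1", first, fixed = TRUE)]

print(exact_rows$second)      # 3 4
print(substring_rows$second)  # 1 3 4
print(fixed_rows$second)      # 1 3 4
```

Because "value_1" contains no regex metacharacters, the regex and fixed versions select the same rows here.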