Home > Blockchain >  Filtering a large data frame based on column values using R
Filtering a large data frame based on column values using R

Time:10-02

I have a very large dataframe with almost 502493 rows and 261 columns. I want to filter it and need IDs with specific codes (codes starting with 'E'). This is how my data looks like,

IDs code1 code2
1 C443 E109
2 AX31 M223
1 E341 QWE1
3 E131 M223

My required output is IDs with codes starting with 'E' only.

IDs code
1 E109
1 E341
3 E131

I am trying to use the 'filter' of dplyr package but not getting the required output. Thanks in advance

CodePudding user response:

We can reshape to 'long' format with pivot_longer and filter by creating a logical vector from the first character extracted (with substr)

library(dplyr)
library(tidyr)
df1 %>%
   pivot_longer(cols = starts_with("code"), 
       values_to = 'code', names_to = NULL) %>%   
   filter(substr(code, 1, 1) == "E")

-output

# A tibble: 3 × 2
    IDs code 
  <int> <chr>
1     1 E109 
2     1 E341 
3     3 E131 

If the data is really big, we may do a filter before the pivot_longer to keep only rows having at least one 'E' in the column

df1 %>%
   filter(if_any(starts_with('code'), ~ substr(., 1, 1) == 'E')) %>% 
   pivot_longer(cols = starts_with("code"), 
       values_to = 'code', names_to = NULL) %>%   
   filter(substr(code, 1, 1) == "E")

If it is a very big data, another option is data.table. Convert the data.frame to 'data.table' (setDT), loop across the columns of interest (.SDcols) with lapply, replace the elements that are not starting with "E" to NA, then use fcoalesce to get the first non-NA element for each row using do.call

library(data.table)
na.omit(setDT(df1)[, .(IDs, code = do.call(fcoalesce, 
    lapply(.SD, function(x) replace(x, substr(x, 1, 1) != "E", 
      NA)))), .SDcols = patterns("code")])

-output

   IDs code
1:   1 E109
2:   1 E341
3:   3 E131

data

df1 <- structure(list(IDs = c(1L, 2L, 1L, 3L), code1 = c("C443", "AX31", 
"E341", "E131"), code2 = c("E109", "M223", "QWE1", "M223")), 
class = "data.frame", row.names = c(NA, 
-4L))
  • Related