I have a very large dataframe with almost 502493 rows and 261 columns. I want to filter it and need IDs with specific codes (codes starting with 'E'). This is how my data looks like,
IDs | code1 | code2 |
---|---|---|
1 | C443 | E109 |
2 | AX31 | M223 |
1 | E341 | QWE1 |
3 | E131 | M223 |
My required output is IDs with codes starting with 'E' only.
IDs | code |
---|---|
1 | E109 |
1 | E341 |
3 | E131 |
I am trying to use the 'filter' of dplyr package but not getting the required output. Thanks in advance
CodePudding user response:
We can reshape to 'long' format with pivot_longer
and filter
by creating a logical vector from the first character extracted (with substr
)
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = starts_with("code"),
values_to = 'code', names_to = NULL) %>%
filter(substr(code, 1, 1) == "E")
-output
# A tibble: 3 × 2
IDs code
<int> <chr>
1 1 E109
2 1 E341
3 3 E131
If the data is really big, we may do a filter
before the pivot_longer
to keep only rows having at least one 'E' in the column
df1 %>%
filter(if_any(starts_with('code'), ~ substr(., 1, 1) == 'E')) %>%
pivot_longer(cols = starts_with("code"),
values_to = 'code', names_to = NULL) %>%
filter(substr(code, 1, 1) == "E")
If it is a very big data, another option is data.table
. Convert the data.frame to 'data.table' (setDT
), loop across the columns of interest (.SDcols
) with lapply
, replace
the elements that are not starting with "E" to NA
, then use fcoalesce
to get the first non-NA element for each row using do.call
library(data.table)
na.omit(setDT(df1)[, .(IDs, code = do.call(fcoalesce,
lapply(.SD, function(x) replace(x, substr(x, 1, 1) != "E",
NA)))), .SDcols = patterns("code")])
-output
IDs code
1: 1 E109
2: 1 E341
3: 3 E131
data
df1 <- structure(list(IDs = c(1L, 2L, 1L, 3L), code1 = c("C443", "AX31",
"E341", "E131"), code2 = c("E109", "M223", "QWE1", "M223")),
class = "data.frame", row.names = c(NA,
-4L))