I have a data table and I'd like to copy or subset to all the unique observations in variables whose variable name contains a certain substring. I can conceive of a way to do it clunkily with for loops, but that seems to defeat the utility and speed of a data table.
For example:
library(data.table)
library(dplyr)
dt <- setDT(tibble(a_ID = c(1, 1, 2, 11, 2),
b = c('x','y','z','z', 'x'),
b_ID = c('XY','XY','XY','XY', 'XY'),
d = 1:5))
What I would like to do from here is subset by all columns whose name contains 'ID', returning only the unique entries from each column, and since that will result in columns with differing amounts of observations for each, fill with NA
for the rest.
That is, I'd like to return the following:
subset_dt <- setDT(tibble(a_ID = c(1,2,11),
b_ID = c('XY', NA,NA)))
Is this possible with data.table functionality?
CodePudding user response:
We may get the unique
elements and then replace
the duplicated
with NA
library(data.table)
dt[, lapply(.SD, unique), .SDcols = patterns("_ID$")][,
lapply(.SD, \(x) replace(x, duplicated(x), NA))]
-output
a_ID b_ID
<num> <char>
1: 1 XY
2: 2 <NA>
3: 11 <NA>
Or another option with unique
unique(dt[, .(a_ID, b_ID)])[, lapply(.SD, \(x) fcase(!duplicated(x), x))]
a_ID b_ID
<num> <char>
1: 1 XY
2: 2 <NA>
3: 11 <NA>
Or another option is to block the code, check for the length
s after the unique
step and append NA
to fix the length
dt[, {lst1 <- lapply(.SD, unique)
mx <- max(lengths(lst1))
lapply(lst1, `length<-`, mx)}, .SDcols = patterns("_ID$")]
a_ID b_ID
<num> <char>
1: 1 XY
2: 2 <NA>
3: 11 <NA>
We may also use collapse
- select the columns (gvr
), get the unique rows (funique
), loop over the columns with dapply
, replace
the duplicates with NA
library(collapse)
dapply(funique(gvr(dt, "_ID$")), MARGIN = 2,
FUN = \(x) replace(x, duplicated(x), NA))
a_ID b_ID
<num> <char>
1: 1 XY
2: 2 <NA>
3: 11 <NA>