Home > other >  Subset data table by all unique entries in columns whose name contain a certain substring, fill with
Subset data table by all unique entries in columns whose name contain a certain substring, fill with

Time:03-09

I have a data table and I'd like to copy or subset to all the unique observations in variables whose variable name contains a certain substring. I can conceive of a way to do it clunkily with for loops, but that seems to defeat the utility and speed of a data table.

For example:

library(data.table)
library(dplyr)
dt <- setDT(tibble(a_ID = c(1, 1, 2, 11, 2),
                   b = c('x','y','z','z', 'x'),
                   b_ID = c('XY','XY','XY','XY', 'XY'),
                   d = 1:5))

What I would like to do from here is subset by all columns whose name contains 'ID', returning only the unique entries from each column, and since that will result in columns with differing amounts of observations for each, fill with NA for the rest.

That is, I'd like to return the following:

subset_dt <- setDT(tibble(a_ID = c(1,2,11),
                          b_ID = c('XY', NA,NA)))

Is this possible with data.table functionality?

CodePudding user response:

We may get the unique elements and then replace the duplicated with NA

library(data.table)
dt[, lapply(.SD, unique), .SDcols = patterns("_ID$")][,
    lapply(.SD, \(x) replace(x, duplicated(x), NA))]

-output

    a_ID   b_ID
   <num> <char>
1:     1     XY
2:     2   <NA>
3:    11   <NA>

Or another option with unique

unique(dt[, .(a_ID, b_ID)])[, lapply(.SD, \(x) fcase(!duplicated(x), x))]
    a_ID   b_ID
   <num> <char>
1:     1     XY
2:     2   <NA>
3:    11   <NA>

Or another option is to block the code, check for the lengths after the unique step and append NA to fix the length

dt[, {lst1 <- lapply(.SD, unique)
     mx <- max(lengths(lst1))
    lapply(lst1, `length<-`, mx)}, .SDcols = patterns("_ID$")]
    a_ID   b_ID
   <num> <char>
1:     1     XY
2:     2   <NA>
3:    11   <NA>

We may also use collapse - select the columns (gvr), get the unique rows (funique), loop over the columns with dapply, replace the duplicates with NA

library(collapse)
dapply(funique(gvr(dt, "_ID$")), MARGIN = 2, 
   FUN = \(x) replace(x, duplicated(x), NA))
    a_ID   b_ID
   <num> <char>
1:     1     XY
2:     2   <NA>
3:    11   <NA>
  • Related