Home > Back-end >  How to keep unique list-column values using data.table in R?
How to keep unique list-column values using data.table in R?

Time:03-23

I have a dataframe that identifies a set of values with an id. Let's say:

library(data.table)

dt <- data.table(
  id = rep(c("a", "b", "c"), each = 2),
  value1 = c(1, 1, 1, 2, 1, 1),
  value2 = c(0, 3, 0, 3, 0, 3)
)

As you can see, an id identifies many row of values, not a single one. What I want to do is to keep only the ids that identify the first occurrence of a group of values. For example, a and c in the dataframe above identify the same group of values, so I'd like the result to be:

dt[1:4] # desired output
#>    id value1 value2
#> 1:  a      1      0
#> 2:  a      1      3
#> 3:  b      1      0
#> 4:  b      2      3

I figured out that a simple way of doing so would be to nest the group of values into dataframes as well, and then keep only the unique entries based on this new nested dataframe column. Nesting is done quite simply:

dt <- dt[, .(data = list(.SD)), by = id]
dt
#>    id              data
#> 1:  a <data.table[2x2]>
#> 2:  b <data.table[2x2]>
#> 3:  c <data.table[2x2]>

But the actual "keeping unique occurrences" is quite tricky, apparently. I tried two different approaches, but both of them failed. First, using unique.data.table. But the by argument doesn't yet support list columns:

dt <- unique(dt, by = "data")
#> Error in forderv(x, by = by, sort = FALSE, retGrp = TRUE): Column 1 passed to [f]order is type 'list', not yet supported.

And then using .I[], but then I can't pass a list column to the by argument:

dt <- dt[dt[, .I[1], by = data]$V1]
#> Error: column or expression 1 of 'by' or 'keyby' is type list. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]

I spent quite some time on this, but I can't seem to figure out how to achieve what I want. I'm not necessarily attached to the "nest -> keep unique dataframe" path, but this is the only way of solving the problem I could come up with.

CodePudding user response:

We may use duplicated with unnest

library(tidyr)
dt[, .(data = list(.SD)), by = id][!duplicated(data)] %>%
   unnest(data)

-output

# A tibble: 4 × 3
  id    value1 value2
  <chr>  <dbl>  <dbl>
1 a          1      0
2 a          1      3
3 b          1      0
4 b          2      3

CodePudding user response:

In the spirit of @akrun answer, but keeping data.table as the only dependency:

library(data.table)

dt <- data.table(
  id = rep(c("a", "b", "c"), each = 2),
  value1 = c(1, 1, 1, 2, 1, 1),
  value2 = c(0, 3, 0, 3, 0, 3)
)

dt <- dt[, .(data = list(.SD)), by = id]
dt <- dt[!duplicated(data)]
dt[, unlist(data, recursive = FALSE), by = id]
#>    id value1 value2
#> 1:  a      1      0
#> 2:  a      1      3
#> 3:  b      1      0
#> 4:  b      2      3
  • Related