I have a dataframe that identifies a set of values with an id:
library(data.table)
dt <- data.table(
id = rep(c("a", "b", "c"), each = 2),
value1 = c(1, 1, 1, 2, 1, 1),
value2 = c(0, 3, 0, 3, 0, 3)
)
dt
#> id value1 value2
#> 1: a 1 0
#> 2: a 1 3
#> 3: b 1 0
#> 4: b 2 3
#> 5: c 1 0
#> 6: c 1 3
As you can see, the ids a
and c
identify both the same set of values. So I want to create a "pattern id", that identifies the set of values associated with the ids a
and c
.
I did manage to come up with a solution using nested data.tables and match()
:
dt <- dt[, .(data = list(.SD)), by = id]
unique_groups <- unique(dt$data)
dt[, pattern_id := match(data, unique_groups)]
dt[, data := NULL]
dt
#> id pattern_id
#> 1: a 1
#> 2: b 2
#> 3: c 1
It does the trick, but it is not as fast as I'd like it to be. match()
documentation is pretty clear regarding its efficiency with lists:
Matching for lists is potentially very slow and best avoided except in simple cases.
As you can see, I don't need the actual pattern data in my final result, only a table that associates the ids to the pattern ids. I feel like nesting the data, using it to match and then dropping it afterwards is a bit wasteful, but not sure if there's a better way. I was thinking in something that transform each dataframe into a string, or, even better, something that avoided the nesting altogether, but I couldn't come up with anything better than what I have now.
I have created a bigger dataset to play around with and test different solutions:
set.seed(0)
size <- 1000000
dt <- data.table(
id = rep(1:(size / 2), each = 2),
value1 = sample(1:10, size, replace = TRUE),
value2 = sample(1:10, size, replace = TRUE)
)
CodePudding user response:
We can try the code below
dt[
,
q := toString(unlist(.SD)), id
][
,
pattern_id := .GRP, q
][
,
q := NULL
][]
or
dt[
,
q := toString(unlist(.SD)),
id
][
,
pattern_id := as.integer(factor(match(q, q)))
][
,
q := NULL
][]
which gives
id value1 value2 pattern_id
1: a 1 0 1
2: a 1 3 1
3: b 1 0 2
4: b 2 3 2
5: c 1 0 1
6: c 1 3 1
CodePudding user response:
Assuming each id is repeated twice, "reshape" - convert 2x2 into 1x4 columns. Then get group ID using .GRP by grouping by all columns excluding id:
res <- dt[, c(.SD[ 1 ], .SD[ 2 ]), by = id]
setnames(res, make.unique(colnames(res)))
res[, pattern_id := .GRP, by = res[, -1] ][, .(id, pattern_id)]
# id pattern_id
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 4
# 5: 5 5
# ---
# 499996: 499996 1010
# 499997: 499997 3175
# 499998: 499998 3996
# 499999: 499999 3653
# 500000: 500000 4217
Using the bigger dataset takes about half a second.
Edit: another version using dcast, but it is 8x slower:
res <- dcast(dt, id ~ value1 value2, length)
res[, pattern_id :=.GRP, by = res[, -1] ][, .(id, pattern_id)]
CodePudding user response:
How about reshaping wider and using paste0()
?
library(dplyr)
library(tidyr)
dt <- dt %>% group_by(id) %>%
mutate(inst = row_number(id)) %>%
pivot_wider(values_from = c(value1, value2),
names_from = inst) %>%
mutate(pattern_id = paste0(value1_1, value1_2, value2_1, value2_2))
CodePudding user response:
With toString
, as suggested by data.table
error message when using a list as by
:
Column or expression 1 of 'by' is type 'list' which is not currently supported.
As a workaround, consider converting the column to a supported type, e.g. by=sapply(list_col, toString)
dt <- dt[, .(data = list(.SD)), by = id]
dt[, pattern_id :=.GRP, by = sapply(data, toString)]
dt[,unlist(data,recursive=F),by=.(id,pattern_id)]
id pattern_id value1 value2
<char> <int> <num> <num>
1: a 1 1 0
2: a 1 1 3
3: b 2 1 0
4: b 2 2 3
5: c 1 1 0
6: c 1 1 3
However, this is slower than match
.
CodePudding user response:
Updated (to remove join):
This one replicates your approach (i.e. it requires that the order is the same as well as the values)
unique(
dt[, pattern:=.(paste0(c(value1,value2), collapse=",")), by=id][,.(id,pattern)]
)[,grp:=.GRP, by=pattern][,pattern:=NULL]
id grp
<char> <int>
1: a 1
2: b 2
3: c 1
Prior solution:
dt[dt[, .(paste0(sort(c(value1,value2)), collapse=",")), by=id] %>%
.[,pattern:=.GRP, by=V1] %>%
.[,V1:=NULL], on=.(id)]
Output:
id value1 value2 pattern
<char> <num> <num> <int>
1: a 1 0 1
2: a 1 3 1
3: b 1 0 2
4: b 2 3 2
5: c 1 0 1
6: c 1 3 1