Say, I have a df
:
df <- data.table(user = c('a', 'a', 'b')
); df
user
a
a
b
I am looking to create a new column, group_id
, at user
level:
user group_id
a 1
a 1
b 2
Is there a fast / scalable way? I can do this with a join (and understand data.table
is fast with joins) but wonder if there is a simpler and faster method.
Assume I have circa 1e8 rows and the method is to be applied to multiple columns in the group by (in example presented there is only 1 i.e. user
)
I have started with:
df[, step_1 := as.integer(duplicated(df))]; df
user step_1
a 0
a 1
b 0
but don't know what is next nor am I sure if this is even the right direction. Thank you.
CodePudding user response:
Convert user to factor. Then convert that to integer to extract the underlying codes that factor uses.
df[, group_id := as.integer(factor(user))]
df
## user group_id
## 1: a 1
## 2: a 1
## 3: b 2
CodePudding user response:
data.table’s rowid()
function will also work with multiple grouping columns
df[, group_id := rowid(user)]
user group_id a 1 a 1 b 2
CodePudding user response:
We may also use match
df[, group_id := match(user, unique(user))]