nth group labelling (count only graduates with new group)-CodePudding

Say, I have a df:

df <- data.table(user = c('a', 'a', 'b')
                 ); df

user
 a
 a
 b

I am looking to create a new column, group_id, at user level:

user group_id
 a        1
 a        1
 b        2

Is there a fast / scalable way? I can do this with a join (and understand data.table is fast with joins) but wonder if there is a simpler and faster method. Assume I have circa 1e8 rows and the method is to be applied to multiple columns in the group by (in example presented there is only 1 i.e. user)

I have started with:

df[, step_1 := as.integer(duplicated(df))]; df

user step_1
 a      0
 a      1
 b      0

but don't know what is next nor am I sure if this is even the right direction. Thank you.

CodePudding user response：

Convert user to factor. Then convert that to integer to extract the underlying codes that factor uses.

df[, group_id := as.integer(factor(user))]
df
##    user group_id
## 1:    a        1
## 2:    a        1
## 3:    b        2

CodePudding user response：

data.table’s rowid() function will also work with multiple grouping columns

df[, group_id := rowid(user)]

user group_id
   a        1
   a        1
   b        2

CodePudding user response：

We may also use match

df[, group_id := match(user, unique(user))]