Home > Software design >  nth group labelling (count only graduates with new group)
nth group labelling (count only graduates with new group)

Time:10-16

Say, I have a df:

df <- data.table(user = c('a', 'a', 'b')
                 ); df

user
 a
 a
 b

I am looking to create a new column, group_id, at user level:

user group_id
 a        1
 a        1
 b        2

Is there a fast / scalable way? I can do this with a join (and understand data.table is fast with joins) but wonder if there is a simpler and faster method. Assume I have circa 1e8 rows and the method is to be applied to multiple columns in the group by (in example presented there is only 1 i.e. user)

I have started with:

df[, step_1 := as.integer(duplicated(df))]; df

user step_1
 a      0
 a      1
 b      0

but don't know what is next nor am I sure if this is even the right direction. Thank you.

CodePudding user response:

Convert user to factor. Then convert that to integer to extract the underlying codes that factor uses.

df[, group_id := as.integer(factor(user))]
df
##    user group_id
## 1:    a        1
## 2:    a        1
## 3:    b        2

CodePudding user response:

’s rowid() function will also work with multiple grouping columns

df[, group_id := rowid(user)]
user group_id
   a        1
   a        1
   b        2

CodePudding user response:

We may also use match

df[, group_id := match(user, unique(user))]
  • Related