Home > Net >  How to assign a unique identifier to a sequence of records
How to assign a unique identifier to a sequence of records

Time:11-29

I have a dataframe df,

ID      MID  TimeType   Starttime             Endtime               next_Starttime                
333429  7    NEB        2021-11-01 20:45:17   2021-11-01 20:45:44   2021-11-01 20:45:44 
333430  7    AUF        2021-11-01 20:45:44   2021-11-01 21:00:00   2021-11-01 21:00:00 
333476  7    AUF        2021-11-01 21:00:00   2021-11-01 21:03:36   2021-11-01 21:03:36 
333477  7    NEB        2021-11-01 21:03:36   2021-11-01 21:11:43   2021-11-01 21:11:43 
333502  7    AUF        2021-11-01 21:11:43   2021-11-01 21:11:44   2021-11-01 21:11:44 
333511  7    AUF        2021-11-01 21:11:44   2021-11-01 21:25:01   2021-11-01 21:25:01 
333553  7    AUF        2021-11-01 21:25:01   2021-11-01 21:40:01   2021-11-01 21:40:01 

I would like to assign a unique id for the column TimeType based on the sequence of repetition such that the desired output looks like this,

ID      MID  TimeType   Starttime             Endtime               next_Starttime       unique_id          
333429  7    NEB        2021-11-01 20:45:17   2021-11-01 20:45:44   2021-11-01 20:45:44  1
333430  7    AUF        2021-11-01 20:45:44   2021-11-01 21:00:00   2021-11-01 21:00:00  2
333476  7    AUF        2021-11-01 21:00:00   2021-11-01 21:03:36   2021-11-01 21:03:36  2
333477  7    NEB        2021-11-01 21:03:36   2021-11-01 21:11:43   2021-11-01 21:11:43  3
333502  7    AUF        2021-11-01 21:11:43   2021-11-01 21:11:44   2021-11-01 21:11:44  4
333511  7    AUF        2021-11-01 21:11:44   2021-11-01 21:25:01   2021-11-01 21:25:01  4
333553  7    AUF        2021-11-01 21:25:01   2021-11-01 21:40:01   2021-11-01 21:40:01  4

I tried using for-loops but based on the dataframe size the execution is too slow.

CodePudding user response:

You can use shift to compare the consecutive rows and cumsum to make a count:

df['unique_id'] = df['TimeType'].ne(df['TimeType'].shift()).cumsum()

output:

       ID  MID TimeType            Starttime              Endtime       next_Starttime  unique_id
0  333429    7      NEB  2021-11-01 20:45:17  2021-11-01 20:45:44  2021-11-01 20:45:44          1
1  333430    7      AUF  2021-11-01 20:45:44  2021-11-01 21:00:00  2021-11-01 21:00:00          2
2  333476    7      AUF  2021-11-01 21:00:00  2021-11-01 21:03:36  2021-11-01 21:03:36          2
3  333477    7      NEB  2021-11-01 21:03:36  2021-11-01 21:11:43  2021-11-01 21:11:43          3
4  333502    7      AUF  2021-11-01 21:11:43  2021-11-01 21:11:44  2021-11-01 21:11:44          4
5  333511    7      AUF  2021-11-01 21:11:44  2021-11-01 21:25:01  2021-11-01 21:25:01          4
6  333553    7      AUF  2021-11-01 21:25:01  2021-11-01 21:40:01  2021-11-01 21:40:01          4

CodePudding user response:

You can do

df['uni_id'] = df['TimeType'].ne(df['TimeType']).cumsum()

CodePudding user response:

import pdrle

pdrle.get_id(df.TimeType)
# 0    0
# 1    1
# 2    1
# 3    2
# 4    3
# 5    3
# 6    3
# dtype: int64
  • Related