pandas, create a new unique identifier column based on values from two other columns

Time:05-13

I have a data frame similar to the one below, and I want to create a new unique identifier column based on other columns. The new column should be a concatenation of the district number, the store number, the string zero ("0"), and an incremental count value.

   employee       date  district  store
0      1234  2021-12-1       336    450
1      1234  2021-12-1       336    450
2      1234  2021-12-2       336    450
3      5678  2021-12-1       336    650
4      5678  2021-12-2       336    650
5      5678  2021-12-3       336    650

P.S.: If an employee has multiple rows on the same day, the shiftID value should be the same for those rows; otherwise, the shiftID value should increment by 1. If the store changes, the incremental value should start from zero again.

I want the output to look like the dataframe below:

   employee       date  district  store   shiftID
0      1234  2021-12-1       336    450  33645000
1      1234  2021-12-1       336    450  33645000
2      1234  2021-12-2       336    450  33645001
3      5678  2021-12-1       336    650  33665000
4      5678  2021-12-2       336    650  33665001
5      5678  2021-12-3       336    650  33665002

I tried using the following code:

df['shiftid'] = df['district'] + df['store'] + '0' + df.groupby(['employee','date']).cumcount().astype(str)

but this is not the output that I wanted:

   employee       date  district  store   shiftid
0      1234  2021-12-1       336    450  33645000
1      1234  2021-12-1       336    450  33645001
2      1234  2021-12-2       336    450  33645000
3      5678  2021-12-1       336    650  33665000
4      5678  2021-12-2       336    650  33665000
5      5678  2021-12-3       336    650  33665000

Any help will be appreciated. Thanks in advance!

CodePudding user response:

You could do it like this:

df['day_id'] = df.groupby(['employee', 'date']).ngroup()
df['day_id'] -= df.groupby('employee')['day_id'].transform('min')

df['shiftid'] = df['district'] + df['store'] + '0' + df['day_id'].astype(str)

print(df.drop(columns=['day_id']))
  employee       date district store   shiftid
0     1234  2021-12-1      336   450  33645000
1     1234  2021-12-1      336   450  33645000
2     1234  2021-12-2      336   450  33645001
3     5678  2021-12-1      336   650  33665000
4     5678  2021-12-2      336   650  33665001
5     5678  2021-12-2      336   650  33665001

Note: the last date in your "expected output" differs from the one in the input, which is why the last shiftid is different. If the input had 2021-12-3, this would be the result:

  employee       date district store   shiftid
0     1234  2021-12-1      336   450  33645000
1     1234  2021-12-1      336   450  33645000
2     1234  2021-12-2      336   450  33645001
3     5678  2021-12-1      336   650  33665000
4     5678  2021-12-2      336   650  33665001
5     5678  2021-12-3      336   650  33665002

Which matches your desired output.
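
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same ngroup() idea. It assumes that district and store are stored as strings (so that + concatenates them) and it rebuilds the sample frame from the question; the intermediate Series name day_id is kept out of the frame here, which is my own variation rather than part of the answer above.

```python
import pandas as pd

# Sample data from the question; district and store are assumed to be strings.
df = pd.DataFrame({
    "employee": [1234, 1234, 1234, 5678, 5678, 5678],
    "date": ["2021-12-1", "2021-12-1", "2021-12-2",
             "2021-12-1", "2021-12-2", "2021-12-3"],
    "district": ["336"] * 6,
    "store": ["450", "450", "450", "650", "650", "650"],
})

# Number every (employee, date) pair; groups are numbered in sorted key order.
day_id = df.groupby(["employee", "date"]).ngroup()

# Shift each employee's numbering so that it starts at zero.
day_id = day_id - day_id.groupby(df["employee"]).transform("min")

# Concatenate district + store + "0" + per-employee day counter.
df["shiftid"] = df["district"] + df["store"] + "0" + day_id.astype(str)

print(df)
```

One caveat with this sketch: the dates are compared as plain strings, so a value such as "2021-12-10" would sort before "2021-12-2". If the real data contains such dates, it is probably safer to convert the column to a real datetime type first.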

CodePudding user response:

Here is one way using rank():

df['shiftID'] = df['district'].map(str) + df['store'].map(str) + \
            df.groupby(['employee'])['date'].rank(method="dense").sub(1).map(int).map(str).str.zfill(2)

output:

>>>
   employee       date  district  store   shiftID
0      1234  2021-12-1       336    450  33645000
1      1234  2021-12-1       336    450  33645000
2      1234  2021-12-2       336    450  33645001
3      5678  2021-12-1       336    650  33665000
4      5678  2021-12-2       336    650  33665001
5      5678  2021-12-3       336    650  33665002
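
As a possible variation (not from either answer above), the per-day counter can also be built with pd.factorize, which numbers values in order of first appearance and so does not depend on how the date strings sort. The sketch below assumes the rows are already ordered by date within each employee, that district and store are integers, and that the counter should restart per (store, employee) pair; that grouping is my reading of the "start from zero when the store changes" requirement.

```python
import pandas as pd

# Sample data from the question; district and store are assumed to be integers.
df = pd.DataFrame({
    "employee": [1234, 1234, 1234, 5678, 5678, 5678],
    "date": ["2021-12-1", "2021-12-1", "2021-12-2",
             "2021-12-1", "2021-12-2", "2021-12-3"],
    "district": [336] * 6,
    "store": [450, 450, 450, 650, 650, 650],
})

# Within each (store, employee) group, label dates 0, 1, 2, ... in order of
# first appearance; repeated dates get the same label.
counter = df.groupby(["store", "employee"])["date"].transform(
    lambda s: pd.factorize(s)[0]
)

# Zero-pad the counter to two digits, as the rank() answer does with zfill(2).
df["shiftID"] = (
    df["district"].astype(str)
    + df["store"].astype(str)
    + counter.astype(str).str.zfill(2)
)

print(df)
```

Note that zfill(2) and a fixed "0" prefix only agree while the counter stays below 10: zfill(2) turns 10 into "10", whereas "0" + "10" gives "010". Which of the two is wanted depends on how the identifier is meant to grow.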