I have a data frame similar to the one below. I want to create a new unique identifier column based on other columns. The new column should be a concatenation of the district number, the store number, the string zero ("0"), and an incremental count value.
   employee       date  district  store
0      1234  2021-12-1       336    450
1      1234  2021-12-1       336    450
2      1234  2021-12-2       336    450
3      5678  2021-12-1       336    650
4      5678  2021-12-2       336    650
5      5678  2021-12-3       336    650
P.S: If an employee has multiple rows on the same day, the shiftID value should be the same for those rows; otherwise, the shiftID value should increment by 1. If the store changes, the incremental value should start again from zero.
I want the output to look like the dataframe below:
   employee       date  district  store   shiftID
0      1234  2021-12-1       336    450  33645000
1      1234  2021-12-1       336    450  33645000
2      1234  2021-12-2       336    450  33645001
3      5678  2021-12-1       336    650  33665000
4      5678  2021-12-2       336    650  33665001
5      5678  2021-12-3       336    650  33665002
I tried the following code:
df['shiftid'] = df['district'] + df['store'] + '0' + df.groupby(['employee','date']).cumcount().astype(str)
but this is not the output that I wanted:
   employee       date  district  store   shiftid
0      1234  2021-12-1       336    450  33645000
1      1234  2021-12-1       336    450  33645001
2      1234  2021-12-2       336    450  33645000
3      5678  2021-12-1       336    650  33665000
4      5678  2021-12-2       336    650  33665000
5      5678  2021-12-3       336    650  33665000
Any help will be appreciated. Thanks in advance!
CodePudding user response:
You could do it like this:
df['day_id'] = df.groupby(['employee', 'date']).ngroup()
df['day_id'] -= df.groupby('employee')['day_id'].transform('min')
df['shiftid'] = df['district'] + df['store'] + '0' + df['day_id'].astype(str)
print(df.drop(columns=['day_id']))
   employee       date  district  store   shiftid
0      1234  2021-12-1       336    450  33645000
1      1234  2021-12-1       336    450  33645000
2      1234  2021-12-2       336    450  33645001
3      5678  2021-12-1       336    650  33665000
4      5678  2021-12-2       336    650  33665001
5      5678  2021-12-2       336    650  33665001
Note: the last date in your "expected output" differs from the one in the input, which is why the last shiftid above is different. With 2021-12-3 in the input, this is the result:
   employee       date  district  store   shiftid
0      1234  2021-12-1       336    450  33645000
1      1234  2021-12-1       336    450  33645000
2      1234  2021-12-2       336    450  33645001
3      5678  2021-12-1       336    650  33665000
4      5678  2021-12-2       336    650  33665001
5      5678  2021-12-3       336    650  33665002
Which matches your desired output.
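For reference, here is a self-contained sketch of the same ngroup() idea that you can paste and run. It rebuilds the sample frame from the question and assumes district and store are stored as integers (hence the astype(str) calls, which are harmless if they are already strings). It also keeps the counter in a separate Series instead of a temporary day_id column.

```python
import pandas as pd

# Sample data from the question; integer district/store is an assumption.
df = pd.DataFrame({
    "employee": [1234, 1234, 1234, 5678, 5678, 5678],
    "date": ["2021-12-1", "2021-12-1", "2021-12-2",
             "2021-12-1", "2021-12-2", "2021-12-3"],
    "district": [336] * 6,
    "store": [450, 450, 450, 650, 650, 650],
})

# Give every (employee, date) pair a global group number ...
day_id = df.groupby(["employee", "date"]).ngroup()
# ... then subtract each employee's smallest number so the count restarts at 0.
day_id -= day_id.groupby(df["employee"]).transform("min")

# Concatenate district + store + "0" + counter as strings.
df["shiftid"] = (
    df["district"].astype(str)
    + df["store"].astype(str)
    + "0"
    + day_id.astype(str)
)
print(df)
```

Two caveats: the counter restarts per employee rather than per store, which gives the same result here only because each employee appears in a single store, and the dates are compared as strings, so 2021-12-10 would sort before 2021-12-2 unless the column is converted to real dates first.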
CodePudding user response:
Here is one way, using rank():
df['shiftID'] = df['district'].map(str) + df['store'].map(str) + \
    df.groupby(['employee'])['date'].rank(method="dense").sub(1).map(int).map(str).str.zfill(2)
Output:
   employee       date  district  store   shiftID
0      1234  2021-12-1       336    450  33645000
1      1234  2021-12-1       336    450  33645000
2      1234  2021-12-2       336    450  33645001
3      5678  2021-12-1       336    650  33665000
4      5678  2021-12-2       336    650  33665001
5      5678  2021-12-3       336    650  33665002
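If the date column holds plain strings, a variant along these lines may be more robust. This is only a sketch under a few assumptions: the dates are in year-month-day order (possibly without zero padding), district and store are integers, and the counter should restart for every (employee, store) pair. The helper names parts and parsed are made up for illustration.

```python
import pandas as pd

# Sample data from the question; integer district/store is an assumption.
df = pd.DataFrame({
    "employee": [1234, 1234, 1234, 5678, 5678, 5678],
    "date": ["2021-12-1", "2021-12-1", "2021-12-2",
             "2021-12-1", "2021-12-2", "2021-12-3"],
    "district": [336] * 6,
    "store": [450, 450, 450, 650, 650, 650],
})

# Split "Y-M-D" strings into integer parts and assemble real datetimes,
# so that ranking is chronological rather than lexicographic.
parts = df["date"].str.split("-", expand=True).astype(int)
parts.columns = ["year", "month", "day"]
parsed = pd.to_datetime(parts)

# Dense rank of the date within each (employee, store) group, shifted to start at 0.
counter = (
    parsed.groupby([df["employee"], df["store"]])
    .rank(method="dense")
    .sub(1)
    .astype(int)
)

# zfill(2) pads the counter to two digits, matching the "0" + count layout
# for counts below 10.
df["shiftID"] = (
    df["district"].astype(str)
    + df["store"].astype(str)
    + counter.astype(str).str.zfill(2)
)
print(df)
```

Note that zfill(2) and a literal "0" prefix only agree while the counter stays below 10: the tenth distinct day gives "10" with zfill(2) but "010" with the prefix, so pick whichever matches the ID format you actually need.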