Home > Software design >  Tacking monthly activity and vacancy of employees ids over time in a pandas dataframe
Tacking monthly activity and vacancy of employees ids over time in a pandas dataframe

Time:10-18

I have the following dataframe that contains users ids over time (months):

DATE             USER_ID            ACTIVITY 
2021-06-01       A1                1
2021-06-01       B2                1
2021-06-01       C3                0
2021-07-01       A1                1
2021-07-01       B2                0
2021-07-01       C3                1
2021-08-01       A1                1
2021-08-01       B2                0
2021-08-01       C3                0
2021-09-01       A1                1
2021-09-01       B2                1
2021-09-01       C3                1

The values of the "ACTIVE" column just show 1 (the user is active) and 0 (user not active).

I want to create a new column that monthly traces whether the employee id got activity based on the following conditions:

*if last month id was 0 and current id is 0 then the label is "vacancy stock"

*if last month id was 0 and current id is 1 then the label is "new active"

*if last month id was 1 and current id is 0 then the label is "new vacancy"

*if last month id was 1 and current id is 1 then the label is "active stock"

Considering those conditions, this is the expected dataframe with the new column:

DATE             USER_ID            ACTIVITY          NEW_COLUMN
2021-06-01       A1                1                   NaN
2021-06-01       B2                1                   NaN
2021-06-01       C3                0                   NaN
2021-07-01       A1                1                   active stock
2021-07-01       B2                0                   new vacancy                   
2021-07-01       C3                1                   new active
2021-08-01       A1                1                   active stock
2021-08-01       B2                0                   vacancy stock
2021-08-01       C3                0                   new vacancy
2021-09-01       A1                1                   active stock
2021-09-01       B2                1                   new active
2021-09-01       C3                1                   new active

Thanks in advance for your help!

CodePudding user response:

You can use groupby.shift to access the previous period per user and combine the status on a binary-like encoding:

d = {0: 'vacancy stock', 1: 'new active', 10: 'new vacancy', 11: 'active stock'}

df['NEW_COLUMN'] = (df['ACTIVITY']
 .add(df.groupby('USER_ID')['ACTIVITY'].shift().mul(10))
 .map(d)
)

Output:

          DATE USER_ID  ACTIVITY     NEW_COLUMN
0   2021-06-01      A1         1            NaN
1   2021-06-01      B2         1            NaN
2   2021-06-01      C3         0            NaN
3   2021-07-01      A1         1   active stock
4   2021-07-01      B2         0    new vacancy
5   2021-07-01      C3         1     new active
6   2021-08-01      A1         1   active stock
7   2021-08-01      B2         0  vacancy stock
8   2021-08-01      C3         0    new vacancy
9   2021-09-01      A1         1   active stock
10  2021-09-01      B2         1     new active
11  2021-09-01      C3         1     new active

NB. For a real binary encoding you would multiply by 2 and use the dictionary: d = {0: 'vacancy stock', 1: 'new active', 2: 'new vacancy', 3: 'active stock'}.

  • Related