Home > Enterprise >  How to create an indicator column to indicate specific change from a previous entry in a dataframe w
How to create an indicator column to indicate specific change from a previous entry in a dataframe w

Time:10-25

I have a dataframe of clients CLIENT_ID as shown below:

CLIENT_ID CURRENT_DATE_STATUS STATUS
10002 2017-07-21 STARTED
10002 2017-07-21 STARTED
10002 2018-07-01 CHURNED
10002 2018-07-01 CHURNED
10002 2019-01-01 RESTARTED
11811 2019-08-15 STARTED
11811 2019-08-15 STARTED
11811 2019-12-31 RESTARTED
22101 2020-03-11 STARTED
22101 2020-03-11 STARTED
22101 2020-03-11 STARTED
22101 2020-11-01 CHURNED
22300 2018-05-06 STARTED
22300 2018-05-06 STARTED

The dataframe is sorted by CLIENT_ID and CURRENT_DATE_STATUS. How can I create an indicator Boolean 1 or 0 column which indicates:

  • If the previous STATUS entry has changed to either CHURNED or RESTARTED for each CLIENT_ID.

The resultant dataframe would be as shown below:

CLIENT_ID CURRENT_DATE_STATUS STATUS STOPPED
10002 2017-07-21 STARTED 0
10002 2017-07-21 STARTED 0
10002 2018-07-01 CHURNED 1
10002 2018-07-01 CHURNED 0
10002 2019-01-01 RESTARTED 1
11811 2019-08-15 STARTED 0
11811 2019-08-15 STARTED 0
11811 2019-12-31 RESTARTED 1
22101 2020-03-11 STARTED 0
22101 2020-03-11 STARTED 0
22101 2020-03-11 STARTED 0
22101 2020-11-01 CHURNED 1
22300 2018-05-06 STARTED 0
22300 2018-05-06 STARTED 0

here is code to generate the dataframe

import pandas as pd

data = {'CLIENT_ID':[10002,10002,10002,10002,10002,11811,11811,11811,22101,22101,22101,22101,22300,22300],
'CURRENT_DATE_STATUS':['2017-07-21','2017-07-21','2018-07-01','2018-07-01','2019-07-01','2019-08-15','2019-08-15','2019-12-31','2020-03-11','2020-03-11','2020-03-11','2020-11-01','2018-05-06','2018-05-06'],
'STATUS':['STARTED','STARTED','CHURNED','CHURNED','RESTARTED','STARTED','STARTED','RESTARTED','STARTED','STARTED','STARTED','CHURNED','STARTED','STARTED']}
df = pd.DataFrame(data)

CodePudding user response:

You can check whether the current value equals to previous value:

cond = lambda d: (d.shift().ne(d)&d.isin(["CHURNED", "RESTARTED"])).astype(int)
df["STOPPED"] = df.groupby("CLIENT_ID")["STATUS"].transform(cond)

print (df)

    CLIENT_ID CURRENT_DATE_STATUS     STATUS  STOPPED
0       10002          2017-07-21    STARTED        0
1       10002          2017-07-21    STARTED        0
2       10002          2018-07-01    CHURNED        1
3       10002          2018-07-01    CHURNED        0
4       10002          2019-07-01  RESTARTED        1
5       11811          2019-08-15    STARTED        0
6       11811          2019-08-15    STARTED        0
7       11811          2019-12-31  RESTARTED        1
8       22101          2020-03-11    STARTED        0
9       22101          2020-03-11    STARTED        0
10      22101          2020-03-11    STARTED        0
11      22101          2020-11-01    CHURNED        1
12      22300          2018-05-06    STARTED        0
13      22300          2018-05-06    STARTED        0

CodePudding user response:

You can compare actual values for eqaul by Series.eq with shifted per groups by DataFrameGroupBy.shift for not equalSeries.ne, chain by & for bitwise AND and last chain by | for bitwise OR with casting to integers:

s = df.groupby('CLIENT_ID')['STATUS'].shift()
m1 = df['STATUS'].eq('RESTARTED') & s.ne('RESTARTED')
m2 = df['STATUS'].eq('CHURNED') & s.ne('CHURNED')

df['STOPPED'] = (m1 | m2).astype(int)
print (df)
    CLIENT_ID CURRENT_DATE_STATUS     STATUS  STOPPED
0       10002          2017-07-21    STARTED        0
1       10002          2017-07-21    STARTED        0
2       10002          2018-07-01    CHURNED        1
3       10002          2018-07-01    CHURNED        0
4       10002          2019-07-01  RESTARTED        1
5       11811          2019-08-15    STARTED        0
6       11811          2019-08-15    STARTED        0
7       11811          2019-12-31  RESTARTED        1
8       22101          2020-03-11    STARTED        0
9       22101          2020-03-11    STARTED        0
10      22101          2020-03-11    STARTED        0
11      22101          2020-11-01    CHURNED        1
12      22300          2018-05-06    STARTED        0
13      22300          2018-05-06    STARTED        0

Another solution is compare shifted values by previous and then if match by list in Series.isin, last chain by & for bitwise AND:

m3 = df.groupby('CLIENT_ID')['STATUS'].shift().ne(df['STATUS'])
m4 = df['STATUS'].isin(["CHURNED", "RESTARTED"])

df['STOPPED'] = (m3 & m4).astype(int)
print (df)

    CLIENT_ID CURRENT_DATE_STATUS     STATUS  STOPPED
0       10002          2017-07-21    STARTED        0
1       10002          2017-07-21    STARTED        0
2       10002          2018-07-01    CHURNED        1
3       10002          2018-07-01    CHURNED        0
4       10002          2019-07-01  RESTARTED        1
5       11811          2019-08-15    STARTED        0
6       11811          2019-08-15    STARTED        0
7       11811          2019-12-31  RESTARTED        1
8       22101          2020-03-11    STARTED        0
9       22101          2020-03-11    STARTED        0
10      22101          2020-03-11    STARTED        0
11      22101          2020-11-01    CHURNED        1
12      22300          2018-05-06    STARTED        0
13      22300          2018-05-06    STARTED        0
  • Related