I have a dataframe of clients CLIENT_ID
as shown below:
CLIENT_ID | CURRENT_DATE_STATUS | STATUS |
---|---|---|
10002 | 2017-07-21 | STARTED |
10002 | 2017-07-21 | STARTED |
10002 | 2018-07-01 | CHURNED |
10002 | 2018-07-01 | CHURNED |
10002 | 2019-01-01 | RESTARTED |
11811 | 2019-08-15 | STARTED |
11811 | 2019-08-15 | STARTED |
11811 | 2019-12-31 | RESTARTED |
22101 | 2020-03-11 | STARTED |
22101 | 2020-03-11 | STARTED |
22101 | 2020-03-11 | STARTED |
22101 | 2020-11-01 | CHURNED |
22300 | 2018-05-06 | STARTED |
22300 | 2018-05-06 | STARTED |
The dataframe is sorted by CLIENT_ID and CURRENT_DATE_STATUS
. How can I create an indicator Boolean 1 or 0
column which indicates:
- If the previous
STATUS
entry has changed to eitherCHURNED or RESTARTED
for eachCLIENT_ID
.
The resultant dataframe would be as shown below:
CLIENT_ID | CURRENT_DATE_STATUS | STATUS | STOPPED |
---|---|---|---|
10002 | 2017-07-21 | STARTED | 0 |
10002 | 2017-07-21 | STARTED | 0 |
10002 | 2018-07-01 | CHURNED | 1 |
10002 | 2018-07-01 | CHURNED | 0 |
10002 | 2019-01-01 | RESTARTED | 1 |
11811 | 2019-08-15 | STARTED | 0 |
11811 | 2019-08-15 | STARTED | 0 |
11811 | 2019-12-31 | RESTARTED | 1 |
22101 | 2020-03-11 | STARTED | 0 |
22101 | 2020-03-11 | STARTED | 0 |
22101 | 2020-03-11 | STARTED | 0 |
22101 | 2020-11-01 | CHURNED | 1 |
22300 | 2018-05-06 | STARTED | 0 |
22300 | 2018-05-06 | STARTED | 0 |
here is code to generate the dataframe
import pandas as pd
data = {'CLIENT_ID':[10002,10002,10002,10002,10002,11811,11811,11811,22101,22101,22101,22101,22300,22300],
'CURRENT_DATE_STATUS':['2017-07-21','2017-07-21','2018-07-01','2018-07-01','2019-07-01','2019-08-15','2019-08-15','2019-12-31','2020-03-11','2020-03-11','2020-03-11','2020-11-01','2018-05-06','2018-05-06'],
'STATUS':['STARTED','STARTED','CHURNED','CHURNED','RESTARTED','STARTED','STARTED','RESTARTED','STARTED','STARTED','STARTED','CHURNED','STARTED','STARTED']}
df = pd.DataFrame(data)
CodePudding user response:
You can check whether the current value equals to previous value:
cond = lambda d: (d.shift().ne(d)&d.isin(["CHURNED", "RESTARTED"])).astype(int)
df["STOPPED"] = df.groupby("CLIENT_ID")["STATUS"].transform(cond)
print (df)
CLIENT_ID CURRENT_DATE_STATUS STATUS STOPPED
0 10002 2017-07-21 STARTED 0
1 10002 2017-07-21 STARTED 0
2 10002 2018-07-01 CHURNED 1
3 10002 2018-07-01 CHURNED 0
4 10002 2019-07-01 RESTARTED 1
5 11811 2019-08-15 STARTED 0
6 11811 2019-08-15 STARTED 0
7 11811 2019-12-31 RESTARTED 1
8 22101 2020-03-11 STARTED 0
9 22101 2020-03-11 STARTED 0
10 22101 2020-03-11 STARTED 0
11 22101 2020-11-01 CHURNED 1
12 22300 2018-05-06 STARTED 0
13 22300 2018-05-06 STARTED 0
CodePudding user response:
You can compare actual values for eqaul by Series.eq
with shifted per groups by DataFrameGroupBy.shift
for not equalSeries.ne
, chain by &
for bitwise AND
and last chain by |
for bitwise OR
with casting to integers:
s = df.groupby('CLIENT_ID')['STATUS'].shift()
m1 = df['STATUS'].eq('RESTARTED') & s.ne('RESTARTED')
m2 = df['STATUS'].eq('CHURNED') & s.ne('CHURNED')
df['STOPPED'] = (m1 | m2).astype(int)
print (df)
CLIENT_ID CURRENT_DATE_STATUS STATUS STOPPED
0 10002 2017-07-21 STARTED 0
1 10002 2017-07-21 STARTED 0
2 10002 2018-07-01 CHURNED 1
3 10002 2018-07-01 CHURNED 0
4 10002 2019-07-01 RESTARTED 1
5 11811 2019-08-15 STARTED 0
6 11811 2019-08-15 STARTED 0
7 11811 2019-12-31 RESTARTED 1
8 22101 2020-03-11 STARTED 0
9 22101 2020-03-11 STARTED 0
10 22101 2020-03-11 STARTED 0
11 22101 2020-11-01 CHURNED 1
12 22300 2018-05-06 STARTED 0
13 22300 2018-05-06 STARTED 0
Another solution is compare shifted values by previous and then if match by list in Series.isin
, last chain by &
for bitwise AND
:
m3 = df.groupby('CLIENT_ID')['STATUS'].shift().ne(df['STATUS'])
m4 = df['STATUS'].isin(["CHURNED", "RESTARTED"])
df['STOPPED'] = (m3 & m4).astype(int)
print (df)
CLIENT_ID CURRENT_DATE_STATUS STATUS STOPPED
0 10002 2017-07-21 STARTED 0
1 10002 2017-07-21 STARTED 0
2 10002 2018-07-01 CHURNED 1
3 10002 2018-07-01 CHURNED 0
4 10002 2019-07-01 RESTARTED 1
5 11811 2019-08-15 STARTED 0
6 11811 2019-08-15 STARTED 0
7 11811 2019-12-31 RESTARTED 1
8 22101 2020-03-11 STARTED 0
9 22101 2020-03-11 STARTED 0
10 22101 2020-03-11 STARTED 0
11 22101 2020-11-01 CHURNED 1
12 22300 2018-05-06 STARTED 0
13 22300 2018-05-06 STARTED 0