I been asked to consolidate data and remove duplicate spans
from this:
ID Start_Date End_Date Status
2 250027 20100101 20120718 TERMINATED
3 250027 20120719 20140729 SUSPENDED
4 250027 20140730 20140730 SUSPENDED
5 250027 20140731 20140907 ACTIVE
6 250027 20140908 20150528 SUSPENDED
7 250027 20150529 20180309 ACTIVE
8 250027 20180310 20190818 ACTIVE
9 250027 20190819 22000101 ACTIVE
to this:
ID Start_Date End_Date Status
2 250027 20100101 20120718 TERMINATED
3 250027 20120719 20140730 SUSPENDED
4 250027 20140731 20140907 ACTIVE
5 250027 20140908 20150528 SUSPENDED
6 250027 20150529 22000101 ACTIVE
test.groupby(['ID','Status']).End_Date.agg([min,max])
min max
ID Status
250027 ACTIVE 20140907 22000101
SUSPENDED 20140729 20150528
TERMINATED 20120718 20120718
test.groupby(['ID','Status']).Start_Date.agg([min,max])
ID Status
250027 ACTIVE 20140731 20190819
SUSPENDED 20120719 20140908
TERMINATED 20100101 20100101
The python code is not consolidating incorrectly overlapping spans, not sure how i can get this to work.
CodePudding user response:
You may need a shift
and cumsum
create the sub key
new = df.Status.ne(df.Status.shift()).cumsum()
d = {'ID':'first', 'Start_Date':'min', 'End_Date':'max', 'Status':'first'}
out = df.groupby([df['ID'],new]).agg(d).reset_index(drop=True)
out
Out[122]:
ID Start_Date End_Date Status
0 250027 20100101 20120718 TERMINATED
1 250027 20120719 20140730 SUSPENDED
2 250027 20140731 20140907 ACTIVE
3 250027 20140908 20150528 SUSPENDED
4 250027 20150529 22000101 ACTIVE