Home > Blockchain >  how to obtain overlapping min and max dates for multiple columns in a pandas df
how to obtain overlapping min and max dates for multiple columns in a pandas df

Time:03-27

I been asked to consolidate data and remove duplicate spans

from this:

ID  Start_Date  End_Date    Status
2   250027  20100101    20120718    TERMINATED
3   250027  20120719    20140729    SUSPENDED
4   250027  20140730    20140730    SUSPENDED
5   250027  20140731    20140907    ACTIVE
6   250027  20140908    20150528    SUSPENDED
7   250027  20150529    20180309    ACTIVE
8   250027  20180310    20190818    ACTIVE
9   250027  20190819    22000101    ACTIVE

to this:

ID  Start_Date  End_Date    Status
2   250027  20100101    20120718    TERMINATED
3   250027  20120719    20140730    SUSPENDED
4   250027  20140731    20140907    ACTIVE
5   250027  20140908    20150528    SUSPENDED
6   250027  20150529    22000101    ACTIVE

test.groupby(['ID','Status']).End_Date.agg([min,max])

min max
ID  Status      
250027  ACTIVE  20140907    22000101
      SUSPENDED 20140729    20150528
     TERMINATED 20120718    20120718

test.groupby(['ID','Status']).Start_Date.agg([min,max])
ID  Status      
250027  ACTIVE  20140731    20190819
      SUSPENDED 20120719    20140908
    TERMINATED  20100101    20100101

The python code is not consolidating incorrectly overlapping spans, not sure how i can get this to work.

CodePudding user response:

You may need a shift and cumsum create the sub key

new = df.Status.ne(df.Status.shift()).cumsum()
d = {'ID':'first', 'Start_Date':'min', 'End_Date':'max', 'Status':'first'}
out = df.groupby([df['ID'],new]).agg(d).reset_index(drop=True)
out
Out[122]: 
       ID  Start_Date  End_Date      Status
0  250027    20100101  20120718  TERMINATED
1  250027    20120719  20140730   SUSPENDED
2  250027    20140731  20140907      ACTIVE
3  250027    20140908  20150528   SUSPENDED
4  250027    20150529  22000101      ACTIVE
  • Related