Dates inclusive/overlapping in company data-CodePudding

Hi I have the following dataset:

EID	company	SD	ED
B12345	A11	1/1/2021	3/1/2021
B12345	B11	1/1/2021	1/20/2021
B12345	C11	1/21/2021	2/1/2021
B12345	C11	2/2/2021	3/1/2021
B12346	A11	1/1/2010	12/31/2021
B12346	B11	1/1/2011	12/31/2015
B12346	C11	1/1/2022	12/31/2022

As you may have observed the dates here for EID B12345 both companies B11 and C11 respectively are inclusive of dates for company A11. Similarly for EID B12346 dates for company B11 is inclusive of date for company A11. So what I want to do here is for all the EIDs that I have in the data frame I want to check for each company whether the Start Date (SD) and End Date (ED) are inclusive of previous company or not. And if they are inclusive then the final dates should be taken of the company whose dates lie in maximum range. And if the dates are different then take the data as it is. So the output should contain the dates which are for company A11.

Basically what I want to check here is whether there is any overlap between the two date columns for each eid with respect to company and if there is then take the maximum range else keep the data as it is. Also its not just comparing the company with the previous company, but also with the company before that if the previous company should be eliminated. The expected output is as follows:

EID	company	SD	ED
B12345	A11	1/1/2021	3/1/2021
B12346	A11	1/1/2010	12/31/2021
B12346	C11	1/1/2022	12/31/2022

Below is the code I have tried:

group = (df['SD'] <=  df.groupby(['EID'['END'].shift()).groupby([df['EID']]).cumsum()
df = df.loc[df.groupby([df['EID'], group])['company'].idxmax()].sort_index()

the error that I am getting here is:

TypeError: reduction operation 'argmax' not allowed for this dtype

I have been unable to understand what this error means. I even tried to change the data type of company column but still not able to fix it.

Any leads would be appreciated. Thanks.!

CodePudding user response：

Your question is not complete. You not only want to compare the company with the previous company, but also with the company before that if the previous company should be eliminated.

Otherwise, it would be a simple comparison:

df.groupby('EID').apply(
    lambda group: group[~(
        (group.SD >= group.SD.shift()) &
        (group.ED <= group.ED.shift())
    )]
)

However, this results in the following:

EID     company SD          ED
B12345  A11     1/1/2021    3/1/2021
B12345  C11     1/21/2021   2/1/2021
B12345  C11     2/2/2021    3/1/2021
B12346  A11     1/1/2010    12/31/2021
B12346  C11     1/1/2022    12/31/2022

Now, you want to compare C11 with A11. You can do this with a (not so elegant) while loop:

different = True
while different:
    df_orig = df.copy()
    df = df.groupby('EID').apply(lambda group: group[~((group.SD >= group.SD.shift()) & (group.ED <= group.ED.shift()))]).reset_index(drop=True)
    different = not df.equals(df_orig)

This will keep doing the same trick, and only stops if the resulting dataframe equals the original dataframe.

Resulting in:

EID     company SD          ED
B12345  A11     1/1/2021    3/1/2021
B12346  A11     1/1/2010    12/31/2021
B12346  C11     1/1/2022    12/31/2022