Company Data formatting-CodePudding

I have the following dataset:

EID	CLEAN_NAME	Start_Date	End_Date
A111	ABC	2003-11-01	2005-12-31
A111	DEF	2005-12-01	2007-08-31
A111	GHI	2007-09-01	2012-01-31
A111	GHI	2012-01-01	2013-05-31
A111	JKL	2013-06-01	2015-03-31
A111	MNO	2015-04-01	2016-04-30
A111	GHI	2016-04-18	2017-11-07
A111	GHI	2017-11-01	NaT

Following is the expected outcome:

EID	CLEAN_NAME	Start_Date	End_Date
A111	ABC	2003-11-01	2005-12-31
A111	DEF	2005-12-01	2007-08-31
A111	GHI	2007-09-01	2013-05-31
A111	JKL	2013-06-01	2015-03-31
A111	MNO	2015-04-01	2016-04-30
A111	GHI	2016-04-18

What I am trying to do here is for company aggregating the company data with respect to their EID and CLEAN_NAME and taking minimum of Start_Date and maximum of End_Date. If I do simple aggregation for GHI company minimum of Start_Date will come as 2007-09-01 and maximum of End_Date will come as 2017-11-07 which is incorrect.

Basically I want to separate GHI company as they are two different data. I understand I may have to use nested for loops for EID, CLEAN_NAME and Start_Date and End_Date to compare the data between each row but I can't seem to get it to work as I am unable to understand the usage and logic to implement it in the loop. Here is what I started with(which is not much to go on but stil...):

   for i in df['EID']:
        for j in df['CLEAN_NAME']:
             if j == j-1
        for k in df['Start_Date']:
             print(df['EID'])

This is just a sample to check how the nested for loop works(which it didn't because I got an error in line if j == j-1). I am new to python and still trying to figure out how things work.

Also note the dataset is already sorted with respect to EID, Start_date, End_date and CLEAN_NAME. So basically I have to merge consecutive duplicate rows with min of Start_date and max of End_date and the last company with min of Start_date and End_date as NULL since it indicates that the employee is currently working in this company..

Any leads would be appreciated thanks.!

CodePudding user response：

Well, here is one solution, which does not make use of loops.

df['groups'] = (df['CLEAN_NAME'].ne(df['CLEAN_NAME'].shift())).cumsum()
df = df.groupby(["groups"]).agg({"EID": "first", "CLEAN_NAME": "first", "Start_Date": min, "End_Date": max}).reset_index()
print(df[["EID", "CLEAN_NAME", "Start_Date", "End_Date"]])

In this approach, we created groups of rows which has the same CLEAN_NAME, occurring consecutively. And then, aggregate the df, based on the group.

The output looks like this

    EID CLEAN_NAME  Start_Date    End_Date
0  A111        ABC  2003-11-01  2005-12-31
1  A111        DEF  2005-12-01  2007-08-31
2  A111        GHI  2007-09-01  2013-05-31
3  A111        JKL  2013-06-01  2015-03-31
4  A111        MNO  2015-04-01  2016-04-30
5  A111        GHI  2016-04-18         NaT

Does this work for you ?