I have the following dataset:
EID | CLEAN_NAME | Start_Date | End_Date |
---|---|---|---|
A111 | ABC | 2003-11-01 | 2005-12-31 |
A111 | DEF | 2005-12-01 | 2007-08-31 |
A111 | GHI | 2007-09-01 | 2012-01-31 |
A111 | GHI | 2012-01-01 | 2013-05-31 |
A111 | JKL | 2013-06-01 | 2015-03-31 |
A111 | MNO | 2015-04-01 | 2016-04-30 |
A111 | GHI | 2016-04-18 | 2017-11-07 |
A111 | GHI | 2017-11-01 | NaT |
Following is the expected outcome:
EID | CLEAN_NAME | Start_Date | End_Date |
---|---|---|---|
A111 | ABC | 2003-11-01 | 2005-12-31 |
A111 | DEF | 2005-12-01 | 2007-08-31 |
A111 | GHI | 2007-09-01 | 2013-05-31 |
A111 | JKL | 2013-06-01 | 2015-03-31 |
A111 | MNO | 2015-04-01 | 2016-04-30 |
A111 | GHI | 2016-04-18 |
What I am trying to do here is for company aggregating the company data with respect to their EID and CLEAN_NAME and taking minimum of Start_Date and maximum of End_Date. If I do simple aggregation for GHI company minimum of Start_Date will come as 2007-09-01 and maximum of End_Date will come as 2017-11-07 which is incorrect.
Basically I want to separate GHI company as they are two different data. I understand I may have to use nested for loops for EID, CLEAN_NAME and Start_Date and End_Date to compare the data between each row but I can't seem to get it to work as I am unable to understand the usage and logic to implement it in the loop. Here is what I started with(which is not much to go on but stil...):
for i in df['EID']:
for j in df['CLEAN_NAME']:
if j == j-1
for k in df['Start_Date']:
print(df['EID'])
This is just a sample to check how the nested for loop works(which it didn't because I got an error in line if j == j-1). I am new to python and still trying to figure out how things work.
Also note the dataset is already sorted with respect to EID, Start_date, End_date and CLEAN_NAME. So basically I have to merge consecutive duplicate rows with min of Start_date and max of End_date and the last company with min of Start_date and End_date as NULL since it indicates that the employee is currently working in this company..
Any leads would be appreciated thanks.!
CodePudding user response:
Well, here is one solution, which does not make use of loops.
df['groups'] = (df['CLEAN_NAME'].ne(df['CLEAN_NAME'].shift())).cumsum()
df = df.groupby(["groups"]).agg({"EID": "first", "CLEAN_NAME": "first", "Start_Date": min, "End_Date": max}).reset_index()
print(df[["EID", "CLEAN_NAME", "Start_Date", "End_Date"]])
In this approach, we created groups of rows which has the same CLEAN_NAME, occurring consecutively. And then, aggregate the df, based on the group.
The output looks like this
EID CLEAN_NAME Start_Date End_Date
0 A111 ABC 2003-11-01 2005-12-31
1 A111 DEF 2005-12-01 2007-08-31
2 A111 GHI 2007-09-01 2013-05-31
3 A111 JKL 2013-06-01 2015-03-31
4 A111 MNO 2015-04-01 2016-04-30
5 A111 GHI 2016-04-18 NaT
Does this work for you ?