Home > database >  Company Data formatting
Company Data formatting

Time:12-24

I have the following dataset:

EID CLEAN_NAME Start_Date End_Date
A111 ABC 2003-11-01 2005-12-31
A111 DEF 2005-12-01 2007-08-31
A111 GHI 2007-09-01 2012-01-31
A111 GHI 2012-01-01 2013-05-31
A111 JKL 2013-06-01 2015-03-31
A111 MNO 2015-04-01 2016-04-30
A111 GHI 2016-04-18 2017-11-07
A111 GHI 2017-11-01 NaT

Following is the expected outcome:

EID CLEAN_NAME Start_Date End_Date
A111 ABC 2003-11-01 2005-12-31
A111 DEF 2005-12-01 2007-08-31
A111 GHI 2007-09-01 2013-05-31
A111 JKL 2013-06-01 2015-03-31
A111 MNO 2015-04-01 2016-04-30
A111 GHI 2016-04-18

What I am trying to do here is for company aggregating the company data with respect to their EID and CLEAN_NAME and taking minimum of Start_Date and maximum of End_Date. If I do simple aggregation for GHI company minimum of Start_Date will come as 2007-09-01 and maximum of End_Date will come as 2017-11-07 which is incorrect.

Basically I want to separate GHI company as they are two different data. I understand I may have to use nested for loops for EID, CLEAN_NAME and Start_Date and End_Date to compare the data between each row but I can't seem to get it to work as I am unable to understand the usage and logic to implement it in the loop. Here is what I started with(which is not much to go on but stil...):

   for i in df['EID']:
        for j in df['CLEAN_NAME']:
             if j == j-1
        for k in df['Start_Date']:
             print(df['EID'])

This is just a sample to check how the nested for loop works(which it didn't because I got an error in line if j == j-1). I am new to python and still trying to figure out how things work.

Also note the dataset is already sorted with respect to EID, Start_date, End_date and CLEAN_NAME. So basically I have to merge consecutive duplicate rows with min of Start_date and max of End_date and the last company with min of Start_date and End_date as NULL since it indicates that the employee is currently working in this company..

Any leads would be appreciated thanks.!

CodePudding user response:

Well, here is one solution, which does not make use of loops.

df['groups'] = (df['CLEAN_NAME'].ne(df['CLEAN_NAME'].shift())).cumsum()
df = df.groupby(["groups"]).agg({"EID": "first", "CLEAN_NAME": "first", "Start_Date": min, "End_Date": max}).reset_index()
print(df[["EID", "CLEAN_NAME", "Start_Date", "End_Date"]])

In this approach, we created groups of rows which has the same CLEAN_NAME, occurring consecutively. And then, aggregate the df, based on the group.

The output looks like this

    EID CLEAN_NAME  Start_Date    End_Date
0  A111        ABC  2003-11-01  2005-12-31
1  A111        DEF  2005-12-01  2007-08-31
2  A111        GHI  2007-09-01  2013-05-31
3  A111        JKL  2013-06-01  2015-03-31
4  A111        MNO  2015-04-01  2016-04-30
5  A111        GHI  2016-04-18         NaT

Does this work for you ?

  • Related