How to get data from previous dataframe row based on conditions?-CodePudding

I have the following current dataframe and desired dataframe shown below. I have tried a few method but was unable to get my desired output.

data = [['Period', 'Company/Title', 'Personnel No.', 'Start Date', 'End Date'],
   ['01/01/1980 - 31/12/1982', 'AAA', '0', '01/01/1980', '31/12/1982'],
   ['01/01/1980', 'Typist', '0', '01/01/1980', ''],
   ['01/07/1990 - 31/05/1994', 'BBB', '0', '01/07/1990', '31/05/1994'],
   ['01/07/1990', 'Clerk 1', '0', '01/07/1990', ''],
   ['01/01/1994', 'Clerk 2', '0', '01/01/1994', ''],
   ['01/12/1993 - 05/06/1994', 'ZZZ', '1', '01/12/1993', '05/06/1994'],
   ['01/12/1993', 'ZZZ', '1', '01/12/1993', '']]

data_df = pd.DataFrame(data[1:], columns = [data[0]])
data_df

desire_output = [['Period', 'Company/Title', 'Personnel No.', 'Start Date', 'End Date'],
       ['01/01/1980 - 31/12/1982', 'AAA', '0', '01/01/1980', '31/12/1982'],
       ['01/01/1980', 'AAA - Typist', '0', '01/01/1980', '31/12/1982'],
       ['01/07/1990 - 31/05/1994', 'BBB', '0', '01/07/1990', '31/05/1994'],
       ['01/07/1990', 'BBB - Clerk 1', '0', '01/07/1990', '31/12/1993'],
       ['01/01/1994', 'BBB - Clerk 2', '0', '01/01/1994', '31/05/1994'],
       ['01/12/1993 - 05/06/1994', 'ZZZ', '1', '01/12/1993', '05/06/1994'],
       ['01/12/1993', 'ZZZ - Executive', '1', '01/12/1993', '05/06/1994']]
desire_output_df = pd.DataFrame(desire_output[1:], columns = [desire_output[0]])
desire_output_df

Can imagine the data is from linkedin in where the first row contains the total period a person is with a organisation. Next 2 rows should the breakdown of each row within the same organisation.
Row with title will get company name from row with period that contains dash "-"
For personnel no. with multiple role within the same company, the end date would be the previous row date - day.

I did a indx, row in data that check if "-" in row['Period'] and get the company name and manage to store the company name with the role (e.g. "AAA- Typist" but unable to solve the rest. Appreciate any advice on how can i do this.

Thank you!

CodePudding user response：

I would like to suggest you to migrate your data into a proper database.

Anyway, here is a working code. The key idea for end date process is using shift instead of loop. Typo-like values in the original question are changed.

The code is quite ugly and there may be better method to solve your question.

1. Load data

import pandas as pd
from datetime import timedelta

data = [['Period', 'Company/Title', 'Personal No.', 'Start Date', 'End Date'],
   ['01/01/1980 - 31/12/1982', 'AAA', '0', '01/01/1980', '31/12/1982'],
   ['01/01/1980', 'Typist', '0', '01/01/1980', ''],
   ['01/07/1990 - 31/05/1994', 'BBB', '0', '01/07/1990', '31/05/1994'],
   ['01/07/1990', 'Clerk 1', '0', '01/07/1990', ''],
   ['01/01/1994', 'Clerk 2', '0', '01/01/1994', ''],
   ['01/12/1993 - 05/06/1994', 'ZZZ', '1', '01/12/1993', '05/06/1994'],
   ['01/12/1993', 'Executive', '1', '01/12/1993', '']]

df = pd.DataFrame(data[1:], columns = [data[0]])

Result

                    Period Company/Title Personal No.  Start Date    End Date
0  01/01/1980 - 31/12/1982           AAA            0  01/01/1980  31/12/1982
1               01/01/1980        Typist            0  01/01/1980            
2  01/07/1990 - 31/05/1994           BBB            0  01/07/1990  31/05/1994
3               01/07/1990       Clerk 1            0  01/07/1990            
4               01/01/1994       Clerk 2            0  01/01/1994            
5  01/12/1993 - 05/06/1994           ZZZ            1  01/12/1993  05/06/1994
6               01/12/1993     Executive            1  01/12/1993

2. Process for Company/Title

df['Company'] = df['Company/Title'].loc[df.apply(lambda row: '-' in row['Period'], axis=1)]
df['Company'] = df['Company'].fillna(method='ffill')

df['Title'] = df['Company/Title'].loc[df.apply(lambda row: '-' not in row['Period'], axis=1)]

df['Company/Title'] = df.apply(lambda row: row['Company'] if pd.isna(row['Title']) else row['Company']   ' - '   row['Title'], axis=1)

Result

                    Period    Company/Title Personal No.  Start Date  \
0  01/01/1980 - 31/12/1982              AAA            0  01/01/1980   
1               01/01/1980     AAA - Typist            0  01/01/1980   
2  01/07/1990 - 31/05/1994              BBB            0  01/07/1990   
3               01/07/1990    BBB - Clerk 1            0  01/07/1990   
4               01/01/1994    BBB - Clerk 2            0  01/01/1994   
5  01/12/1993 - 05/06/1994              ZZZ            1  01/12/1993   
6               01/12/1993  ZZZ - Executive            1  01/12/1993   

     End Date Company      Title  
0  31/12/1982     AAA        NaN  
1                 AAA     Typist  
2  31/05/1994     BBB        NaN  
3                 BBB    Clerk 1  
4                 BBB    Clerk 2  
5  05/06/1994     ZZZ        NaN  
6                 ZZZ  Executive

3. Process for End Date

df['End Date'] = df['End Date'].replace('', method='ffill')
df['Start Date Offset Next'] = (pd.to_datetime(df['Start Date'].squeeze(), format='%d/%m/%Y') - timedelta(days=1)).shift(-1).squeeze().dt.strftime('%d/%m/%Y')

df['Personal No. Next'] = df['Personal No.'].shift(-1)
df['Company Next'] = df['Company'].shift(-1)
df['Title Next'] = df['Title'].shift(-1)

idx_change_end_date = df.apply(lambda row: (row['Personal No.'] == row['Personal No. Next']) & (row['Company'] == row['Company Next']) & (row['Title'] != row['Title Next']) & pd.notna(row['Title']) & pd.notna(row['Title Next']), axis=1)

df.loc[idx_change_end_date, 'End Date'] = df.loc[idx_change_end_date, 'Start Date Offset Next'].squeeze()

df = df.drop(['Company', 'Title', 'Start Date Offset Next', 'Personal No. Next', 'Company Next', 'Title Next'], level=0, axis=1)

Result

                    Period    Company/Title Personal No.  Start Date  \
0  01/01/1980 - 31/12/1982              AAA            0  01/01/1980   
1               01/01/1980     AAA - Typist            0  01/01/1980   
2  01/07/1990 - 31/05/1994              BBB            0  01/07/1990   
3               01/07/1990    BBB - Clerk 1            0  01/07/1990   
4               01/01/1994    BBB - Clerk 2            0  01/01/1994   
5  01/12/1993 - 05/06/1994              ZZZ            1  01/12/1993   
6               01/12/1993  ZZZ - Executive            1  01/12/1993   

     End Date  
0  31/12/1982  
1  31/12/1982  
2  31/05/1994  
3  31/12/1993  
4  31/05/1994  
5  05/06/1994  
6  05/06/1994