Home > database >  Merge rows and summerize them based on date range condition on Pandas
Merge rows and summerize them based on date range condition on Pandas

Time:12-15

I have a pandas dataframe which includes HR data. For each person, there are some records which I would like to merge based on 'VALID_FROM' column. Each row should consists of one single VALID_FROM with multiple VALID_TO and CODE values.

Here is an example of what I'm working on:

ID      NAME   FAMILY  VALID_FROM   VALID_TO    CODE

000     Mike   Jones   2023-03-01   2260-01-01  115
000     Mike   Jones   2023-03-01   2260-01-01  102
000     Mike   Jones   2022-11-01   2023-02-28  102
000     Mike   Jones   2023-03-01   2260-01-01  101
000     Mike   Jones   2022-10-01   2022-10-31  102
000     Mike   Jones   2022-10-01   2022-10-31  115
000     Mike   Jones   2022-11-01   2023-02-28  115
000     Mike   Jones   2022-10-01   2023-02-28  101

And here is what I would like to have:

ID      NAME   FAMILY  VALID_FROM_1   VALID_TO_1  CODE_1 VALID_TO_2  CODE_2 VALID_TO_3 CODE_3

000     Mike   Jones   2022-10-01     2022-10-31  102    2022-10-31  115    2023-02-28 101
000     Mike   Jones   2022-11-01     2023-02-28  102    2023-02-28  115    n/a        n/a
000     Mike   Jones   2023-03-01     2260-01-01  101    2260-01-01  102    2260-01-01 115

What would be the best way to tackle this issue? I've looked into the groupby() function but I still don't understand it very well.

CodePudding user response:

You could try the following with df your dataframe:

from itertools import chain

group = ["ID", "NAME", "FAMILY", "VALID_FROM"]
df = (
    df
    .sort_values(["VALID_FROM", "VALID_TO","CODE"])
    .assign(COUNT=lambda df: df.groupby(group).transform("cumcount")   1)
    .pivot_table(index=group, columns="COUNT", aggfunc="first")
)
df.columns = [f"{c}_{n}" for c, n in df.columns]
result = df[chain(*zip(df.columns[3:], df.columns[:3]))].reset_index()

Result for your sample:

   ID  NAME FAMILY  VALID_FROM  VALID_TO_1  CODE_1  VALID_TO_2  CODE_2  \
0   0  Mike  Jones  2022-10-01  2022-10-31   102.0  2022-10-31   115.0   
1   0  Mike  Jones  2022-11-01  2023-02-28   102.0  2023-02-28   115.0   
2   0  Mike  Jones  2023-03-01  2260-01-01   101.0  2260-01-01   102.0   

   VALID_TO_3  CODE_3  
0  2023-02-28   101.0  
1         NaN     NaN  
2  2260-01-01   115.0

CodePudding user response:

import pandas as pd

# Load the data into a DataFrame
df = pd.read_csv('hr_data.csv')

# Group the data by ID, NAME, and FAMILY
grouped = df.groupby(['ID', 'NAME', 'FAMILY'])

# Pivot the data so that each unique VALID_FROM value becomes a column
pivoted = grouped.pivot(columns='VALID_FROM')

# View the resulting DataFrame
print(pivoted)

# Use the explode method to expand each list into its own row
expanded = pivoted.explode('CODE')

# View the resulting DataFrame
print(expanded)
  • Related