I have a pandas dataframe which includes HR data. For each person, there are some records which I would like to merge based on 'VALID_FROM' column. Each row should consists of one single VALID_FROM with multiple VALID_TO and CODE values.
Here is an example of what I'm working on:
ID NAME FAMILY VALID_FROM VALID_TO CODE
000 Mike Jones 2023-03-01 2260-01-01 115
000 Mike Jones 2023-03-01 2260-01-01 102
000 Mike Jones 2022-11-01 2023-02-28 102
000 Mike Jones 2023-03-01 2260-01-01 101
000 Mike Jones 2022-10-01 2022-10-31 102
000 Mike Jones 2022-10-01 2022-10-31 115
000 Mike Jones 2022-11-01 2023-02-28 115
000 Mike Jones 2022-10-01 2023-02-28 101
And here is what I would like to have:
ID NAME FAMILY VALID_FROM_1 VALID_TO_1 CODE_1 VALID_TO_2 CODE_2 VALID_TO_3 CODE_3
000 Mike Jones 2022-10-01 2022-10-31 102 2022-10-31 115 2023-02-28 101
000 Mike Jones 2022-11-01 2023-02-28 102 2023-02-28 115 n/a n/a
000 Mike Jones 2023-03-01 2260-01-01 101 2260-01-01 102 2260-01-01 115
What would be the best way to tackle this issue? I've looked into the groupby() function but I still don't understand it very well.
CodePudding user response:
You could try the following with df
your dataframe:
from itertools import chain
group = ["ID", "NAME", "FAMILY", "VALID_FROM"]
df = (
df
.sort_values(["VALID_FROM", "VALID_TO","CODE"])
.assign(COUNT=lambda df: df.groupby(group).transform("cumcount") 1)
.pivot_table(index=group, columns="COUNT", aggfunc="first")
)
df.columns = [f"{c}_{n}" for c, n in df.columns]
result = df[chain(*zip(df.columns[3:], df.columns[:3]))].reset_index()
Result for your sample:
ID NAME FAMILY VALID_FROM VALID_TO_1 CODE_1 VALID_TO_2 CODE_2 \
0 0 Mike Jones 2022-10-01 2022-10-31 102.0 2022-10-31 115.0
1 0 Mike Jones 2022-11-01 2023-02-28 102.0 2023-02-28 115.0
2 0 Mike Jones 2023-03-01 2260-01-01 101.0 2260-01-01 102.0
VALID_TO_3 CODE_3
0 2023-02-28 101.0
1 NaN NaN
2 2260-01-01 115.0
CodePudding user response:
import pandas as pd
# Load the data into a DataFrame
df = pd.read_csv('hr_data.csv')
# Group the data by ID, NAME, and FAMILY
grouped = df.groupby(['ID', 'NAME', 'FAMILY'])
# Pivot the data so that each unique VALID_FROM value becomes a column
pivoted = grouped.pivot(columns='VALID_FROM')
# View the resulting DataFrame
print(pivoted)
# Use the explode method to expand each list into its own row
expanded = pivoted.explode('CODE')
# View the resulting DataFrame
print(expanded)