I have a dataframe that contains multiple hierarchy columns and a value attach to them, something like this:
location hierarhcy | industry hierarchy | value | |
---|---|---|---|
0 | US | All | V1 |
1 | US-California | All | V2 |
2 | US-California | All-HT | V3 |
3 | US-California-LA | All | V4 |
4 | US-California-LA | All-HT | V5 |
5 | US-California-LA | All-HT-SoftWare | V6 |
6 | US-California-LA | All-Manufacturing | V7 |
Now I want to find the immediate parents of each row and add their value a a new column. My example just shows two parents but it could be more. So, the expected output would be as follows:
location hierarhcy | industry hierarchy | value | value-parent-1 | value-parent-2 | |
---|---|---|---|---|---|
0 | US | All | V1 | NA | NA |
1 | US-California | All | V2 | V1 | NA |
2 | US-California | All-HT | V3 | V2 | NA |
3 | US-California-LA | All | V4 | V2 | NA |
4 | US-California-LA | All-HT | V5 | V3 | V4 |
5 | US-California-LA | All-HT-SoftWare | V6 | V5 | NA |
6 | US-California-LA | All-Manufacturing | V7 | V4 | NA |
My Dataframe could be extremely large, so the speed and memory are important factors. So, row-wise process won't really work.
Logic:
The immediate parents for a row will be all rows that are one level higher than one of the hierarchies. For example for row 5, the potential immediate parents will be:
- | location hierarhcy | industry hierarchy |
---|---|---|
0 | US-California | All-HT-SoftWare |
1 | US-California-LA | All-HT |
This is the code to create the dataframe if you want to play with it.
import pandas as pd
df = pd.DataFrame({'location hierarchy': ['US', 'US-California', 'US-California', 'US-California-LA', 'US-California-LA', 'US-California-LA', 'US-California-LA'],
'industry hierarchy': ['All', 'All', 'All-HT', 'All', 'All-HT', 'All-HT-SoftWare', 'All-Manufacturing'],
'value': ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7']})
CodePudding user response:
Here you are:
df = pd.DataFrame({'Location Hierarchy': ['US', 'US-California', 'US-California', 'US-California-LA', 'US-California-LA', 'US-California-LA', 'US-California-LA'],
'Industry Hierarchy': ['All', 'All', 'All-HT', 'All', 'All-HT', 'All-HT-SoftWare', 'All-Manufacturing'],
'Value': ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7']})
hierarchies = ["Location Hierarchy", "Industry Hierarchy"]
for i, hierarchy in enumerate(hierarchies):
df1 = df.copy()
df1[hierarchy] = df1[hierarchy].str.split('-').str[:-1].str.join("-")
df[f"Value_parent_{i 1}"] = df1.merge(df, on=hierarchies, how="left", suffixes=["", "_parent"])["Value_parent"]