Home > front end >  Detect parent rows of each row and add their values to a new column
Detect parent rows of each row and add their values to a new column

Time:02-04

I have a dataframe that contains multiple hierarchy columns and a value attach to them, something like this:

location hierarhcy industry hierarchy value
0 US All V1
1 US-California All V2
2 US-California All-HT V3
3 US-California-LA All V4
4 US-California-LA All-HT V5
5 US-California-LA All-HT-SoftWare V6
6 US-California-LA All-Manufacturing V7

Now I want to find the immediate parents of each row and add their value a a new column. My example just shows two parents but it could be more. So, the expected output would be as follows:

location hierarhcy industry hierarchy value value-parent-1 value-parent-2
0 US All V1 NA NA
1 US-California All V2 V1 NA
2 US-California All-HT V3 V2 NA
3 US-California-LA All V4 V2 NA
4 US-California-LA All-HT V5 V3 V4
5 US-California-LA All-HT-SoftWare V6 V5 NA
6 US-California-LA All-Manufacturing V7 V4 NA

My Dataframe could be extremely large, so the speed and memory are important factors. So, row-wise process won't really work.

Logic:

The immediate parents for a row will be all rows that are one level higher than one of the hierarchies. For example for row 5, the potential immediate parents will be:

- location hierarhcy industry hierarchy
0 US-California All-HT-SoftWare
1 US-California-LA All-HT

This is the code to create the dataframe if you want to play with it.

import pandas as pd

df = pd.DataFrame({'location hierarchy': ['US', 'US-California', 'US-California', 'US-California-LA', 'US-California-LA', 'US-California-LA', 'US-California-LA'],
                   'industry hierarchy': ['All', 'All', 'All-HT', 'All', 'All-HT', 'All-HT-SoftWare', 'All-Manufacturing'],
                   'value': ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7']})

CodePudding user response:

Here you are:

df = pd.DataFrame({'Location Hierarchy': ['US', 'US-California', 'US-California', 'US-California-LA', 'US-California-LA', 'US-California-LA', 'US-California-LA'],
                   'Industry Hierarchy': ['All', 'All', 'All-HT', 'All', 'All-HT', 'All-HT-SoftWare', 'All-Manufacturing'],
                   'Value': ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7']})

hierarchies = ["Location Hierarchy", "Industry Hierarchy"]
for i, hierarchy in enumerate(hierarchies): 
    df1 = df.copy()
    df1[hierarchy] = df1[hierarchy].str.split('-').str[:-1].str.join("-")
    df[f"Value_parent_{i 1}"] = df1.merge(df, on=hierarchies, how="left", suffixes=["", "_parent"])["Value_parent"]
  • Related