I have a pandas dataframe with a flattened hierarchy:
Level 1 ID | Level 2 ID | Level 3 ID | Level 4 ID | Name | Path |
---|---|---|---|---|---|
1 | null | null | null | Finance | Finance |
1 | 4 | null | null | Reporting | Finance > Reporting |
1 | 4 | 5 | null | Tax Reporting | Finance > Reporting > Tax Reporting |
What I want to do is add or replace with the Level ID columns with 4 Level Name columns based on the Level [] ID columns, like the following:
Level 1 Name | Level 2 Name | Level 3 Name | Level 4 Name | Name | Path |
---|---|---|---|---|---|
Finance | null | null | null | Finance | Finance |
Finance | Reporting | null | null | Reporting | Finance > Reporting |
Finance | Reporting | Tax Reporting | null | Tax Reporting | Finance > Reporting > Tax Reporting |
I would use a separator on the Path column, but in the real dataframe, there are IDs instead of names (formatted like "1 > 4 > 5")
How should I approach this?
CodePudding user response:
The logic is unclear, in particular what is the source of the final values? See two different options below.
Assuming the source is df['Name']
cols = df.filter(like='Level ').columns
names = df['Name'].values
mask = df[cols[:len(names)]].notna()
df[cols[:len(names)]] = mask.mul(names, axis=1).where(mask)
Output:
Level 1 ID Level 2 ID Level 3 ID Level 4 ID Name Path
0 Finance NaN NaN NaN Finance Finance
1 Finance Reporting NaN NaN Reporting Finance > Reporting
2 Finance Reporting Tax Reporting NaN Tax Reporting Finance > Reporting > Tax Reporting
If you rather want to extract from "Path"
cols = df.filter(like='Level ').columns
names = df['Path'].str.split(' > ', expand=True)
df.loc[:, cols[:names.shape[1]]] = names.to_numpy()
Output:
Level 1 ID Level 2 ID Level 3 ID Level 4 ID Name Path
0 Finance None None NaN Finance Finance
1 Finance Reporting None NaN Reporting Finance > Reporting
2 Finance Reporting Tax Reporting NaN Tax Reporting Finance > Reporting > Tax Reporting
CodePudding user response:
You can create a mapping Series to resolve number -> name:
cols = df1.columns[df.columns.str.contains('Level \d ID')]
idx = df.filter(like='Level').ffill(axis=1).iloc[:, -1].tolist()
sr = pd.Series(df['Name'].tolist(), index=idx)
df[cols] = df[cols].apply(lambda x: x.map(sr))
Output:
>>> df
Level 1 ID Level 2 ID Level 3 ID Level 4 ID Name Path
0 Finance NaN NaN NaN Finance Finance
1 Finance Reporting NaN NaN Reporting Finance > Reporting
2 Finance Reporting Tax Reporting NaN Tax Reporting Finance > Reporting > Tax Reporting