I have a data frame with an embedded tree structure, like this:
df1
Category Type Dependent-Category
0 1 ~ A
1 1 ~ B
2 1 ~ C
3 1 ~ 14
4 1 ~ D
5 1 P NaN
6 A ~ C
7 A ~ D
8 A ~ 3
9 A P NaN
10 B ~ D
11 B ~ C
12 B ~ 12
13 B P NaN
14 C ~ D
15 C ~ 9
16 C P NaN
17 D ~ 12
18 D ~ 3
19 D ~ 8
20 D P NaN
Category D rows are made up only of numerical categories, defined as:
D ~ 12
D ~ 3
D ~ 8
D P NaN
Category C rows are made up of a numerical Category and the previous Category D, defined as:
C ~ D
C ~ 9
C P NaN
Category B rows are made up of a numerical Category and the previous 2 Categories D and C, defined as:
B ~ D
B ~ C
B ~ 12
B P NaN
Category A rows are made up of a numerical Category and the previous 2 Categories C and D, defined as:
A ~ C
A ~ D
A ~ 3
A P NaN
Category 1 rows are made up of a numerical Category and the previous 4 Categories A, B, C and D, defined as:
1 ~ A
1 ~ B
1 ~ C
1 ~ 14
1 ~ D
1 P NaN
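In other words, the dependency structure described above can be summarised as a plain mapping (just a restatement of the rows listed here, for orientation):
# each key depends on the listed categories; bare numbers are the leaves
tree = {
    1: ["A", "B", "C", 14, "D"],
    "A": ["C", "D", 3],
    "B": ["D", "C", 12],
    "C": ["D", 9],
    "D": [12, 3, 8],
}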
The aim is to get a final dataframe that is a total concatenation of all the Categories. For example, wherever Category A appears in the Dependent-Category column of df1, that entire row should be replaced by Category A's rows. In the end the Dependent-Category column should consist only of numbers (and NaNs).
CodePudding user response:
A possible solution is to separate out the different categories as dataframes, merge them sequentially with the original dataframe, and then apply a conditional replacement (np.where).
Here is my solution doing that:
import numpy as np
import pandas as pd

# sample data (a shortened version of the question's dataframe)
d = {
    "Category": [1, 1, 1, 1, "A", "A", "A", "B", "B"],
    "Type": ["~", "~", "~", "P", "~", "~", "P", "~", "P"],
    "D_C": ["B", "A", 14, "missing", 4, "B", "missing", 12, "missing"],
}
df = pd.DataFrame(d)

# one sub-dataframe per lettered category
df_A = df.loc[df.Category == "A"]
df_B = df.loc[df.Category == "B"]

# merge the A rows onto the rows that reference "A", then replace them
df1 = df.merge(df_A, left_on=df.D_C, right_on=df_A.Category, how="outer")
df1_A = pd.DataFrame()
df1_A["Category"] = np.where(
    df1.D_C_x == df1.Category_y, df1.Category_y, df1.Category_x
)
df1_A["Type"] = np.where(df1.D_C_x == df1.Category_y, df1.Type_y, df1.Type_x)
df1_A["D_C"] = np.where(df1.D_C_x == df1.Category_y, df1.D_C_y, df1.D_C_x)

# repeat the same merge-and-replace step for the B rows
df2 = df1_A.merge(df_B, left_on=df1_A.D_C, right_on=df_B.Category, how="outer")
df_final = pd.DataFrame()
df_final["Category"] = np.where(
    df2.D_C_x == df2.Category_y, df2.Category_y, df2.Category_x
)
df_final["Type"] = np.where(df2.D_C_x == df2.Category_y, df2.Type_y, df2.Type_x)
df_final["D_C"] = np.where(df2.D_C_x == df2.Category_y, df2.D_C_y, df2.D_C_x)
print(df_final)
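This covers categories A and B from the shortened sample; if the full data also contains C and D (as in the question), the same merge-and-replace pass could in principle be wrapped in a small helper and repeated per lettered category. A rough sketch of that idea, reusing df from above (hypothetical, not part of the original answer):
def expand_once(frame, sub):
    # one merge-and-replace pass: rows whose D_C equals sub's Category
    # are replaced by the matching row(s) of sub
    merged = frame.merge(sub, left_on=frame.D_C, right_on=sub.Category, how="outer")
    hit = merged.D_C_x == merged.Category_y
    return pd.DataFrame({
        "Category": np.where(hit, merged.Category_y, merged.Category_x),
        "Type": np.where(hit, merged.Type_y, merged.Type_x),
        "D_C": np.where(hit, merged.D_C_y, merged.D_C_x),
    })

out = df
for cat in ["A", "B", "C", "D"]:
    sub = df.loc[df.Category == cat]
    if not sub.empty:  # C and D are absent from the shortened sample
        out = expand_once(out, sub)
Note that deeply nested references may still need more than one sweep; the graph-based answer below avoids that concern altogether.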
Let me know if you have any questions about it.
CodePudding user response:
I would approach this using graph theory.
The advantage is that you do not need to worry about the depth of the mappings (e.g., B->A->C->D would be tricky, even worse if you have circular paths).
Your data describes a directed graph of dependent categories. You can use networkx to generate that graph from the part of your DataFrame that has a dependent category, and to map the children automatically, as shown below.
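Judging from the row indices in the output at the bottom, the snippet appears to have been run on the smaller sample from the other answer, using the question's column name and NaN for the missing values; a setup along those lines (an assumption, not shown in the original answer) would be:
import numpy as np
import pandas as pd

# assumed sample frame; it reproduces the output shown at the end
df = pd.DataFrame({
    "Category": [1, 1, 1, 1, "A", "A", "A", "B", "B"],
    "Type": ["~", "~", "~", "P", "~", "~", "P", "~", "P"],
    "Dependent-Category": ["B", "A", 14, np.nan, 4, "B", np.nan, 12, np.nan],
})
With df in place, the graph-based replacement is: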
import networkx as nx

# split the DataFrame in two parts:
# df1 will contain the rows with a dependent category
# df2 will contain the rows with the values
# using a regex here to identify letters, but any other logic can be used
m = df['Dependent-Category'].str.match('[a-zA-Z]+').fillna(False)
df1 = df[m]
df2 = df[~m]

# generating the graph
G = nx.from_pandas_edgelist(df1, create_using=nx.DiGraph,
                            source='Dependent-Category', target='Category')
# keep only nodes with children
nodes = set(filter(G.out_degree, G.nodes))
# {'A', 'B'}
# mapping nodes to children and self
d = {n: (nx.descendants(G, n)&nodes)|{n} for n in nodes}
# {'A': {'A'}, 'B': {'A', 'B'}}
# replacing the values in the DataFrame
df3 = (df1['Dependent-Category']
       .map(d).explode()
       .to_frame('Category')
       .reset_index()
       .merge(df2)
       .set_index('index')
       )
# concatenating df3 back with the value rows in df2
out = pd.concat([df2, df3]).sort_index(kind='stable')
output:
Category Type Dependent-Category
0 A ~ 4
0 A P NaN
0 B ~ 12
0 B P NaN
1 A ~ 4
1 A P NaN
2 1 ~ 14
3 1 P NaN
4 A ~ 4
5 A ~ 4
5 A P NaN
5 B ~ 12
5 B P NaN
6 A P NaN
7 B ~ 12
8 B P NaN
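As a minimal check of the question's goal (assuming the out frame built above), one can confirm that no letters are left in Dependent-Category:
# after the replacement, Dependent-Category should hold only numbers and NaN
leftover = out['Dependent-Category'].dropna().astype(str).str.match('[A-Za-z]')
assert not leftover.any()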