I need to gather information from an existing dataset. The dataset looks as follows:
Source Target Label_S Weight Prop_1 Prop_2 Mer_1 Mer_2
car airplane 0.5 0.2 1 0 0 0
car train 0.5 0.5 1 1 0 1
car bike 0.5 0.2 1 1 0 0
bike motorbike 1 0.7 1 1 0 1
bike car 1 0.2 1 1 0 0
airplane car -1 0.2 0 1 0 0
train car 1 0.5 1 1 1 0
motorbike car 1 0.7 1 1 1 0
motorbike toy 1 0.6 1 0 1 1
Label_S, Prop_1
and Mer_1
are Source
's properties; Prop_2
and Mer_2
are Target
's properties.
I am trying to create a list of unique nodes from both Source
and Target
, including their properties; something like this:
Node Label_S Property Merchandising
car 0.5 1 0
airplane -1 0 0
train 1 1 1
bike 1 1 0
motorbike 1 1 1
toy 0 1
I had not problem to create the list including all the nodes:
source = df['Source'].unique().tolist()
target = df['Target'].unique().tolist()
all_nodes=list(source target)
but I am not actually understanding how to get information from properties columns based on Source
/Target
information.
I think I should first split the dataframe in two dataframes: one with Source
plus the properties of Source
; the other one with Target
elements plus the properties of Target.
Once got this information, maybe it could be good to append the two dataframes and remove duplicates under the column Node
. But I feel that something is wrong: for example, I have Label_S
which is a property of Source
and not of Target...
CodePudding user response:
I'm not sure I understand correctly but you can first create two distinct dataframes for source and target and aggregate the properties in lists:
df_s = df[["Source", "Label_S", "Prop_1", "Mer_1"]].groupby("Source").agg(list)
df_t = df[["Target", "Weight", "Prop_2", "Mer_2"]].groupby("Target").agg(list)
print(df_s)
print(df_t)
Output:
Label_S Prop_1 Mer_1
Source
airplane [-1.0] [0] [0]
bike [1.0, 1.0] [1, 1] [0, 0]
car [0.5, 0.5, 0.5] [1, 1, 1] [0, 0, 0]
motorbike [1.0, 1.0] [1, 1] [1, 1]
train [1.0] [1] [1]
Weight Prop_2 Mer_2
Target
airplane [0.2] [0] [0]
bike [0.2] [1] [0]
car [0.2, 0.2, 0.5, 0.7] [1, 1, 1, 1] [0, 0, 0, 0]
motorbike [0.7] [1] [1]
toy [0.6] [0] [1]
train [0.5] [1] [1]
Edit
You can aggregate the properties differently to keep only one value (e.g. max
), then merge you dataframes:
df_s = df[["Source", "Label_S", "Prop_1", "Mer_1"]].groupby("Source", as_index=False).agg(max)
df_t = df[["Target", "Weight", "Prop_2", "Mer_2"]].groupby("Target", as_index=False).agg(max)
df_s.columns = ["Node", "Label_S", "Property", "Merchandising"]
df_t.columns = ["Node", "Weight", "Property", "Merchandising"]
print(df_s.merge(df_t, how="outer").set_index("Node"))
Output:
Label_S Property Merchandising Weight
Node
airplane -1.0 0 0 0.2
bike 1.0 1 0 0.2
car 0.5 1 0 0.7
motorbike 1.0 1 1 0.7
train 1.0 1 1 0.5
toy NaN 0 1 0.6
If you want to exclude the Weight
column:
print(df_s.merge(df_t[["Node", "Property", "Merchandising"]], how="outer").set_index("Node"))