Extracting values from existing dataset-CodePudding

I need to gather information from an existing dataset. The dataset looks as follows:

Source      Target     Label_S  Weight   Prop_1    Prop_2  Mer_1   Mer_2
car         airplane     0.5       0.2     1         0      0       0
car         train        0.5       0.5     1         1      0       1
car         bike         0.5       0.2     1         1      0       0
bike        motorbike    1       0.7       1         1      0       1
bike        car          1       0.2       1         1      0       0
airplane    car          -1      0.2       0         1      0       0         
train       car          1       0.5       1         1      1       0  
motorbike   car          1       0.7       1         1      1       0
motorbike   toy          1       0.6       1         0      1       1

Label_S, Prop_1 and Mer_1 are Source's properties; Prop_2 and Mer_2 are Target's properties. I am trying to create a list of unique nodes from both Source and Target, including their properties; something like this:

Node       Label_S  Property   Merchandising
car        0.5        1             0
airplane   -1         0             0
train      1          1             1
bike       1          1             0 
motorbike  1          1             1
toy                   0             1

I had not problem to create the list including all the nodes:

source = df['Source'].unique().tolist()
target = df['Target'].unique().tolist()
all_nodes=list(source   target)

but I am not actually understanding how to get information from properties columns based on Source/Target information.

I think I should first split the dataframe in two dataframes: one with Source plus the properties of Source; the other one with Target elements plus the properties of Target. Once got this information, maybe it could be good to append the two dataframes and remove duplicates under the column Node. But I feel that something is wrong: for example, I have Label_S which is a property of Source and not of Target...

CodePudding user response：

I'm not sure I understand correctly but you can first create two distinct dataframes for source and target and aggregate the properties in lists:

df_s = df[["Source", "Label_S", "Prop_1", "Mer_1"]].groupby("Source").agg(list)
df_t = df[["Target", "Weight", "Prop_2", "Mer_2"]].groupby("Target").agg(list)

print(df_s)
print(df_t)

Output:

                   Label_S     Prop_1      Mer_1
Source                                          
airplane            [-1.0]        [0]        [0]
bike            [1.0, 1.0]     [1, 1]     [0, 0]
car        [0.5, 0.5, 0.5]  [1, 1, 1]  [0, 0, 0]
motorbike       [1.0, 1.0]     [1, 1]     [1, 1]
train                [1.0]        [1]        [1]
                         Weight        Prop_2         Mer_2
Target                                                     
airplane                  [0.2]           [0]           [0]
bike                      [0.2]           [1]           [0]
car        [0.2, 0.2, 0.5, 0.7]  [1, 1, 1, 1]  [0, 0, 0, 0]
motorbike                 [0.7]           [1]           [1]
toy                       [0.6]           [0]           [1]
train                     [0.5]           [1]           [1]

Edit

You can aggregate the properties differently to keep only one value (e.g. max), then merge you dataframes:

df_s = df[["Source", "Label_S", "Prop_1", "Mer_1"]].groupby("Source", as_index=False).agg(max)
df_t = df[["Target", "Weight", "Prop_2", "Mer_2"]].groupby("Target", as_index=False).agg(max)

df_s.columns = ["Node", "Label_S", "Property", "Merchandising"]
df_t.columns = ["Node", "Weight", "Property", "Merchandising"]

print(df_s.merge(df_t, how="outer").set_index("Node"))

Output:

           Label_S  Property  Merchandising  Weight
Node                                               
airplane      -1.0         0              0     0.2
bike           1.0         1              0     0.2
car            0.5         1              0     0.7
motorbike      1.0         1              1     0.7
train          1.0         1              1     0.5
toy            NaN         0              1     0.6

If you want to exclude the Weight column:

print(df_s.merge(df_t[["Node", "Property", "Merchandising"]], how="outer").set_index("Node"))