Selecting nodes based on a percentage to assign labels

I have two datasets from which I need to build a network whose nodes are colored according to the value of their label. The first dataset has the following two columns (First and Second are the source and target nodes):

First  Second
0      2
0      3
2      4
2      5
3      6
...    ...
5      2
4      3
4      2

There are 100 distinct nodes in total.

A second dataset contains the label of every distinct node in the first dataset.

I would like to create a network where:

15% of the nodes keep their own label from the second dataset;
85% of the nodes have no label: in a new dataset, their original label is replaced by NaN.
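To make the 15% / 85% split concrete, here is a minimal sketch of drawing it over distinct nodes rather than over rows. The ring-shaped edge list, the names edges, nodes, keep and hide, and the fixed random_state are all assumptions made for illustration; they are not part of the original data.

```python
import pandas as pd

# Hypothetical edge list: a ring of 100 nodes, standing in for the real data
edges = pd.DataFrame({"First": range(100),
                      "Second": [(i + 1) % 100 for i in range(100)]})

# Every distinct node, whether it appears as a source or as a target
all_ids = edges[["First", "Second"]].to_numpy().ravel()
nodes = pd.Series(pd.unique(all_ids), name="Node")

# Nodes that keep their label (15%); random_state only makes the draw repeatable
keep = nodes.sample(frac=0.15, random_state=0)

# The remaining nodes (85%) are the ones whose label would become NaN
hide = nodes.drop(keep.index)

print(len(nodes), len(keep), len(hide))
```

With 100 distinct nodes this selects 15 nodes to keep and leaves 85 to hide, whatever the number of edges.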

What I would expect is something like this:

new dataset

First Second Label 
0      2       1
2      0       2 
0      3       1
3      0       2
2      4       2
4      2       NaN
2      5       NaN
5      2       NaN
4      3       NaN
3      4       NaN
...            ...

Label refers to the first node, so every node in the network must end up with either its own label from the second dataset or NaN. The percentages above only determine how many nodes keep their label and how many have it removed so that it can be predicted.
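The expected output above appears to list each edge in both directions (0 2 followed by 2 0, and so on), which lets every node show up in the First column. A minimal sketch of producing that layout, assuming a small hypothetical edges frame with the question's column names:

```python
import pandas as pd

# Hypothetical sample of the first dataset
edges = pd.DataFrame({"First": [0, 0, 2], "Second": [2, 3, 4]})

# Swap the two column names to get the reverse direction of every edge
reverse = edges.rename(columns={"First": "Second", "Second": "First"})

# Stack both directions; a stable sort on the original row index places
# each reversed edge directly after the edge it came from
both = (pd.concat([edges, reverse])
          .sort_index(kind="stable")
          .reset_index(drop=True)[["First", "Second"]])

print(both)
```

Whether the reversed rows are actually wanted depends on whether the network is meant to be undirected; the sketch only reproduces the row order shown in the example.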

I have tried to use mask, but I cannot control how many nodes keep their label and how many have it replaced. I would like to randomly choose 15% of the nodes, keep their labels, and remove the labels of the remaining 85%.

CodePudding user response:

First merge your two dataframes on the First and Node columns, then sample 85% of the rows of the merged dataframe and set the Label column to NaN for that sample:

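df3 = df.merge(df2, left_on='First', right_on='Node')
# Sample 85% of the rows and blank their Label with a real NaN, not the string 'NaN'
sampled = df3.sample(frac=0.85).index
df3['Label'] = df3['Label'].mask(df3.index.isin(sampled))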
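Sampling rows of the merged dataframe hides 85% of the edges, so a node that appears in several rows can end up labeled in one row and NaN in another. If the split should be made over nodes, as the question describes, one possible variant is to sample the node list first and then blank the labels of every other node. This is a sketch under assumptions: the small df and df2 frames and their label values are made up for illustration, and kept_nodes and random_state are names and settings introduced here.

```python
import pandas as pd

# Hypothetical stand-ins for the question's two datasets
df = pd.DataFrame({"First": [0, 0, 2, 2, 3, 5, 4, 4],
                   "Second": [2, 3, 4, 5, 6, 2, 3, 2]})
df2 = pd.DataFrame({"Node": [0, 2, 3, 4, 5, 6],
                    "Label": [1, 2, 2, 1, 1, 2]})

# Pick the nodes that keep their label: 15% of the distinct nodes
kept_nodes = df2["Node"].sample(frac=0.15, random_state=42)

# Attach the label of the First node to every edge
df3 = (df.merge(df2, left_on="First", right_on="Node", how="left")
         .drop(columns="Node"))

# Keep Label only where First is a kept node; where() fills the rest with NaN
df3["Label"] = df3["Label"].where(df3["First"].isin(kept_nodes))

print(df3)
```

Because the mask is built from node membership, all rows that share the same First value are either labeled or NaN together. Note that a node which only ever appears in Second contributes no row of its own here.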