I have a dataframe that is an edge list for an undirected graph. It looks like this:
  node 1 node 2       doc
0     Kn     Kn   doc5477
1     TS     Kn   doc5477
2     Kn     TS   doc5477
3     TS     TS   doc5477
4     Kn     Kn  doc10967
5     Kn     TS  doc10967
6     TS     TS  doc10967
7     TS     Kn  doc10967
How can I make sure that each combination of nodes appears only once per document? For example, rows 1 and 2 describe the same edge (the graph is undirected), so I only want it to appear once. The same goes for rows 5 and 7.
My dataframe should then look like this:
  node 1 node 2       doc
0     Kn     Kn   doc5477
1     TS     Kn   doc5477
3     TS     TS   doc5477
4     Kn     Kn  doc10967
5     Kn     TS  doc10967
6     TS     TS  doc10967
CodePudding user response:
First, select the columns on which you need a unique combination ('node 1', 'node 2' and 'doc' in your case). Then sort the values within each row and return them as a tuple, so that rows such as (TS, Kn) and (Kn, TS) become identical. A tuple is used rather than a list because lists are not hashable and pandas.Series.duplicated cannot compare them. Finally, negate the duplicated mask and use it to keep only the first row of each unique combination.
Try this:
out = df.loc[~df[['node 1', 'node 2', 'doc']].apply(lambda row: tuple(sorted(row)), axis=1).duplicated()]
# Output :
print(out)
  node 1 node 2       doc
0     Kn     Kn   doc5477
1     TS     Kn   doc5477
3     TS     TS   doc5477
4     Kn     Kn  doc10967
5     Kn     TS  doc10967
6     TS     TS  doc10967
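As a variation, here is a minimal sketch that sorts only the two node columns with numpy.sort and leaves doc out of the sort, which avoids relying on how the document names happen to compare against the node names. The sample frame is rebuilt from the question; the names node_cols and ordered are just illustrative helpers, not anything from the original post.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "node 1": ["Kn", "TS", "Kn", "TS", "Kn", "Kn", "TS", "TS"],
    "node 2": ["Kn", "Kn", "TS", "TS", "Kn", "TS", "TS", "Kn"],
    "doc": ["doc5477"] * 4 + ["doc10967"] * 4,
})

node_cols = ["node 1", "node 2"]  # illustrative helper name

# Sort the two endpoints within each row so that (TS, Kn) and (Kn, TS)
# end up as the same ordered pair.
ordered = pd.DataFrame(
    np.sort(df[node_cols].to_numpy(), axis=1),
    columns=node_cols,
    index=df.index,
)
ordered["doc"] = df["doc"]

# Keep the first occurrence of each (ordered pair, doc) combination,
# but select the rows from the original frame so its values are untouched.
out = df.loc[~ordered.duplicated()]
print(out)
```

Because the mask is computed on the helper frame and applied to df, the surviving rows keep their original node order (row 1 stays TS, Kn) and their original index labels.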