remove same combinations in dataframe pandas

Time:11-21

I have a dataframe that is an edge list for an undirected graph. It looks like this:

  node 1 node 2        doc
0     Kn     Kn    doc5477
1     TS     Kn    doc5477
2     Kn     TS    doc5477
3     TS     TS    doc5477
4     Kn     Kn   doc10967
5     Kn     TS   doc10967
6     TS     TS   doc10967
7     TS     Kn   doc10967

How can I make sure that each combination of nodes appears only once per document? Because the graph is undirected, rows 1 and 2 describe the same edge, so I only want one of them to appear. The same goes for rows 5 and 7.

So that my dataframe looks like this:

  node 1 node 2        doc
0     Kn     Kn    doc5477
1     TS     Kn    doc5477
3     TS     TS    doc5477
4     Kn     Kn   doc10967
5     Kn     TS   doc10967
6     TS     TS   doc10967

CodePudding user response:

First, select the columns that define a unique combination (node 1, node 2 and doc in your case). Then sort the values in each row and convert them to a tuple, so that TS/Kn and Kn/TS produce the same key. A tuple is used rather than a list because pandas needs hashable values to detect duplicates. Finally, negate pandas.Series.duplicated to get a boolean mask that keeps only the first row of each combination.

Try this:

out = df.loc[~df[['node 1', 'node 2', 'doc']].apply(lambda row: tuple(sorted(row)), axis=1).duplicated()]

# Output :

print(out)

  node 1 node 2        doc
0     Kn     Kn    doc5477
1     TS     Kn    doc5477
3     TS     TS    doc5477
4     Kn     Kn   doc10967
5     Kn     TS   doc10967
6     TS     TS   doc10967
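
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question and applies the same idea with a different technique: instead of a row-wise apply, it sorts the two node columns with numpy.sort and then checks for duplicates together with doc. The helper name drop_mirrored_edges and its parameters are made up for illustration, not part of pandas.

```python
import numpy as np
import pandas as pd

# Sample edge list copied from the question.
df = pd.DataFrame(
    {
        "node 1": ["Kn", "TS", "Kn", "TS", "Kn", "Kn", "TS", "TS"],
        "node 2": ["Kn", "Kn", "TS", "TS", "Kn", "TS", "TS", "Kn"],
        "doc": ["doc5477"] * 4 + ["doc10967"] * 4,
    }
)


def drop_mirrored_edges(frame, node_cols=("node 1", "node 2"), group_col="doc"):
    """Hypothetical helper: keep the first row of each unordered node pair per group."""
    # Sort the node labels within each row so (TS, Kn) and (Kn, TS) become identical.
    ordered = np.sort(frame[list(node_cols)].to_numpy(), axis=1)
    keys = pd.DataFrame(ordered, index=frame.index)
    # Include the document so the same edge in different documents is kept.
    keys[group_col] = frame[group_col]
    # duplicated() flags every repeat after the first occurrence; negate it to keep firsts.
    return frame.loc[~keys.duplicated()]


out = drop_mirrored_edges(df)
print(out)
```

The original rows are returned unchanged (row 1 still reads TS, Kn); only the comparison key is sorted. Sorting just the node columns, rather than all three, also avoids relying on document names never colliding with node names.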