Home > OS >  Pandas merge based on two columns removes a lot of the initial dataset
Pandas merge based on two columns removes a lot of the initial dataset

Time:07-08

I have two pandas dataframes representing the edge lists of a graph. One is a dataframe called edge_list, which contains all the edges in the graph and their corresponding weights (it's a weighted graph). The first few lines look like this:

 -------- -------- ------- 
| source | target | value |
 -------- -------- ------- 
|     31 |     25 |    17 |
|     45 |     21 |    67 |
|     68 |    127 |    10 |
|    ... |    ... |   ... |
 -------- -------- ------- 

The second dataframe, called edge_list_small, has been created from edge_list by applying a filter, thus it only contains a subset of all edges. It also does not contain the value column (edge weights).

Now I want the weights of those edges in my small dataframe. Thus I figured I would merge the two based on the two columns, source and target. I am using:

complete_edge_list = small_edge_list.merge(edge_list,on=['source','target'])

There is a problem however.

print(len(edge_list), len(alpha_edge_list), len(complete_edge_list))

returns:

81,229,074 7,763,018 1,841,776

(commas added by me for readability). What am I doing wrong?

CodePudding user response:

If you try a left merge:

complete_edge_list = small_edge_list.merge(edge_list, on=['source','target'],
                                           how='left')

and get the expected size, this means that the small dataframe is not a subset of the large one. You have missing matches and should have NaNs in the value column.

  • Related