Determine number of common rows (or exact intersection of rows) from two dataframe python (with exce

Time:01-08

I am trying to get the exact intersection of rows between two pandas DataFrames in Python. I can do this with the merge() function.

Current logic:

import pandas as pd

# input df's
data1 = pd.DataFrame({'x1':[1,2,3,4,5,3],                   
                      'x3':[9,8,7,6,6,8]})

data3 = pd.DataFrame({'x1':[2,1,2,6,4,4,5],                   
                      'x3':[8,3,9,8,7,6,6]})


data_13 = data1.merge(data3,                                  # Merge DataFrames with indicator 
                        indicator = True,
                        how = 'outer')
print(data_13)                                               

## common rows (rows that appear in both data1 and data3)
data_13_diff = data_13.loc[lambda x : x['_merge'] == 'both'] 
print(data_13_diff)   
                                        
## count the number of rows in data_13_diff
print('count:',data_13_diff.shape[0])

Output:

   x1  x3 _merge
1   2   8   both
3   4   6   both
4   5   6   both
count: 3

As expected, the output shows the common (intersecting) rows of data1 and data3.

However, I am stuck on an edge case: the logic breaks when one df contains several rows with the same values. For example:

# input df's
data1 = pd.DataFrame({'x1':[1,2,3,4,5,2],                    
                      'x3':[9,8,7,6,6,8]})

data3 = pd.DataFrame({'x1':[1,2,2,4,4,5,3],                   
                      'x3':[3,9,8,7,6,6,8]})

The output becomes:

   x1  x3 _merge
1   2   8   both
2   2   8   both
4   4   6   both
5   5   6   both
count: 4

Even though there is only one instance of (2,8) in data3, the current logic outputs two such instances because data1 has two of (2,8). This is not what the task requires. The required output is the "exact intersection" of the two df: (2,8) should appear only once, just like the other common entries (4,6) and (5,6), which gives the correct count of 3 common rows.

Similar thing is also observed when the input is:

# input df's
data1 = pd.DataFrame({'x1':[1,2,3,4,5,2],                    
                      'x3':[9,8,7,6,6,8]})

data3 = pd.DataFrame({'x1':[2,1,2,2,4,4,5],                   
                      'x3':[8,3,9,8,7,6,6]})

Here, both data1 and data3 have two instances of (2,8), so the required output is 4 common rows in total: two instances of (2,8) and one each of (4,6) and (5,6). Instead, the current logic returns 4 instances of (2,8):

   x1  x3 _merge
1   2   8   both
2   2   8   both
3   2   8   both
4   2   8   both
6   4   6   both
7   5   6   both
count: 6
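
The counts above can be explained by how merge pairs duplicates: a row that occurs m times in data1 and n times in data3 yields m*n matched rows, while the intersection described here would keep min(m, n). The following is only a sketch of that arithmetic on the third input; the names n1, n3, merge_rows and wanted_rows are introduced here for illustration and are not part of the original logic.

```python
import pandas as pd

data1 = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 2],
                      'x3': [9, 8, 7, 6, 6, 8]})
data3 = pd.DataFrame({'x1': [2, 1, 2, 2, 4, 4, 5],
                      'x3': [8, 3, 9, 8, 7, 6, 6]})

cols = ['x1', 'x3']

# How many times each distinct row occurs in each frame
n1 = data1.groupby(cols).size().reset_index(name='n1')
n3 = data3.groupby(cols).size().reset_index(name='n3')

# The keys are unique after grouping, so this merge cannot multiply rows
counts = n1.merge(n3, on=cols, how='inner')
counts['merge_rows'] = counts['n1'] * counts['n3']         # rows a plain merge produces
counts['wanted_rows'] = counts[['n1', 'n3']].min(axis=1)   # rows the exact intersection keeps

merge_total = int(counts['merge_rows'].sum())
wanted_total = int(counts['wanted_rows'].sum())
print(counts)
print('merge:', merge_total, 'wanted:', wanted_total)
```

For this input the row (2,8) contributes 2*2 = 4 rows to the plain merge but only min(2, 2) = 2 rows to the desired result, which accounts for the difference between 6 and 4.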

If anyone can help me fix this issue in the logic, that would be greatly appreciated. Any alternative suggestions or feedback are also welcome. :)

Cheers!

CodePudding user response:

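merge() produces a Cartesian product of the matching rows when the join columns contain duplicate values. To avoid this, add a helper count column that numbers each repeated row within its own df (0 for the first occurrence, 1 for the second, and so on) before merging, so that the k-th copy on one side only matches the k-th copy on the other. Using the last case as an example: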

# number repeated (x1, x3) rows 0, 1, 2, ... within each df
data1['cnt'] = data1.groupby(['x1', 'x3']).cumcount()
data3['cnt'] = data3.groupby(['x1', 'x3']).cumcount()

data1.merge(data3, how='inner')
   x1  x3  cnt
0   2   8    0
1   4   6    0
2   5   6    0
3   2   8    1
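
The same idea can be wrapped in a small helper that leaves the input frames unmodified. This is a minimal sketch, not a pandas API: the function name multiset_intersection and the temporary column name _cnt are made up for this example, and it assumes both frames have the same columns, no column already called _cnt, and no missing values in the compared columns.

```python
import pandas as pd

def multiset_intersection(left, right):
    """Rows common to both frames, each kept min(count_left, count_right) times."""
    cols = left.columns.tolist()
    # Number repeated rows 0, 1, 2, ... within each frame; assign() works on copies
    l = left.assign(_cnt=left.groupby(cols).cumcount())
    r = right.assign(_cnt=right.groupby(cols).cumcount())
    # The k-th copy on the left can only match the k-th copy on the right
    return l.merge(r, on=cols + ['_cnt'], how='inner').drop(columns='_cnt')

data1 = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 2],
                      'x3': [9, 8, 7, 6, 6, 8]})
data3 = pd.DataFrame({'x1': [2, 1, 2, 2, 4, 4, 5],
                      'x3': [8, 3, 9, 8, 7, 6, 6]})

common = multiset_intersection(data1, data3)
print(common)
print('count:', len(common))   # count: 4
```

Applied to the second input from the question, where data3 has a single (2,8), the same helper returns 3 rows.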