Home > Software engineering >  Merge: discard duplicates that do not exist in both left and right dataframes?
Merge: discard duplicates that do not exist in both left and right dataframes?

Time:10-27

I have the same data coming from two sources. Both are in pandas DataFrames. A few of things are happening:

  1. Due to a reading error in some old software, the data from one source 1 (left) is only a subset of the data from source 2 (right).
  2. The data from source 2 has one additional column.
  3. Both sources of data may contain valid duplicate values.
  4. There is no shared unique key between the two.

Desired Outcome

  1. I want to merge the two, keeping the intersection.
  2. Duplicates should only be kept if the duplicates exist in both dataframes. This seems difficult because the two dataframes are different lengths and share no unique keys.
  3. The extra column in the second dataframe should be in the new, merged dataframe.

Working example / What I've tried

Here is a simple example:

df1 = pd.DataFrame( 
    { # last entry a duplicate of the first
        'var1' : [1, 2, 3, 1,], 
        'var2'  : ['a', 'b', 'c', 'a']
    }
)

df2 = pd.DataFrame(  
    { # last two entries duplicates of first two.
    'cat': [1, 2, 3, 4, 1, 2],
    'var1' : [1, 2, 3, 4, 1, 2],
    'var2'  : ['a', 'b', 'c', 'd', 'a', 'b']
    }
)


merged = pd.merge(df1, df2, on=["var1", "var2"], how="inner")
merged

results in:

# - indicates undesired duplicates
  var1 var2 cat
0   1   a   1 
1   1   a   1 
2   1   a   1 #
3   1   a   1 #
4   2   b   2 
5   2   b   2 #
6   3   c   3 

But what I'm expecting is (order doesn't matter):

  var1 var2 cat
0   1   a   1 
1   1   a   1 
2   2   b   2 
3   3   c   3 

I took a look at dropping duplicates, but it does not discriminate and destroys valuable information.


merged.drop_duplicates()

  var1 var2 cat
0   1   a   1
4   2   b   2
6   3   c   3

One thought I had was to figure out a way to have two more binary columns that identifies the source before merging and drop any rows where both columns are not true.

  var1 var2 cat src1 src2
0   1   a   1    1     1 
1   1   a   1    1     1
2   1   a   1    0     1 # drop
3   1   a   1    0     1 # drop
4   2   b   2    1     1
5   2   b   2    0     1 # drop
6   3   c   3    1     1 

However, I'm not sure of how to achieve such labelling during a merge (if it's even possible).

Question:

So, in short, the question is "How do I merge two tables, allowing preserving duplicates only when they exist in both tables"?

CodePudding user response:

Try to merge with enumeration:

key_cols = ['var1','var2']
(df1.assign(enum=df1.groupby(key_cols).cumcount())
    .merge(df2.assign(enum=df2.groupby(key_cols).cumcount()),
           on=['enum']   key_cols, how='inner')
    .drop('enum', axis=1)
)

Output:

   var1 var2  cat
0     1    a    1
1     2    b    2
2     3    c    3
3     1    a    1
  • Related