I have the same data coming from two sources; both are pandas DataFrames. A few things are happening:
- Due to a reading error in some old software, the data from source 1 (left) is only a subset of the data from source 2 (right).
- The data from source 2 has one additional column.
- Both sources of data may contain valid duplicate values.
- There is no shared unique key between the two.
Desired Outcome
- I want to merge the two, keeping the intersection.
- Duplicates should only be kept if the duplicates exist in both dataframes. This seems difficult because the two dataframes are different lengths and share no unique keys.
- The extra column in the second dataframe should be in the new, merged dataframe.
Working example / What I've tried
Here is a simple example:
import pandas as pd

df1 = pd.DataFrame(
    {  # last entry is a duplicate of the first
        'var1': [1, 2, 3, 1],
        'var2': ['a', 'b', 'c', 'a']
    }
)
df2 = pd.DataFrame(
    {  # last two entries are duplicates of the first two
        'cat': [1, 2, 3, 4, 1, 2],
        'var1': [1, 2, 3, 4, 1, 2],
        'var2': ['a', 'b', 'c', 'd', 'a', 'b']
    }
)
merged = pd.merge(df1, df2, on=["var1", "var2"], how="inner")
merged
results in:
# - indicates undesired duplicates
var1 var2 cat
0 1 a 1
1 1 a 1
2 1 a 1 #
3 1 a 1 #
4 2 b 2
5 2 b 2 #
6 3 c 3
But what I'm expecting is (order doesn't matter):
var1 var2 cat
0 1 a 1
1 1 a 1
2 2 b 2
3 3 c 3
I took a look at dropping duplicates, but it does not discriminate between legitimate and spurious duplicates, so it destroys valuable information.
merged.drop_duplicates()
var1 var2 cat
0 1 a 1
4 2 b 2
6 3 c 3
One thought I had was to add two binary columns identifying each source before merging, then drop any rows where both columns are not true.
var1 var2 cat src1 src2
0 1 a 1 1 1
1 1 a 1 1 1
2 1 a 1 0 1 # drop
3 1 a 1 0 1 # drop
4 2 b 2 1 1
5 2 b 2 0 1 # drop
6 3 c 3 1 1
However, I'm not sure of how to achieve such labelling during a merge (if it's even possible).
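For what it's worth, pandas does support this kind of source labelling natively: `merge` accepts `indicator=True`, which adds a `_merge` column marking each row as `'left_only'`, `'right_only'`, or `'both'`. A minimal sketch, reusing the example frames above (note this flags source membership but does not, by itself, remove the unwanted cartesian-product duplicates):

```python
import pandas as pd

df1 = pd.DataFrame({'var1': [1, 2, 3, 1], 'var2': ['a', 'b', 'c', 'a']})
df2 = pd.DataFrame({'cat': [1, 2, 3, 4, 1, 2],
                    'var1': [1, 2, 3, 4, 1, 2],
                    'var2': ['a', 'b', 'c', 'd', 'a', 'b']})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
labelled = pd.merge(df1, df2, on=['var1', 'var2'], how='outer', indicator=True)

# Keeping only rows found in both sources still leaves the
# cartesian-product duplicates, so labelling alone isn't enough here.
both = labelled[labelled['_merge'] == 'both'].drop(columns='_merge')
```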
Question:
So, in short, the question is: "How do I merge two tables, preserving duplicates only when they exist in both tables?"
CodePudding user response:
Try to merge with enumeration:
key_cols = ['var1', 'var2']

(df1.assign(enum=df1.groupby(key_cols).cumcount())
    .merge(df2.assign(enum=df2.groupby(key_cols).cumcount()),
           on=['enum'] + key_cols, how='inner')
    .drop('enum', axis=1)
)
Output:
var1 var2 cat
0 1 a 1
1 2 b 2
2 3 c 3
3 1 a 1
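The trick is that `groupby(...).cumcount()` numbers repeated key rows 0, 1, 2, … within each frame, so the n-th occurrence of a key on the left can only pair with the n-th occurrence on the right; the extra copies in df2 get enumeration values with no partner in df1 and fall out of the inner merge. A self-contained version of the answer:

```python
import pandas as pd

df1 = pd.DataFrame({'var1': [1, 2, 3, 1], 'var2': ['a', 'b', 'c', 'a']})
df2 = pd.DataFrame({'cat': [1, 2, 3, 4, 1, 2],
                    'var1': [1, 2, 3, 4, 1, 2],
                    'var2': ['a', 'b', 'c', 'd', 'a', 'b']})

key_cols = ['var1', 'var2']

# cumcount() numbers duplicate keys 0, 1, 2, ... within each frame,
# so only the n-th copy of a key on each side can match.
merged = (df1.assign(enum=df1.groupby(key_cols).cumcount())
             .merge(df2.assign(enum=df2.groupby(key_cols).cumcount()),
                    on=['enum'] + key_cols, how='inner')
             .drop('enum', axis=1))
```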