I have two dataframes, and I want to merge them based on an im_id
. However, for the im_df
I want to keep everything that dd_df
doesn't have and add what dd_df
has that im_df
doesn't have.
Basically, if im_df
had 1, 2, 3, and 6, and dd_df
had 1, 2, and 5, I want the total to be 1, 2, 3, 5, and 6 - with 1 and 2 taken from im_df
(should be the same as what's in dd_df
).
I've tried this so far:
df = im_df.merge(dd_df, on=["im_id"], suffixes=("", "_DROP"))
df = df.filter(regex="^(?!.*_DROP)")
df = df.drop(
columns=[
"some_col_from_im_df_we_dont_need",
"and_another",
]
)
df = df.drop_duplicates(subset="im_id")
This doesn't seem to work, as df
ends up with about 560k rows. im_df
initially has 1.5m rows, and dd_df
has about 570k.
Does anyone know what I might be doing wrong?
For instance, one dataframe (x) has the following:
f_id | im_id | re_id | ip_id
1 1 3 4
2 5 1 3
and the other dataframe (y) has
f_id | im_id | evo
1 1 "hello"
2 5 "yo"
3 4 "boo"
And I want the resultant table to be
f_id | im_id | re_id | ip_id | evo
1 1 3 4 "hello"
2 5 1 3 "yo"
CodePudding user response:
I think you should merge on f_id and im_id and use the left
attribute:
pd.merge(x, y, on=["f_id", "im_id"], how="left")
With this you can replicate your example output
f_id im_id re_id ip_id evo
0 1 1 3 4 "hello"
1 2 5 1 3 "yo"
CodePudding user response:
I think what you want is a merge and append. If I'm understanding you correctly, you just want to add new rows from dd_df when they do not exist in im_df. Assuming both dataframes have the same columns (if they do not have the same columns, perform the following on a subset of dd_df that has the same columns, unless you want the new columns added)
df_temp = im_df.merge(dd_df, how='inner', on='im_id')
df_new = im_df.append(dd_df[~dd_df.index.isin(df_temp)])
This appends all the rows in dd_df that are not in im_df.
You can probably get away with not making df_temp too (haven't tested the following but give it a try)
df_new = im_df.append(dd_df[~dd_df.im_id.isin(im_df.im_id)])