Pandas - How to merge two dataframes properly with conditions?-CodePudding

I have two dataframes, and I want to merge them based on an im_id. However, for the im_df I want to keep everything that dd_df doesn't have and add what dd_df has that im_df doesn't have.

Basically, if im_df had 1, 2, 3, and 6, and dd_df had 1, 2, and 5, I want the total to be 1, 2, 3, 5, and 6 - with 1 and 2 taken from im_df (should be the same as what's in dd_df).

I've tried this so far:

df = im_df.merge(dd_df, on=["im_id"], suffixes=("", "_DROP"))
df = df.filter(regex="^(?!.*_DROP)")

df = df.drop(
    columns=[
        "some_col_from_im_df_we_dont_need",
        "and_another",
    ]
)
df = df.drop_duplicates(subset="im_id")

This doesn't seem to work, as df ends up with about 560k rows. im_df initially has 1.5m rows, and dd_df has about 570k.

Does anyone know what I might be doing wrong?

For instance, one dataframe (x) has the following:

f_id | im_id | re_id | ip_id
  1      1       3       4
  2      5       1       3

and the other dataframe (y) has

f_id | im_id | evo
  1      1      "hello"
  2      5      "yo"
  3      4      "boo"

And I want the resultant table to be

f_id | im_id | re_id | ip_id | evo
  1       1      3       4     "hello"
  2       5      1       3     "yo"

CodePudding user response：

I think you should merge on f_id and im_id and use the left attribute:

pd.merge(x, y, on=["f_id", "im_id"], how="left")

With this you can replicate your example output

  f_id im_id re_id  ip_id       evo
0    1     1     3      4   "hello"
1    2     5     1      3      "yo"

CodePudding user response：

I think what you want is a merge and append. If I'm understanding you correctly, you just want to add new rows from dd_df when they do not exist in im_df. Assuming both dataframes have the same columns (if they do not have the same columns, perform the following on a subset of dd_df that has the same columns, unless you want the new columns added)

df_temp = im_df.merge(dd_df, how='inner', on='im_id')
df_new = im_df.append(dd_df[~dd_df.index.isin(df_temp)])

This appends all the rows in dd_df that are not in im_df.

You can probably get away with not making df_temp too (haven't tested the following but give it a try)

df_new = im_df.append(dd_df[~dd_df.im_id.isin(im_df.im_id)])