Home > OS >  How to combine two dataframes to override df_1 empty values with df_2 values (IF exist) AND keep all
How to combine two dataframes to override df_1 empty values with df_2 values (IF exist) AND keep all

Time:12-23

I'm trying to merge 2 dfs in a particular way.

Input:

import pandas as pd
import numpy as np

df = pd.DataFrame([('US', 'apples', 10), ('US', 'oranges', np.nan), ('US', 'grapes', np.nan)], columns = ['country', 'product', 'price'])
df_2 = pd.DataFrame([('US', 'bananas', 10), ('US', 'oranges', 10), ('US', 'oranges', 15)], columns = ['country', 'product', 'price'])

df
  country  product  price
0      US   apples   10.0
1      US  oranges    NaN  # todo - here this line is w/o price
2      US   grapes    NaN
df_2
  country  product  price
0      US  bananas     10
1      US  oranges     10  # todo - here we have price_1
2      US  oranges     15  # todo - and price_2

What I'm looking for: I want to get dataframe where df and df_2 are combined BUT df_1 empty line for oranges replaced by 2 lines from df_2.

  country  product  price
0      US   apples   10.0
2      US   grapes    NaN
0      US  bananas   10.0
1      US  oranges   10.0
2      US  oranges   15.0

What I've tried: If I concat them, then I still have oranges with 0 instead of having only oranges from df_2

df_3 = pd.concat([df, df_2])
df_3
  country  product  price
0      US   apples   10.0
1      US  oranges    NaN  # todo - I want to omit this line
2      US   grapes    NaN
0      US  bananas   10.0
1      US  oranges   10.0
2      US  oranges   15.0

Is there're an easy way to do it? Appreciate any help!

CodePudding user response:

After concat, looks like you want to drop the NaN price rows if there exist non-NaN rows for the same ['country', 'product'].

We can use groupby() and count to find the ['country', 'product'] groups with non-NaN prices. Then, we apply dropna() on these groups.

df_3 = pd.concat([df, df_2])
groups = df_3.groupby(['country', 'product'])['price']
group_counts = groups.transform('count') 
pd.concat([df_3[group_counts > 0].dropna(), df_3[group_counts == 0]])

Note that count does not count NaN's, so if a ['country', 'product'] group only has NaN prices, its count will be 0. Thus, df_3[group_counts > 0] only selects groups that have non-NaN rows. Finally, we concat with the groups that only have NaN rows.

If you might have multiple NaN rows for the same ['country', 'product'] and want to combine them, change the last line to:

pd.concat([df_3[group_counts > 0].dropna(), df_3[group_counts == 0].drop_duplicates()])
  • Related