I'm trying to merge 2 dfs in a particular way.
Input:
import pandas as pd
import numpy as np
df = pd.DataFrame([('US', 'apples', 10), ('US', 'oranges', np.nan), ('US', 'grapes', np.nan)], columns = ['country', 'product', 'price'])
df_2 = pd.DataFrame([('US', 'bananas', 10), ('US', 'oranges', 10), ('US', 'oranges', 15)], columns = ['country', 'product', 'price'])
df
country product price
0 US apples 10.0
1 US oranges NaN # todo - here this line is w/o price
2 US grapes NaN
df_2
country product price
0 US bananas 10
1 US oranges 10 # todo - here we have price_1
2 US oranges 15 # todo - and price_2
What I'm looking for: I want to get dataframe where df and df_2 are combined BUT df_1 empty line for oranges replaced by 2 lines from df_2.
country product price
0 US apples 10.0
2 US grapes NaN
0 US bananas 10.0
1 US oranges 10.0
2 US oranges 15.0
What I've tried: If I concat them, then I still have oranges with 0 instead of having only oranges from df_2
df_3 = pd.concat([df, df_2])
df_3
country product price
0 US apples 10.0
1 US oranges NaN # todo - I want to omit this line
2 US grapes NaN
0 US bananas 10.0
1 US oranges 10.0
2 US oranges 15.0
Is there're an easy way to do it? Appreciate any help!
CodePudding user response:
After concat, looks like you want to drop the NaN price rows if there exist non-NaN rows for the same ['country', 'product'].
We can use groupby() and count to find the ['country', 'product'] groups with non-NaN prices. Then, we apply dropna() on these groups.
df_3 = pd.concat([df, df_2])
groups = df_3.groupby(['country', 'product'])['price']
group_counts = groups.transform('count')
pd.concat([df_3[group_counts > 0].dropna(), df_3[group_counts == 0]])
Note that count does not count NaN's, so if a ['country', 'product'] group only has NaN prices, its count will be 0. Thus, df_3[group_counts > 0] only selects groups that have non-NaN rows. Finally, we concat with the groups that only have NaN rows.
If you might have multiple NaN rows for the same ['country', 'product'] and want to combine them, change the last line to:
pd.concat([df_3[group_counts > 0].dropna(), df_3[group_counts == 0].drop_duplicates()])