I have a large df with many rows that share the same value in some of the columns. I want to do the following:
- new df = identify the rows in df that have a value (not empty) in a certain column.
'''
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, 2, 2, 2, 3, 4],
                   "b": ['A', 'B', 'B', 'B', 'C', 'D'],
                   "c": [np.nan, 2, np.nan, np.nan, np.nan, np.nan]})

df1 = df[~df['c'].isnull()]
'''
- add to the new df the rows from df that share the same two keys ('a' and 'b').
I tried to use merge:
df2 = pd.merge(df1, df, on=['a', 'b'], how='left')
But the result added the same row several times instead of the unique rows:
a b c_x c_y
0 2 B 2.0 2.0
1 2 B 2.0 NaN
2 2 B 2.0 NaN
I want to keep only one 'c' column with all the values. Not sure what approach to use.
Hope I made it clear... Thanks!
CodePudding user response:
As far as I understand, you would like to group by 'a' and 'b' and return only those groups where at least one row does not have a NaN in column 'c'. If that's the case, here you go.
Load the df:
df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 4],
                   "b": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
                   "c": [None, None, None, 2, None, None, None, None]})
Filter for groups containing any non-NaN:
df.groupby(['a','b']).filter(lambda g: any(~g['c'].isna()))
output:
a b c
3 2 B 2.0
4 2 B NaN
5 2 B NaN
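As a side note (an alternative sketch, not part of the answer above): the merge idea from the question can also work if you first reduce the frame to its unique ('a', 'b') key pairs that have a non-NaN 'c' and join those back onto df. Because the key frame carries no 'c' column, you avoid the duplicated `c_x`/`c_y` columns:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 4],
                   "b": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
                   "c": [None, None, None, 2, None, None, None, None]})

# unique ('a', 'b') pairs that have at least one non-NaN 'c'
keys = df.loc[df['c'].notna(), ['a', 'b']].drop_duplicates()

# inner join keeps every row of df whose key pair appears in `keys`;
# since `keys` has no 'c' column there is only one 'c' in the result
result = df.merge(keys, on=['a', 'b'])
```

Note that `merge` resets the index, so the result is indexed 0..2 rather than keeping the original labels 3..5 that `groupby().filter()` preserves.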