I have a large df with many rows that share the same value in some of the columns. I want to do the following:
- new df = identify the rows in df that have a value (not empty) in a certain column.
'''
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, 2, 2, 2, 3, 4],
                   "b": ['A', 'B', 'B', 'B', 'C', 'D'],
                   "c": [np.nan, 2, np.nan, np.nan, np.nan, np.nan]})

df1 = df[~df['c'].isnull()]
'''
- add to the new df the rows from df that share the same two keys ('a' and 'b').
I tried to use merge:
df2 = pd.merge(df1, df, on=['a', 'b'], how='left')
But the result added the same row several times instead of the unique rows:
a b c_x c_y
0 2 B 2.0 2.0
1 2 B 2.0 NaN
2 2 B 2.0 NaN
I want to keep only one 'c' column with all the values. Not sure what approach to use.
Hope I made it clear... Thanks!
CodePudding user response:
As far as I understand, you would like to group by 'a' and 'b' and return only those groups where at least one row does not have a NaN in column 'c'. If that's the case, here you go.
Load the df:
df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 4],
                   "b": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
                   "c": [None, None, None, 2, None, None, None, None]})
Filter for groups containing any non-NaN:
df.groupby(['a','b']).filter(lambda g: any(~g['c'].isna()))
output:
a b c
3 2 B 2.0
4 2 B NaN
5 2 B NaN
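As a side note (an alternative sketch, not part of the answer above): the merge idea from the question can also work if you first reduce the frame to its unique ('a', 'b') key pairs that have a non-NaN 'c' and join those back onto df. Because the key frame carries no 'c' column, you avoid the duplicated `c_x`/`c_y` columns:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 4],
                   "b": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
                   "c": [None, None, None, 2, None, None, None, None]})

# unique ('a', 'b') pairs that have at least one non-NaN 'c'
keys = df.loc[df['c'].notna(), ['a', 'b']].drop_duplicates()

# inner join keeps every row of df whose key pair appears in `keys`;
# since `keys` has no 'c' column there is only one 'c' in the result
result = df.merge(keys, on=['a', 'b'])
```

Note that `merge` resets the index, so the result is indexed 0..2 rather than keeping the original labels 3..5 that `groupby().filter()` preserves.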