mask or filter a dataframe (containing strings) using a second dataframe with NaN

Time:10-18

I have two dataframes: df1 and df2. They have the same shape. Here's what they look like:

df1

1 2 3 4 5
20% 10% 5% 1% 0%
20% 10% 5% 1% 0%
20% 10% 5% 1% 0%

df2

1 2 3 4 5
string word thing NaN NaN
string word thing NaN NaN
string word thing NaN NaN

I want to use df2 to mask or filter df1, so that "new_df1" looks like the table below. Where df2 has NaN, I want df1 to have NaN. Where df2 is not NaN, I want to keep the original value of df1.

new_df1

1 2 3 4 5
20% 10% 5% NaN NaN
20% 10% 5% NaN NaN
20% 10% 5% NaN NaN

I've tried calls like df1.mask(df2=None), df1.mask(df2!=None), df1.where(df2!=None), and df1.where(df2=None). I either get a dataframe full of NaN or the opposite of what I want (NaNs in new_df1 where there is a value in df2). I suspect it's because the values in df1 are strings and not integers or floats? It looks like df.mask() and df.where() don't handle strings, and I can't find the right function, but it must exist. Can anyone help?

CodePudding user response:

You can try:

new_df1 = df1.mask(df2.isnull())

Or you can also do:

new_df1 = df1.where(~df2.isnull())
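
As a quick check, here is a minimal sketch that rebuilds the two dataframes from the question and applies both forms. The way the frames are constructed (integer column labels 1 to 5, percentages stored as strings) is an assumption based on the tables shown above, and `same` is just an illustrative name.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data: string values, columns 1..5
cols = [1, 2, 3, 4, 5]
df1 = pd.DataFrame([["20%", "10%", "5%", "1%", "0%"]] * 3, columns=cols)
df2 = pd.DataFrame([["string", "word", "thing", np.nan, np.nan]] * 3, columns=cols)

# mask() replaces values where the condition is True (df2 is missing)
new_df1 = df1.mask(df2.isnull())

# where() keeps values where the condition is True (df2 is present)
same = df1.where(df2.notnull())

print(new_df1)
```

Both calls take a boolean dataframe built with isnull() or notnull(), so it should not matter that df1 holds strings. A comparison such as df2 != None does not detect NaN, which is likely why the attempts in the question misbehaved.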

CodePudding user response:

Since the mask is aligned on index labels, you may need to reset the index of both dataframes before masking if their labels differ.

Input

import pandas as pd

df1 = pd.DataFrame([[.1,.2], [.4,.5]], index=[1,2])
df2 = pd.DataFrame([['a',None], [None,'d']], index=[3,4])

Masking without aligning the index gives the wrong result, because none of the index labels match:

df1.mask(df2.isnull())

    0   1
1   NaN NaN
2   NaN NaN

Resetting the index first gives the correct result:

df1.reset_index(drop=True).mask(df2.reset_index(drop=True).isnull())

    0   1
0   0.1 NaN
1   NaN 0.5
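
If you would rather keep df1's original index, one possible alternative (not part of the answer above) is to pass the condition as a plain NumPy array, so that it is applied by position instead of being aligned on labels. This is a sketch that assumes both frames have the same shape and row order; `positional` is an illustrative name.

```python
import pandas as pd

df1 = pd.DataFrame([[.1, .2], [.4, .5]], index=[1, 2])
df2 = pd.DataFrame([['a', None], [None, 'd']], index=[3, 4])

# to_numpy() drops the labels, so the condition is matched cell by cell
positional = df1.mask(df2.isnull().to_numpy())

print(positional)
```

The result keeps df1's index labels (1 and 2) while the NaN pattern follows df2 by position.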