pandas - mask works on whole dataframe but not on selected columns?

Time:09-08

I was replacing values in columns and noticed that if I use mask on the whole dataframe, it produces the expected result, but if I use it on selected columns via .loc, it doesn't change any values.

Can you explain why, and tell me whether this is the expected behavior?

You can try it with a dataframe dt that contains zeros in its columns:

import numpy as np
import pandas as pd

dt = pd.DataFrame(np.random.randint(0, 3, size=(10, 3)), columns=list('ABC'))
dt.mask(lambda x: x == 0, np.nan, inplace=True)
# replaces all zeros with NaN, as expected

But:

dt = pd.DataFrame(np.random.randint(0,3,size=(10, 3)), columns=list('ABC'))
columns = list('BC')
dt.loc[:, columns].mask(lambda x: x == 0, np.nan, inplace=True)
# won't change anything. I expect the B and C columns to have their zeros replaced

CodePudding user response:

I guess it's because dt.loc[:, columns] gives you a new object holding a copy of the selected columns, so the in-place mask modifies that copy and never touches the original dataframe.

You can try this instead:

dt[columns] = dt[columns].mask(dt[columns] == 0)
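
To illustrate the point about the copy, here is a minimal sketch. The small fixed frame and the names subset, masked, zeros_left_in_dt and zeros_after_assign are made up for the example; they are not from the question. It only shows that masking the selection leaves dt alone until the result is assigned back.

```python
import numpy as np
import pandas as pd

# Small fixed frame so the outcome is reproducible (illustrative values).
dt = pd.DataFrame({"A": [0, 1, 2], "B": [0, 2, 0], "C": [1, 0, 2]})
columns = ["B", "C"]

# Selecting a list of columns produces a new DataFrame.
subset = dt.loc[:, columns]
masked = subset.mask(subset == 0)  # zeros become NaN in the new object only

# dt itself still contains its zeros in B and C at this point.
zeros_left_in_dt = int((dt[columns] == 0).sum().sum())

# Assigning the masked result back is what actually updates dt.
dt[columns] = masked
zeros_after_assign = int((dt[columns] == 0).sum().sum())

print(zeros_left_in_dt, zeros_after_assign)
```

Column A is never touched, so its zero stays in place.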

CodePudding user response:

Here the loc indexer returns a copy of the selected columns. The mask function then performs its operation in place on that copy, not on dt. In a one-liner nothing holds a reference to the copy, so the modified data is discarded immediately. To keep access to it, split the code into two lines and bind the copy to a name:

tmp = dt.loc[:, columns]
tmp.mask(tmp == 0, np.nan, inplace=True)

Then write the result back to the dataframe:

dt[columns] = tmp

If you don't use the inplace option of mask, on the other hand, you can do everything in one line of code:

dt[columns] = dt.loc[:, columns].mask(dt[columns] == 0, np.nan, inplace=False)
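
If this pattern comes up often, the non-inplace one-liner can be wrapped in a small function. The sketch below is only an illustration: mask_columns and its value parameter are hypothetical names, not part of pandas, and it assumes you want a new dataframe back rather than a modified original.

```python
import numpy as np
import pandas as pd

def mask_columns(df, cols, value=0):
    """Hypothetical helper: copy df and replace `value` with NaN in `cols` only."""
    out = df.copy()
    # Same non-inplace mask-and-assign pattern as above, applied to the copy.
    out[cols] = out[cols].mask(out[cols] == value, np.nan)
    return out

dt = pd.DataFrame({"A": [0, 1], "B": [0, 2], "C": [3, 0]})
result = mask_columns(dt, ["B", "C"])
print(result)
```

The original dt keeps its zeros; only result has NaN in columns B and C.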

Extra: If you want to better understand how the inplace parameter works in pandas, I recommend you read these posts:
