Pandas SettingWithCopyWarning with np.where

I have a pandas data frame that has a column containing a bunch of correlations (all float values). I'm trying to create another column to categorise these correlations into three distinct categories (high/medium/low). I do this using np.where:

df['Category'] = np.where(df['Correlation'] >= 0.5, 'high', 
                                   np.where(data['Correlation'] >= 0.3, 'medium','low'))

When I try doing this, I always get the SettingWithCopyWarning (it seems to work though). I have read up on the difference between copies and views, and even seen recommendations to use .where over other methods to avoid any confusion (and the SettingWithCopyWarning). I still can't quite wrap my head around why I get the warning with this method, can someone explain?

CodePudding user response：

Your code does not generate the warning in my environment.

import platform
import numpy as np
import pandas as pd
df = pd.DataFrame({'Correlation': [0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1]})
df['Category'] = np.where(df['Correlation'] >= 0.5, 'high',
                          np.where(df['Correlation'] >= 0.3, 'medium', 'low'))

# Report the versions used, since the warning may depend on them
print(f'pandas version: {pd.__version__}')
print(f'numpy version: {np.__version__}')
print(f'python version: {platform.python_version()}')
print(df)

Output:

pandas version: 1.4.1
numpy version: 1.21.6
python version: 3.10.2
    Correlation Category
0           0.0      low
1           0.1      low
2           0.2      low
3           0.3   medium
4           0.4   medium
5           0.5     high
6           0.6     high
7           0.7     high
8           0.8     high
9           0.9     high
10          1.0     high

It may be a pandas version issue. I suppose it's possible in theory that differences in the df dataframe between my test case and yours could result in your getting the warning, but it's not obvious to me that this could be the case.

CodePudding user response：

Most likely your df has been created as a view of another DataFrame, e.g.:

data = pd.DataFrame({'Correlation': np.arange(0, 1.3, 0.1)})  # Your "initial" DataFrame
df = data.iloc[0:11]

Now df holds a fragment of data, but it shares the underlying data buffer of data instead of owning its own.

Then if you attempt to run:

df['Category'] = np.where(df['Correlation'] >= 0.5, 'high',
    np.where(df['Correlation'] >= 0.3, 'medium', 'low'))

you get exactly the warning you mentioned, because pandas cannot tell whether you meant the assignment to affect data as well.

To get rid of it, create df as an independent DataFrame, e.g.:

df = data.iloc[0:11].copy()

Now df uses its own data buffer and you may do with it whatever you wish, including adding new columns.
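To see the difference for yourself, here is a minimal sketch that records the warnings raised in each case. The helper name add_category is mine (hypothetical), and whether the sliced case warns at all depends on your pandas version: releases with copy-on-write enabled may not emit it.

```python
import warnings

import numpy as np
import pandas as pd


def add_category(frame):
    """Add the Category column; return the names of any warnings raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let filters hide anything
        frame["Category"] = np.where(
            frame["Correlation"] >= 0.5, "high",
            np.where(frame["Correlation"] >= 0.3, "medium", "low"))
    return [w.category.__name__ for w in caught]


data = pd.DataFrame({"Correlation": np.arange(0, 1.3, 0.1)})
sliced = data.iloc[0:11]         # derived from data, may trigger the warning
copied = data.iloc[0:11].copy()  # independent DataFrame

sliced_warnings = add_category(sliced)
copied_warnings = add_category(copied)

print("slice:", sliced_warnings)
print("copy: ", copied_warnings)
```

In both cases the new column ends up only in the derived frame; data itself never gets a Category column.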

To check whether your df uses its own data buffer, run:

df._is_view

In your original environment (without my correction) you should get True, but after you created df using .copy() you should get False. Note that _is_view is a private attribute, so its behaviour may change between pandas versions.
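Since _is_view is private, a rough alternative using only public API is to ask NumPy whether the two frames' column arrays overlap in memory. This is just a sketch: shares_buffer is a hypothetical helper, and it assumes a plain float column for which to_numpy() does not make a copy.

```python
import numpy as np
import pandas as pd


def shares_buffer(child, parent, column="Correlation"):
    """Return True if the given column of both frames overlaps in memory."""
    return bool(np.shares_memory(child[column].to_numpy(),
                                 parent[column].to_numpy()))


data = pd.DataFrame({"Correlation": np.arange(0, 1.3, 0.1)})
sliced = data.iloc[0:11]         # row slice of data
copied = data.iloc[0:11].copy()  # independent copy

slice_shares = shares_buffer(sliced, data)
copy_shares = shares_buffer(copied, data)

print("slice shares memory with data:", slice_shares)
print("copy shares memory with data: ", copy_shares)
```

The copy should never share memory with data; the slice normally does.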