Why do I get a SettingWithCopyWarning when using a MultiIndex (but not with a simple index)?

Time:01-24

The following piece of code works as expected, with no warnings. I create a dataframe, create two sub-dataframes from it using .loc, give them the same index and then assign to a column of one of them.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(20, 4),
                  index=pd.Index(range(20)),
                  columns=['one', 'two', 'three', 'four'])

d1 = df.loc[[2, 4, 6], :]
d2 = df.loc[[3, 5, 7], :]

idx = pd.Index(list('abc'), name='foo')
d1.index = idx
d2.index = idx

d1['one'] = d1['one'] - d2['two']

However, if I do exactly the same thing except with a multi-indexed dataframe, I get a SettingWithCopyWarning.

import numpy as np
import pandas as pd

arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays, columns=['one', 'two', 'three', 'four'])

d1 = df.loc[(['bar', 'qux', 'foo'], 'one'), :]
d2 = df.loc[(['bar', 'qux', 'foo'], 'two'), :]

idx = pd.Index(list('abc'), name='foo')
d1.index = idx
d2.index = idx

d1['one'] = d1['one'] - d2['two']

I know that I can avoid this warning by calling .copy() when creating d1 and d2, but I struggle to understand why this is necessary in the second case but not in the first. The chained indexing is equally present in both cases, isn't it? Also, the operation works in both cases (i.e. d1 is modified but df is not). So, what's the difference?
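For reference, the .copy() workaround mentioned above might look like the following minimal sketch. It reuses the names from the snippets above; df_before, expected and copy_warnings are helper names introduced only for this illustration, and the warning is matched by class name so that the snippet does not depend on where a given pandas version exposes that class.

```python
import warnings

import numpy as np
import pandas as pd

arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays,
                  columns=['one', 'two', 'three', 'four'])
df_before = df.copy()  # hypothetical helper: lets us check df stays untouched

# .copy() makes d1 and d2 explicitly independent of df
d1 = df.loc[(['bar', 'qux', 'foo'], 'one'), :].copy()
d2 = df.loc[(['bar', 'qux', 'foo'], 'two'), :].copy()

idx = pd.Index(list('abc'), name='foo')
d1.index = idx
d2.index = idx

# Compute the expected result before assigning, for comparison afterwards
expected = d1['one'] - d2['two']

# Record every warning raised by the assignment
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    d1['one'] = d1['one'] - d2['two']

# Keep only warnings whose class name mentions SettingWithCopy
copy_warnings = [w for w in caught
                 if "SettingWithCopy" in w.category.__name__]
print(len(copy_warnings))
```

With the explicit copies in place, the assignment should not produce any SettingWithCopyWarning, and df itself is left as it was.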

CodePudding user response:

One way to avoid the warning is to use set_index, which returns a new DataFrame instead of modifying the index in place:

import numpy as np
import pandas as pd

arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays, columns=['one', 'two', 'three', 'four'])

d1 = df.loc[(['bar', 'qux', 'foo'], 'one'), :]
d2 = df.loc[(['bar', 'qux', 'foo'], 'two'), :]

idx = pd.Index(list('abc'), name='foo')
d1 = d1.set_index(idx)  # <- HERE
d2 = d2.set_index(idx)  # <- HERE

d1['one'] = d1['one'] - d2['two']

CodePudding user response:

I believe this falls into the internals of pandas. The decision to return a copy depends on several factors (dtype homogeneity,

What you can do is check whether or not you have a copy or view with _is_copy, and force one if needed:

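def ensure_copy(df):
    # _is_copy is None for an independent frame; otherwise it holds a
    # weak reference to the parent object the frame was taken from
    if df._is_copy is not None:
        return df.copy()
    return df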

d1 = ensure_copy(df.loc[(['bar', 'qux', 'foo'], 'one'), :])
d2 = ensure_copy(df.loc[(['bar', 'qux', 'foo'], 'two'), :])

idx = pd.Index(list('abc'), name='foo')
d1.index = idx
d2.index = idx

d1['one'] = d1['one'] - d2['two']

Note that _is_copy is an internal pandas attribute, not a public one, so there is no guarantee that it will remain available in the future.
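To see which selections pandas flags, one could inspect _is_copy directly for the two cases from the question. This is only a sketch: is_flagged, simple, multi and flags are hypothetical names introduced for this illustration, the getattr fallback is an assumption to cope with versions where the attribute may be absent, and the result for the first two entries depends on the pandas version and its copy-on-write settings.

```python
import numpy as np
import pandas as pd


def is_flagged(frame):
    # _is_copy is private; fall back to None if this pandas version lacks it
    return getattr(frame, "_is_copy", None) is not None


# Frame with a simple integer index, as in the first snippet of the question
simple = pd.DataFrame(np.random.randn(20, 4),
                      index=pd.Index(range(20)),
                      columns=['one', 'two', 'three', 'four'])

# Frame with a two-level MultiIndex, as in the second snippet
arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
multi = pd.DataFrame(np.random.randn(8, 4), index=arrays,
                     columns=['one', 'two', 'three', 'four'])

flags = {
    "simple index": is_flagged(simple.loc[[2, 4, 6], :]),
    "MultiIndex": is_flagged(multi.loc[(['bar', 'qux', 'foo'], 'one'), :]),
    "MultiIndex + copy()": is_flagged(
        multi.loc[(['bar', 'qux', 'foo'], 'one'), :].copy()),
}
print(flags)
```

On a version that emits the warning for the MultiIndex case only, the first entry would be expected to be False and the second True; the entry that calls .copy() should be False regardless.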
