The following piece of code works as expected, with no warnings. I create a dataframe, create two sub-dataframes from it using .loc, give them the same index, and then assign to a column of one of them.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(20, 4),
                  index=pd.Index(range(20)),
                  columns=['one', 'two', 'three', 'four'])
d1 = df.loc[[2, 4, 6], :]
d2 = df.loc[[3, 5, 7], :]
idx = pd.Index(list('abc'), name='foo')
d1.index = idx
d2.index = idx
d1['one'] = d1['one'] - d2['two']
However, if I do exactly the same thing with a multi-indexed dataframe, I get a SettingWithCopyWarning.
import numpy as np
import pandas as pd
arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays, columns=['one', 'two', 'three', 'four'])
d1 = df.loc[(['bar', 'qux', 'foo'], 'one'), :]
d2 = df.loc[(['bar', 'qux', 'foo'], 'two'), :]
idx = pd.Index(list('abc'), name='foo')
d1.index = idx
d2.index = idx
d1['one'] = d1['one'] - d2['two']
I know that I can avoid this warning by using .copy() when creating d1 and d2, but I struggle to understand why this is necessary in the second case but not in the first. The chained indexing is equally present in both cases, isn't it? Also, the operation works in both cases (i.e. d1 is modified but df is not). So, what's the difference?
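For reference, the .copy() workaround mentioned above might look like the sketch below. The `original` and `expected` names are only there for illustration; they let me check afterwards that df is left alone and that d1 received the new values.

```python
import numpy as np
import pandas as pd

arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays,
                  columns=['one', 'two', 'three', 'four'])
original = df.copy()  # illustrative snapshot, used only for the check below

# Explicit copies: d1 and d2 are independent objects from here on.
d1 = df.loc[(['bar', 'qux', 'foo'], 'one'), :].copy()
d2 = df.loc[(['bar', 'qux', 'foo'], 'two'), :].copy()

idx = pd.Index(list('abc'), name='foo')
d1.index = idx
d2.index = idx

expected = d1['one'] - d2['two']  # illustrative: keep the result for the check
d1['one'] = expected

# Prints whether df still matches the snapshot taken before the assignment.
print(df.equals(original))
```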
CodePudding user response:
You can use set_index, which returns a new DataFrame instead of modifying the slice in place, to avoid the warning:
import numpy as np
import pandas as pd
arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays, columns=['one', 'two', 'three', 'four'])
d1 = df.loc[(['bar', 'qux', 'foo'], 'one'), :]
d2 = df.loc[(['bar', 'qux', 'foo'], 'two'), :]
idx = pd.Index(list('abc'), name='foo')
d1 = d1.set_index(idx) # <- HERE
d2 = d2.set_index(idx) # <- HERE
d1['one'] = d1['one'] - d2['two']
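A small sketch to see what set_index does here: it hands back a different object, and the frame that came out of .loc keeps its own index. The names `d1_slice` and `d1_new` are made up for this example.

```python
import numpy as np
import pandas as pd

arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays,
                  columns=['one', 'two', 'three', 'four'])

d1_slice = df.loc[(['bar', 'qux', 'foo'], 'one'), :]  # result of the .loc call
idx = pd.Index(list('abc'), name='foo')

# set_index is not in-place by default; it builds and returns a new frame.
d1_new = d1_slice.set_index(idx)

print(d1_new is d1_slice)                # identity check between the two objects
print(type(d1_slice.index).__name__)     # index type of the untouched slice
print(type(d1_new.index).__name__)       # index type of the returned frame
```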
CodePudding user response:
I believe this falls into the internals of pandas. The decision to return a copy depends on several factors (dtype homogeneity, among others).
What you can do is check whether pandas has flagged the result as a copy by looking at its private _is_copy attribute, and force a copy if needed:
def ensure_copy(df):
    # _is_copy stays None unless pandas flagged the frame as taken from another object
    if df._is_copy is not None:
        return df.copy()
    return df
d1 = ensure_copy(df.loc[(['bar', 'qux', 'foo'], 'one'), :])
d2 = ensure_copy(df.loc[(['bar', 'qux', 'foo'], 'two'), :])
idx = pd.Index(list('abc'), name='foo')
d1.index = idx
d2.index = idx
d1['one'] = d1['one'] - d2['two']
Note that _is_copy is an internal pandas attribute, not a public one, so there is no guarantee that it will remain available in the future.
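If you want to see which of the two cases from the question gets flagged on your own installation, something along these lines should do. `has_parent_ref` is a hypothetical helper, not part of pandas, and since _is_copy is private the result may differ between pandas versions, so I am not claiming any particular output.

```python
import numpy as np
import pandas as pd

def has_parent_ref(frame):
    # Hypothetical helper. getattr with a default avoids an AttributeError
    # on pandas versions where the private attribute does not exist.
    return getattr(frame, "_is_copy", None) is not None

cols = ['one', 'two', 'three', 'four']
flat = pd.DataFrame(np.random.randn(20, 4), index=pd.Index(range(20)), columns=cols)

arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
multi = pd.DataFrame(np.random.randn(8, 4), index=arrays, columns=cols)

flat_slice = flat.loc[[2, 4, 6], :]
multi_slice = multi.loc[(['bar', 'qux', 'foo'], 'one'), :]

print("flat .loc result flagged: ", has_parent_ref(flat_slice))
print("multi .loc result flagged:", has_parent_ref(multi_slice))
print("after .copy() flagged:    ", has_parent_ref(multi_slice.copy()))
```

The assignment only warns for frames that carry this flag, which is why an explicit .copy() silences it.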