This is the first time I'm posting a question myself here. So far I've almost always found solutions to my problems in existing questions (What a great forum and community!). Please bear with me, though, if this question (or a very similar one) has been asked and answered elsewhere on stackoverflow.
I have a multiindex dataframe (test_data
) that contains different variables (outer level) for the same set of cities (inner level) and the same range of years (columns) that looks like this:
1990 1991 1992 1993 1994
VAR CITY
1 Berlin 40 41 42 43 44
Paris 36 35 34 33 32
London 30 30 30 30 30
2 Berlin 35 34 33 32 31
Paris 39 38 39 40 41
London 45 44 43 42 41
3 Berlin 24 25 26 27 28
Paris 24 24 25 26 27
London 29 29 29 30 31
2m Berlin 1 2 3 4 5
Paris 2 3 4 5 6
London 3 4 5 6 7
which one could obtain from this code:
test_dict = {(1,'Berlin'): [40,41,42,43,44],
(1,'Paris'): [36,35,34,33,32],
(1,'London'): [30,30,30,30,30],
(2,'Berlin'): [35,34,33,32,31],
(2,'Paris'): [39,38,39,40,41],
(2,'London'): [45,44,43,42,41],
(3,'Berlin'): [24,25,26,27,28],
(3,'Paris'): [24,24,25,26,27],
(3,'London'): [29,29,29,30,31],
('2m','Berlin'): [1,2,3,4,5],
('2m','Paris'): [2,3,4,5,6],
('2m','London'): [3,4,5,6,7]}
test_data = pd.DataFrame(test_dict, index=[1990,1991,1992,1993,1994]).transpose()
Now I want to set all values for variables 1 and 2 to NaN where the sum of variables 1 to 3 is less than 98 or greater than 102, i.e. Berlin 1994, Paris 1991, as well as London 1990 and 1991 (see below).
I have assigned a new DataFrame
df_sum = test_data.loc[[1,2,3]].sum(level=1)
df_sum
1990 1991 1992 1993 1994
Berlin 99 100 101 102 103
Paris 99 97 98 99 100
London 104 103 102 102 102
and set
mask = (df_sum < 98) | (df_sum > 102)
mask
1990 1991 1992 1993 1994
Berlin False False False False True
Paris False True False False False
London True True False False False
df_sum
and mask
are non-multiindex DataFrames obviously and have the same dimensions as test_data.loc[1], ...
Now I would like to do something like
for var in [1,2]: test_data.loc[var][mask] = np.nan
I understand why this doesn't work and yields a SettingWithCopy warning. However, so far I didn't manage to figure out a(n elegant) way to do this. I found this thread (Pandas: Apply mask to multiindex dataframe) and thought this was probably the right direction but the difference is that there the mask has the same dimensions as the original multiindex dataframe.
Any help is much appreciated.
edit: I wouldn't think that this would be an elegant solution but it doesn't even work and I really don't understand this behaviour:
for var in range(1,4):
tmp = test_data.loc[var].copy()
tmp[test_mask] = np.nan
test_data.loc[var] = tmp.copy()
This leads to test_data.loc[1]
, ...loc[2]
and ...loc[3]
being all NaNs although tmp
has only 4 NaNs after applying test_mask
.
CodePudding user response:
You're getting the warning because you're working with views, not copies, but if I understand correctly all you need is this:
# you had this bit, but groupby syntax is preferred
df_sum = test_data.loc[[1,2,3]].groupby(level=1).sum()
for city, years in ((df_sum < 98) | (df_sum > 102)).iterrows():
# get the years for which condition is True
for year in years[years].index:
test_data.loc[(slice(1,2), city), year] = np.nan
This uses the slice
syntax for Multi-index selection, you can read more about it here.