While comparing values from a PeriodIndex-typed column, and then writing a few data tests around them, I came across some confusing behavior. Any clarification on why these statements are processed differently would be great.
> d = {'col1': ['a','b','c'], 'col2':pd.PeriodIndex(data=['2022-01-01','2021-05-01','2020-10-01'], freq='Q')}
> tdf = pd.DataFrame.from_records(data=d)
> print(tdf.dtypes)
col1 object
col2 period[Q-DEC]
dtype: object
> print(tdf.col2[0])
2022Q1
> print(tdf.col2[0] == '2022Q1')
False
> print(tdf[tdf.col1 == 'a'].col2 == '2022Q1')
0 True
> assert tdf[tdf.col1 == 'a'].col2 == '2022Q1', 'doesnt match'
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
> assert all(tdf[tdf.col1 == 'a'].col2 == '2022Q1'), 'doesnt match'
# Passes no big deal
Accessing the same object in a different way changes the comparison result from False to True, and wrapping the single assertion in all() then lets it pass.
CodePudding user response:
In the first case, tdf.col2[0] == '2022Q1', the comparison is done using the Python == operator and the result is False, because you compare a Period (pd.Period('2022Q1')) to a string (the same way that 2 == '2' yields False).
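To make the scalar case concrete, here is a minimal sketch. It builds the Period directly instead of pulling it out of your DataFrame, and the variable names are only for illustration. If you want the scalar comparison to succeed, compare against a Period or against the string representation:

```python
import pandas as pd

period = pd.Period('2022Q1', freq='Q')

# Period vs. str: the types differ, so == evaluates to False.
mismatched = (period == '2022Q1')

# Period vs. Period: compared by value.
matched = (period == pd.Period('2022Q1', freq='Q'))

# Comparing the string representation with a string also works.
as_text = (str(period) == '2022Q1')

print(mismatched, matched, as_text)
```

Which of the two working variants is preferable probably depends on whether your test data is more naturally expressed as periods or as strings.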
In the second case, tdf[tdf.col1 == 'a'].col2 == '2022Q1', you compare a Series to a string, which goes through pandas' own comparison operations. See the following call stack:
...\lib\site-packages\pandas\core\ops\common.py(70)new_method()
-> return method(self, other)
...\lib\site-packages\pandas\core\arraylike.py(40)__eq__()
-> return self._cmp_method(other, operator.eq)
...\lib\site-packages\pandas\core\series.py(5623)_cmp_method()
-> res_values = ops.comparison_op(lvalues, rvalues, op)
...\lib\site-packages\pandas\core\ops\array_ops.py(269)comparison_op()
-> res_values = op(lvalues, rvalues)
...\lib\site-packages\pandas\core\ops\common.py(70)new_method()
-> return method(self, other)
...\lib\site-packages\pandas\core\arraylike.py(40)__eq__()
-> return self._cmp_method(other, operator.eq)
...\lib\site-packages\pandas\core\arrays\datetimelike.py(1008)_cmp_method()
-> other = self._validate_comparison_value(other)
...\lib\site-packages\pandas\core\arrays\datetimelike.py(528)_validate_comparison_value()
-> other = self._scalar_from_string(other)
> ...\lib\site-packages\pandas\core\arrays\period.py(331)_scalar_from_string()
-> return Period(value, freq=self.freq)
As you can see, the string '2022Q1' gets converted to a Period before the actual comparison is carried out, and hence the comparison of Period('2022Q1') with Period('2022Q1') is True.
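One way to check this from the outside, without stepping through the debugger, is to compare the Series once against the string and once against an explicit Period. This is a small sketch with a standalone Series (the name quarters is made up for the example), assuming the conversion shown in the call stack applies in your pandas version:

```python
import pandas as pd

quarters = pd.Series(pd.PeriodIndex(['2022Q1', '2021Q2', '2020Q4'], freq='Q'))

# Series vs. str: pandas parses the string into a Period first.
from_string = quarters == '2022Q1'

# Series vs. Period: no parsing needed.
from_period = quarters == pd.Period('2022Q1', freq='Q')

print(from_string.tolist())
print(from_string.equals(from_period))
```

Both comparisons should give the same boolean Series, which matches what the call stack suggests.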
I'm not sure if this conversion is intended behavior or a bug.
As for the assert part of your question: comparing a Series to something results in a boolean Series. When a Series of booleans is used in a condition (if or assert), the result must be exactly one True or False value, not a Series of True or False values. As the error message says, you need to decide how you want to reduce the Series of booleans to a single boolean. In your special case the Series has only one element, but it is still a Series, hence the error (for a Series of length 1 it of course makes no difference whether you use any or all or just [0]).
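The reduction options can be sketched as follows. The DataFrame is rebuilt here with a plain dict constructor and quarter strings rather than your from_records call, so treat it as an approximation of your setup:

```python
import pandas as pd

tdf = pd.DataFrame({
    'col1': ['a', 'b', 'c'],
    'col2': pd.PeriodIndex(['2022Q1', '2021Q2', '2020Q4'], freq='Q'),
})

# One-element boolean Series, as in the question.
mask = tdf.loc[tdf.col1 == 'a', 'col2'] == '2022Q1'

# Using the Series directly as a condition raises ValueError.
try:
    bool(mask)
    raised = False
except ValueError:
    raised = True

# Explicit reductions to a single boolean.
assert mask.all(), 'doesnt match'
assert mask.any(), 'doesnt match'
assert mask.item(), 'doesnt match'  # requires exactly one element
```

For data tests, note that all() on an empty Series is True, so a filter that matches no rows would pass silently. item() raises if the Series does not hold exactly one element, which may be the stricter check you want here.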