Strange assert/comparison behavior with single PeriodIndex object from Pandas series

Time:07-23

While comparing a PeriodIndex-type object and then writing a few data tests involving it, I came across some confusing behavior. Any clarification on why these statements are processed differently would be great.

> import pandas as pd
> d = {'col1': ['a','b','c'], 'col2':pd.PeriodIndex(data=['2022-01-01','2021-05-01','2020-10-01'], freq='Q')}
> tdf = pd.DataFrame.from_records(data=d)

> print(tdf.dtypes)
col1           object
col2    period[Q-DEC]
dtype: object

> print(tdf.col2[0])
2022Q1
> print(tdf.col2[0] == '2022Q1')
False
> print(tdf[tdf.col1 == 'a'].col2 == '2022Q1')
0    True

> assert tdf[tdf.col1 == 'a'].col2 == '2022Q1', 'doesnt match'
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

> assert all(tdf[tdf.col1 == 'a'].col2 == '2022Q1'), 'doesnt match'
# Passes no big deal

Accessing the same value in a different way changes the comparison result from False to True, and the assertion only passes once the comparison is wrapped in all().

CodePudding user response:

In the first case, tdf.col2[0] == '2022Q1', the left-hand side is a scalar Period (pd.Period('2022Q1')), so you are comparing a Period to a plain string. The scalar comparison does not convert the string, so the result is False (the same way 2 == '2' yields False).
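To make the scalar case concrete, here is a minimal sketch. The variable names are only illustrative, and it assumes the same pandas behavior the question shows (a scalar Period compared to a string yields False). It also shows two ways to get a meaningful scalar comparison: convert the string to a Period first, or compare string representations.

```python
import pandas as pd

# A scalar Period, like the one returned by tdf.col2[0] in the question.
p = pd.Period("2022Q1")

# Period vs. str: the string is not parsed, so the comparison is False.
scalar_vs_str = (p == "2022Q1")

# Period vs. Period: convert the string explicitly before comparing.
scalar_vs_period = (p == pd.Period("2022Q1"))

# str vs. str: compare the textual representation instead.
str_vs_str = (str(p) == "2022Q1")

print(scalar_vs_str, scalar_vs_period, str_vs_str)
```

Converting to a Period is probably the more robust option of the two, since it does not depend on how the period happens to be formatted.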

In the second case, tdf[tdf.col1 == 'a'].col2 == '2022Q1', you compare a Series to a string, which goes through the pandas comparison operations; see the following call stack:

  ...\lib\site-packages\pandas\core\ops\common.py(70)new_method()
-> return method(self, other)
  ...\lib\site-packages\pandas\core\arraylike.py(40)__eq__()
-> return self._cmp_method(other, operator.eq)
  ...\lib\site-packages\pandas\core\series.py(5623)_cmp_method()
-> res_values = ops.comparison_op(lvalues, rvalues, op)
  ...\lib\site-packages\pandas\core\ops\array_ops.py(269)comparison_op()
-> res_values = op(lvalues, rvalues)
  ...\lib\site-packages\pandas\core\ops\common.py(70)new_method()
-> return method(self, other)
  ...\lib\site-packages\pandas\core\arraylike.py(40)__eq__()
-> return self._cmp_method(other, operator.eq)
  ...\lib\site-packages\pandas\core\arrays\datetimelike.py(1008)_cmp_method()
-> other = self._validate_comparison_value(other)
  ...\lib\site-packages\pandas\core\arrays\datetimelike.py(528)_validate_comparison_value()
-> other = self._scalar_from_string(other)
> ...\lib\site-packages\pandas\core\arrays\period.py(331)_scalar_from_string()
-> return Period(value, freq=self.freq)

As you see, the string '2022Q1' gets converted to a Period before carrying out the actual comparison and hence the comparison of Period('2022Q1') with Period('2022Q1') is True.
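The effect of that conversion can be checked directly. The following sketch assumes a pandas version that behaves like the one in the traceback above; it builds the same period column as in the question and compares it once against the string and once against an explicit Period. Under that assumption both comparisons give the same element-wise result.

```python
import pandas as pd

# Same period values as col2 in the question.
col2 = pd.Series(
    pd.PeriodIndex(["2022-01-01", "2021-05-01", "2020-10-01"], freq="Q")
)

# Series vs. str: the string is converted to a Period with the column's freq.
via_string = col2 == "2022Q1"

# Series vs. Period: the explicit equivalent of the conversion above.
via_period = col2 == pd.Period("2022Q1", freq="Q")

print(via_string.tolist())
print(via_string.equals(via_period))
```

If you would rather not depend on the implicit conversion, comparing against an explicit pd.Period states the intent and does not rely on string parsing.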

I'm not sure if this conversion is intended behavior or a bug.



As for the assert part of your question: comparing a Series to something results in a boolean Series. When a condition is evaluated (if or assert), it must reduce to exactly one True or False value, not a Series of True or False values. As the error message says, you need to decide how you want to reduce the Series of booleans to a single boolean. In your special case the Series has only one element, but it is nevertheless a Series, hence the error. For a Series of length 1 it makes no difference whether you use any, all, or just take the single element. Note that [0] selects by index label, so it only works here because the matching row happens to have label 0; .iloc[0] selects by position.
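As a sketch of the options for reducing the comparison to a single boolean (the names mask and value are only illustrative), using the same data as in the question:

```python
import pandas as pd

tdf = pd.DataFrame({
    "col1": ["a", "b", "c"],
    "col2": pd.PeriodIndex(["2022-01-01", "2021-05-01", "2020-10-01"], freq="Q"),
})

# Boolean Series with one element (the row where col1 == 'a').
mask = tdf[tdf.col1 == "a"].col2 == "2022Q1"

# Option 1: reduce the boolean Series explicitly.
assert mask.all(), "doesnt match"
assert mask.any(), "doesnt match"

# Option 2: take the single element by position.
assert mask.iloc[0], "doesnt match"

# Option 3: extract the scalar first and compare Period to Period.
value = tdf.loc[tdf.col1 == "a", "col2"].iloc[0]
assert value == pd.Period("2022Q1", freq="Q"), "doesnt match"
```

One caveat for data tests: all() on an empty Series is True, so if the filter matches no rows, the first assertion passes vacuously. Checking len(mask) as well, or using .iloc[0] (which raises on an empty Series), avoids that.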
