pandas.cut() with NA values causing "boolean value of NA is ambiguous"

Time:01-23

I would like to understand why this code raises a TypeError.

import pandas
pandas.cut(x=[1, 2, pandas.NA, 4, 5, 6, 7], bins=3)

The full error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/reshape/tile.py", line 293, in cut
    fac, bins = _bins_to_cuts(
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/reshape/tile.py", line 428, in _bins_to_cuts
    ids = ensure_platform_int(bins.searchsorted(x, side=side))
  File "pandas/_libs/missing.pyx", line 382, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous

Of course, the values contain a missing value (pandas.NA). But the Notes section of the docs says:

Any NA values will be NA in the result. Out of bounds values will be NA in the resulting Series or Categorical object.

In my understanding of the docs this shouldn't throw an error.

CodePudding user response:

Looks like pd.cut doesn't behave consistently when it hits the (relatively new) pd.NA value, and it's not the only one.

Please take some time to report it and relate it to the main issue.

In the meantime, you can work around it by wrapping the values in an IntegerArray, which allows null values for integer types:

# Using IntegerArray
In [1]: import pandas as pd

In [2]: pd.cut(x=pd.array([1, 2, pd.NA, 4, 5, 6, 7]), bins=3)
Out[2]:
[(0.994, 3.0], (0.994, 3.0], NaN, (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (5.0, 7.0]]
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] < (5.0, 7.0]]

Or if you don't like using an experimental API you can use np.array, although this will change the dtype to float:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: pd.cut(x=np.array([1, 2, np.nan, 4, 5, 6, 7]), bins=3)
Out[3]:
[(0.994, 3.0], (0.994, 3.0], NaN, (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (5.0, 7.0]]
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] < (5.0, 7.0]]
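
If the data already lives in a list that contains pd.NA, the two workarounds above can be combined. The following is only a sketch, not an official recipe: it builds a nullable Int64 array and then exports it to a float64 NumPy array, with np.nan standing in for the missing value. The variable names (values, as_float, binned, codes) are made up for illustration.

```python
import numpy as np
import pandas as pd

values = [1, 2, pd.NA, 4, 5, 6, 7]

# Nullable Int64 array first, then a plain float64 array where pd.NA becomes np.nan.
as_float = pd.array(values, dtype="Int64").to_numpy(dtype="float64", na_value=np.nan)

# pd.cut handles np.nan in a float array; the missing entry stays missing.
binned = pd.cut(x=as_float, bins=3)

# Integer codes of the resulting Categorical; -1 marks the missing value.
codes = binned.codes.tolist()
print(codes)
```

As with the np.array variant, the price is that the integers are converted to floats before binning.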

Hope this helps.

CodePudding user response:

... to understand why this code does raise TypeError

It's in the nature of the pd.NA value (https://pandas.pydata.org/pandas-docs/version/1.0.0/user_guide/missing_data.html).

Experimental NA scalar to denote missing values

Warning Experimental: the behaviour of pd.NA can still change without warning.

...
In general, missing values propagate in operations involving pd.NA. When one of the operands is unknown, the outcome of the operation is also unknown.

...
In equality and comparison operations, pd.NA also propagates.

...
Since the actual value of an NA is unknown, it is ambiguous to convert NA to a boolean value. The following raises an error: TypeError: boolean value of NA is ambiguous

pd.cut uses np.searchsorted under the hood in its internal _bins_to_cuts function, which in your case fails on the line ids = ensure_platform_int(bins.searchsorted(x, side=side)), where x holds the values you passed to pd.cut (including pd.NA) and bins holds the bin edges.

Then, diving into np.searchsorted: to find the insertion indices, it internally does comparison operations like a[i-1] < v <= a[i] (for side='left') or a[i-1] <= v < a[i] (for side='right').
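
To make the two conditions concrete, here is a small sketch with ordinary numbers only. The edges are assumed to be the ones pd.cut reports for this input with bins=3; the sample values are picked for illustration.

```python
import numpy as np

# Bin edges as shown in the pd.cut output for this data (assumed here).
edges = np.array([0.994, 3.0, 5.0, 7.0])
samples = np.array([1.0, 3.0, 4.0])

# side='left': index i such that edges[i-1] < v <= edges[i]
left = np.searchsorted(edges, samples, side="left").tolist()

# side='right': index i such that edges[i-1] <= v < edges[i]
right = np.searchsorted(edges, samples, side="right").tolist()

print(left)
print(right)
```

Only the value 3.0, which sits exactly on an edge, gets a different index depending on side. Either way, every step needs a comparison that yields a plain True or False.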

So with your input list [1, 2, pd.NA, 4, 5, 6, 7], any comparison like <value> <= pd.NA gives pd.NA instead of a logical value (True/False). That is indeed ambiguous, and it fails with the error above.

In [372]: 1 <= pd.NA
Out[372]: <NA>
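
The failure can be reproduced without pd.cut. The sketch below uses a hand-written binary search as a stand-in for the search step; it is a simplified illustration, not NumPy's or pandas' actual implementation, and insertion_index is a hypothetical helper name.

```python
import pandas as pd

def insertion_index(edges, value):
    """Left-side binary search over sorted edges (illustrative only)."""
    lo, hi = 0, len(edges)
    while lo < hi:
        mid = (lo + hi) // 2
        # The `if` needs a plain True/False; with pd.NA the comparison
        # returns pd.NA, and converting that to bool raises TypeError.
        if edges[mid] < value:
            lo = mid + 1
        else:
            hi = mid
    return lo

edges = [1, 3, 5, 7]

# An ordinary number works: 3 < 4 <= 5, so the index is 2.
ok_index = insertion_index(edges, 4)

# pd.NA fails at the first comparison.
try:
    insertion_index(edges, pd.NA)
    failure = None
except TypeError as exc:
    failure = str(exc)

print(ok_index)
print(failure)
```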