pd.cut with datetime IntervalIndex as bins

Time:08-15

With the following code, I expect these timestamps to be binned into the periods provided through an IntervalIndex. Unfortunately, I only get NaN back. What is the problem?

import pandas as pd

# Test data
ts = [pd.Timestamp('2022/03/01 09:00'),
      pd.Timestamp('2022/03/01 10:00'),
      pd.Timestamp('2022/03/01 10:30'),
      pd.Timestamp('2022/03/01 15:00')]
df = pd.DataFrame({'a':range(len(ts)), 'ts': ts})
# Test
bins = pd.interval_range(pd.Timestamp('2022/03/01 08:00'),
                         pd.Timestamp('2022/03/01 16:00'),
                         freq='2H',
                         closed="left")
row_labels = pd.cut(df["ts"], bins)

I am expecting the result to be:

[2022-03-01 08:00:00, 2022-03-01 10:00:00)
[2022-03-01 10:00:00, 2022-03-01 12:00:00)
[2022-03-01 10:00:00, 2022-03-01 12:00:00)
[2022-03-01 14:00:00, 2022-03-01 16:00:00)

But I only get NaN.

row_labels
Out[37]: 
0    NaN
1    NaN
2    NaN
3    NaN
Name: ts, dtype: category
Categories (4, interval[datetime64[ns], left]): [
                                                 [2022-03-01 08:00:00, 2022-03-01 10:00:00) <
                                                 [2022-03-01 10:00:00, 2022-03-01 12:00:00) <
                                                 [2022-03-01 12:00:00, 2022-03-01 14:00:00) <
                                                 [2022-03-01 14:00:00, 2022-03-01 16:00:00)]

What is going wrong here? Thanks for your help.

CodePudding user response:

Very interesting...

pd.cut(df['ts'].to_list(), bins)

produces the expected result

[[2022-03-01 08:00:00, 2022-03-01 10:00:00), 
 [2022-03-01 10:00:00, 2022-03-01 12:00:00), 
 [2022-03-01 10:00:00, 2022-03-01 12:00:00), 
 [2022-03-01 14:00:00, 2022-03-01 16:00:00)]

Categories (4, interval[datetime64[ns], left]): [
                [2022-03-01 08:00:00, 2022-03-01 10:00:00) < 
                [2022-03-01 10:00:00, 2022-03-01 12:00:00) < 
                [2022-03-01 12:00:00, 2022-03-01 14:00:00) < 
                [2022-03-01 14:00:00, 2022-03-01 16:00:00)]

BUT!

pd.cut(df['ts'].to_numpy(), bins)
[NaN, NaN, NaN, NaN]

Categories (4, interval[datetime64[ns], left]): [
                [2022-03-01 08:00:00, 2022-03-01 10:00:00) < 
                [2022-03-01 10:00:00, 2022-03-01 12:00:00) < 
                [2022-03-01 12:00:00, 2022-03-01 14:00:00) < 
                [2022-03-01 14:00:00, 2022-03-01 16:00:00)]

What??

Why does it work with a list but NOT with an np.ndarray or a pd.Series?
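
Until the cause is clear, one way around the difference is to skip pd.cut and ask the IntervalIndex directly which bin holds each timestamp. This is only a minimal sketch, not something the pandas docs recommend for this case: it assumes the bins do not overlap, and the names `edges`, `codes` and `labels` are mine. The bin edges are written out by hand just to keep the example independent of frequency aliases.

```python
import pandas as pd

# Same test data as in the question.
ts = pd.Series(pd.to_datetime(["2022-03-01 09:00",
                               "2022-03-01 10:00",
                               "2022-03-01 10:30",
                               "2022-03-01 15:00"]), name="ts")

# Bin edges every two hours from 08:00 to 16:00 (illustrative construction).
edges = pd.to_datetime([f"2022-03-01 {hour:02d}:00" for hour in range(8, 17, 2)])
bins = pd.IntervalIndex.from_breaks(edges, closed="left")

# Position of the interval that contains each timestamp; -1 means "no bin".
codes = bins.get_indexer(pd.DatetimeIndex(ts))

# Turn the positions into an ordered categorical whose categories are the bins.
labels = pd.Series(
    pd.Categorical.from_codes(codes, categories=bins, ordered=True),
    index=ts.index,
    name="ts",
)
print(labels)
```

A code of -1 means the timestamp fell outside every bin; Categorical.from_codes turns it into NaN, which matches what pd.cut does for out-of-range values.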


ALSO :

bins_dt_index = pd.date_range(pd.Timestamp('2022/03/01 08:00'),
                              pd.Timestamp('2022/03/01 16:00'),
                              freq='2H')
bins_dt_index
DatetimeIndex(['2022-03-01 08:00:00', '2022-03-01 10:00:00',
               '2022-03-01 12:00:00', '2022-03-01 14:00:00',
               '2022-03-01 16:00:00'],
              dtype='datetime64[ns]', freq='2H')
pd.cut(df['ts'].to_list(), bins_dt_index, right=False)

produces

TypeError: '<' not supported between instances of 'int' and 'Timestamp'

while

pd.cut(df['ts'], bins_dt_index, right=False)

produces the expected result!

0    [2022-03-01 08:00:00, 2022-03-01 10:00:00)
1    [2022-03-01 10:00:00, 2022-03-01 12:00:00)
2    [2022-03-01 10:00:00, 2022-03-01 12:00:00)
3    [2022-03-01 14:00:00, 2022-03-01 16:00:00)
Name: ts, dtype: category

Categories (4, interval[datetime64[ns], left]): [
                [2022-03-01 08:00:00, 2022-03-01 10:00:00) < 
                [2022-03-01 10:00:00, 2022-03-01 12:00:00) < 
                [2022-03-01 12:00:00, 2022-03-01 14:00:00) < 
                [2022-03-01 14:00:00, 2022-03-01 16:00:00)]

So DatetimeIndex bins work with np.ndarray and pd.Series but DON'T work with a list!

And with IntervalIndex bins it is the other way round!
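
Since this kind of behaviour may well differ between pandas releases, here is a small sketch that runs every input type against both kinds of bins and reports what happens on the installed version. The helper `summarize` and the dictionary names are hypothetical, and I deliberately show no expected output because I would not assume the result is the same everywhere.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2022-03-01 09:00",
                               "2022-03-01 10:00",
                               "2022-03-01 10:30",
                               "2022-03-01 15:00"]), name="ts")

# The same two-hour edges, once as a DatetimeIndex and once as an IntervalIndex.
edges = pd.to_datetime([f"2022-03-01 {hour:02d}:00" for hour in range(8, 17, 2)])
interval_bins = pd.IntervalIndex.from_breaks(edges, closed="left")

inputs = {"list": ts.tolist(), "ndarray": ts.to_numpy(), "Series": ts}
bin_kinds = {
    "IntervalIndex": (interval_bins, {}),
    "DatetimeIndex": (edges, {"right": False}),  # left-closed, like the intervals
}

def summarize(x, bins, **kwargs):
    """Run pd.cut and describe the outcome instead of raising."""
    try:
        result = pd.cut(x, bins, **kwargs)
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    n_missing = int(pd.isna(result).sum())
    return f"{n_missing} of {len(result)} values unbinned"

report = {
    (x_name, bins_name): summarize(x, bins, **kwargs)
    for bins_name, (bins, kwargs) in bin_kinds.items()
    for x_name, x in inputs.items()
}

print("pandas", pd.__version__)
for (x_name, bins_name), outcome in report.items():
    print(f"{x_name:8s} with {bins_name:14s} -> {outcome}")
```

If every row reports "0 of 4 values unbinned", the inconsistency described above does not occur on that version.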

Shouldn't they all work the same? I mean, pd.cut clearly states that x can be a 1-dimensional array-like.

It would be great if someone could explain why this happens!
