With the following code, I expect these timestamps to be binned into the periods given by an IntervalIndex.
Unfortunately, I only get NaN returned.
What is going wrong?
import pandas as pd
# Test data
ts = [pd.Timestamp('2022/03/01 09:00'),
      pd.Timestamp('2022/03/01 10:00'),
      pd.Timestamp('2022/03/01 10:30'),
      pd.Timestamp('2022/03/01 15:00')]
df = pd.DataFrame({'a': range(len(ts)), 'ts': ts})
# Test
bins = pd.interval_range(pd.Timestamp('2022/03/01 08:00'),
                         pd.Timestamp('2022/03/01 16:00'),
                         freq='2H',
                         closed="left")
row_labels = pd.cut(df["ts"], bins)
I am expecting the result to be:
[2022-03-01 08:00:00, 2022-03-01 10:00:00)
[2022-03-01 10:00:00, 2022-03-01 12:00:00)
[2022-03-01 10:00:00, 2022-03-01 12:00:00)
[2022-03-01 14:00:00, 2022-03-01 16:00:00)
But I only get NaN:
row_labels
Out[37]:
0 NaN
1 NaN
2 NaN
3 NaN
Name: ts, dtype: category
Categories (4, interval[datetime64[ns], left]): [
[2022-03-01 08:00:00, 2022-03-01 10:00:00) <
[2022-03-01 10:00:00, 2022-03-01 12:00:00) <
[2022-03-01 12:00:00, 2022-03-01 14:00:00) <
[2022-03-01 14:00:00, 2022-03-01 16:00:00)]
Thanks for your help.
CodePudding user response:
Very interesting...
pd.cut(df['ts'].to_list(), bins)
produces the expected result
[[2022-03-01 08:00:00, 2022-03-01 10:00:00),
[2022-03-01 10:00:00, 2022-03-01 12:00:00),
[2022-03-01 10:00:00, 2022-03-01 12:00:00),
[2022-03-01 14:00:00, 2022-03-01 16:00:00)]
Categories (4, interval[datetime64[ns], left]): [
[2022-03-01 08:00:00, 2022-03-01 10:00:00) <
[2022-03-01 10:00:00, 2022-03-01 12:00:00) <
[2022-03-01 12:00:00, 2022-03-01 14:00:00) <
[2022-03-01 14:00:00, 2022-03-01 16:00:00)]
BUT!
pd.cut(df['ts'].to_numpy(), bins)
[NaN, NaN, NaN, NaN]
Categories (4, interval[datetime64[ns], left]): [
[2022-03-01 08:00:00, 2022-03-01 10:00:00) <
[2022-03-01 10:00:00, 2022-03-01 12:00:00) <
[2022-03-01 12:00:00, 2022-03-01 14:00:00) <
[2022-03-01 14:00:00, 2022-03-01 16:00:00)]
What? Why does it work with a list but not with an np.ndarray or a pd.Series?
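One thing that can at least be checked is what each of the three inputs looks like before pd.cut sees it. The sketch below only inspects dtypes; it does not prove where pd.cut goes wrong, and the variable names (as_list, as_array and so on) are mine, not from the question.

```python
import numpy as np
import pandas as pd

# Same test data as in the question.
ts = [pd.Timestamp('2022/03/01 09:00'),
      pd.Timestamp('2022/03/01 10:00'),
      pd.Timestamp('2022/03/01 10:30'),
      pd.Timestamp('2022/03/01 15:00')]
df = pd.DataFrame({'a': range(len(ts)), 'ts': ts})

as_list = df['ts'].to_list()     # plain Python list of pd.Timestamp objects
as_array = df['ts'].to_numpy()   # NumPy array with a datetime64 dtype

list_dtype = np.asarray(as_list).dtype   # object: no datetime dtype attached
array_dtype = as_array.dtype             # datetime64
series_dtype = df['ts'].dtype            # datetime64

print(type(as_list[0]).__name__, list_dtype)
print(array_dtype, series_dtype)
```

So the list carries no datetime dtype, while the array and the Series do. My guess (an assumption, not checked against the pandas source) is that pd.cut treats datetime-typed input differently from untyped input, which would fit the list behaving differently from the other two.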
Also:
bins_dt_index = pd.date_range(pd.Timestamp('2022/03/01 08:00'),
                              pd.Timestamp('2022/03/01 16:00'),
                              freq='2H')
bins_dt_index
DatetimeIndex(['2022-03-01 08:00:00', '2022-03-01 10:00:00',
               '2022-03-01 12:00:00', '2022-03-01 14:00:00',
               '2022-03-01 16:00:00'],
              dtype='datetime64[ns]', freq='2H')
pd.cut(df['ts'].to_list(), bins_dt_index, right=False)
produces
TypeError: '<' not supported between instances of 'int' and 'Timestamp'
while
pd.cut(df['ts'], bins_dt_index, right=False)
produces the expected result!
0 [2022-03-01 08:00:00, 2022-03-01 10:00:00)
1 [2022-03-01 10:00:00, 2022-03-01 12:00:00)
2 [2022-03-01 10:00:00, 2022-03-01 12:00:00)
3 [2022-03-01 14:00:00, 2022-03-01 16:00:00)
Name: ts, dtype: category
Categories (4, interval[datetime64[ns], left]): [
[2022-03-01 08:00:00, 2022-03-01 10:00:00) <
[2022-03-01 10:00:00, 2022-03-01 12:00:00) <
[2022-03-01 12:00:00, 2022-03-01 14:00:00) <
[2022-03-01 14:00:00, 2022-03-01 16:00:00)]
So a DatetimeIndex works with np.ndarray and pd.Series but doesn't work with a list, and an IntervalIndex does the opposite!
Shouldn't they all work the same? I mean, pd.cut clearly states that x can be a 1-dimensional array-like.
It would be great if someone explained why this happens!
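Until someone explains it, here is a workaround sketch that avoids pd.cut altogether: ask the IntervalIndex directly which interval contains each timestamp, then build the categorical from those positions. I build the bins with pd.IntervalIndex.from_breaks and periods=5 instead of freq='2H' only to keep the snippet independent of frequency aliases; pos and edges are names I made up for this example.

```python
import pandas as pd

# Same timestamps as in the question, as a datetime Series.
ts = pd.Series(pd.to_datetime(['2022-03-01 09:00',
                               '2022-03-01 10:00',
                               '2022-03-01 10:30',
                               '2022-03-01 15:00']), name='ts')

# Five evenly spaced edges from 08:00 to 16:00 give four 2-hour bins.
edges = pd.date_range('2022-03-01 08:00', '2022-03-01 16:00', periods=5)
bins = pd.IntervalIndex.from_breaks(edges, closed='left')

# Position of the interval containing each timestamp (-1 if none matches).
pos = bins.get_indexer(pd.DatetimeIndex(ts))   # positions 0, 1, 1, 3

# Turn the positions into an ordered categorical of intervals;
# a code of -1 becomes NaN.
row_labels = pd.Series(
    pd.Categorical.from_codes(pos, categories=bins, ordered=True),
    index=ts.index, name=ts.name)
print(row_labels)
```

If you would rather stay with pd.cut, the combination shown above (a Series as x, DatetimeIndex edges as bins, right=False) is the one that gave the expected labels.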