np.select not working with datetime64[ns]

Time:11-07

I have a series of datetime values with a timezone.

I want to use np.select so that:

  • if the hour is after 12 -> return the next day
  • if the hour is before 11 -> return the same day
  • else return np.nan

With the timezone-aware datetime values, it works. However, if I use np.select after removing the timezone, I get the following error:

TypeError: Choicelists and default value do not have a common dtype: The DType <class 'numpy.dtype[datetime64]'> could not be promoted by <class 'numpy.dtype[float64]'>. This means that no common DType exists for the given inputs. For example they cannot be stored in a single array unless the dtype is `object`. The full list of DTypes is: (<class 'numpy.dtype[datetime64]'>, <class 'numpy.dtype[float64]'>)

Here is my code:

import pandas as pd 
import numpy as np
from datetime import timedelta
import datetime

datetime_series = pd.Series(['2022-09-24 22:00:00+02:00', '2022-09-04 11:30:00+02:00', '2022-11-11 02:20:30+02:00', '2022-11-12 03:20:30+02:00'])
# make datetime
datetime_series = pd.to_datetime(datetime_series, errors='coerce')
# remove timezone
datetime_series_no_timezone = datetime_series.dt.tz_localize(None)

print ('datetime_series dtype: ', datetime_series.dtype)
print ('datetime_series_no_timezone dtype: ', datetime_series_no_timezone.dtype)

 # with timezone it works
conditions = [
        datetime_series.dt.hour > 12,
        datetime_series.dt.hour < 11]
choiches = [
        (datetime_series + datetime.timedelta(days=1)),
        datetime_series ]

print (np.select(conditions, choiches, default=np.nan))

 # without timezone it doesn't 
conditions = [
        datetime_series_no_timezone.dt.hour > 12,
        datetime_series_no_timezone.dt.hour < 11]
choiches = [
        (datetime_series_no_timezone + datetime.timedelta(days=1)),
        datetime_series_no_timezone ]
print (np.select(conditions, choiches, default=np.nan))

OUT:

datetime_series dtype:  datetime64[ns, pytz.FixedOffset(120)]
datetime_series_no_timezone dtype:  datetime64[ns]
[Timestamp('2022-09-25 22:00:00+0200', tz='pytz.FixedOffset(120)') nan
 Timestamp('2022-11-11 02:20:30+0200', tz='pytz.FixedOffset(120)')
 Timestamp('2022-11-12 03:20:30+0200', tz='pytz.FixedOffset(120)')]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-63a5d209c9ba> in <module>
     30         (datetime_series_no_timezone + datetime.timedelta(days=1)),
     31         datetime_series_no_timezone ]
---> 32 print (np.select(conditions, choiches, default=np.nan))

<__array_function__ internals> in select(*args, **kwargs)

/usr/local/lib/python3.7/dist-packages/numpy/lib/function_base.py in select(condlist, choicelist, default)
    687     except TypeError as e:
    688         msg = f'Choicelists and default value do not have a common dtype: {e}'
--> 689         raise TypeError(msg) from None
    690 
    691     # Convert conditions to arrays and broadcast conditions and choices

CodePudding user response:

Use a suitable default to avoid the error; here you can use pandas' not-a-time value:

import pandas as pd
import numpy as np

datetime_series = pd.Series(
    [
        "2022-09-24 22:00:00+02:00",
        "2022-09-04 11:30:00+02:00",
        "2022-11-11 02:20:30+02:00",
        "2022-11-12 03:20:30+02:00",
    ]
)
datetime_series = pd.to_datetime(datetime_series, errors="coerce")
datetime_series_no_timezone = datetime_series.dt.tz_localize(None)

conditions = [
    datetime_series_no_timezone.dt.hour > 12,
    datetime_series_no_timezone.dt.hour < 11,
]
choiches = [
    (datetime_series_no_timezone + pd.Timedelta(days=1)),
    datetime_series_no_timezone,
]

s = np.select(conditions, choiches, default=pd.NaT)
print(s)
# [1664143200000000000 NaT 1668133230000000000 1668223230000000000]

Note that the datetime values are converted to integer nanoseconds since the Unix epoch in the process.
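If you would rather get datetime64 values back instead of integers, one option is to pass NumPy's own not-a-time value as the default, so that the choices and the default share a datetime dtype. The following is a minimal sketch under that assumption; the names `naive` and `result` are illustrative, and the series is built directly from timezone-naive strings to keep the example short.

```python
import numpy as np
import pandas as pd

# Timezone-naive datetimes (same wall-clock values as in the question).
naive = pd.Series(
    pd.to_datetime(
        [
            "2022-09-24 22:00:00",
            "2022-09-04 11:30:00",
            "2022-11-11 02:20:30",
            "2022-11-12 03:20:30",
        ]
    )
)

conditions = [naive.dt.hour > 12, naive.dt.hour < 11]
choices = [naive + pd.Timedelta(days=1), naive]

# np.datetime64("NaT") is itself a datetime64 value, so NumPy can promote
# the choices and the default to one datetime64 dtype.
selected = np.select(conditions, choices, default=np.datetime64("NaT"))

# Wrap the result in a Series again to get the .dt accessor back.
result = pd.Series(selected, index=naive.index)
print(result)
```

Rows that match neither condition (here the 11:30 entry) come out as NaT, and the other rows stay real datetimes, so no conversion back from epoch nanoseconds is needed.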


As to why this happens, if you look at the numpy representation of the series,

datetime_series.to_numpy()
Out[26]: 
array([Timestamp('2022-09-24 22:00:00+0200', tz='pytz.FixedOffset(120)'),
       Timestamp('2022-09-04 11:30:00+0200', tz='pytz.FixedOffset(120)'),
       Timestamp('2022-11-11 02:20:30+0200', tz='pytz.FixedOffset(120)'),
       Timestamp('2022-11-12 03:20:30+0200', tz='pytz.FixedOffset(120)')],
      dtype=object)

datetime_series_no_timezone.to_numpy()
Out[27]: 
array(['2022-09-24T22:00:00.000000000', '2022-09-04T11:30:00.000000000',
       '2022-11-11T02:20:30.000000000', '2022-11-12T03:20:30.000000000'],
      dtype='datetime64[ns]')

you see that in the first case, numpy cannot find a suitable native dtype and falls back to 'object' because of the pytz fixed offset. In the second case, numpy uses datetime64[ns]. I'd assume that when those arrays are passed to np.select, no attempt is made to find a common dtype in the first case, since object can hold anything. In the second case such an attempt is made and fails: np.nan is a float, while the choices are datetime64 (or int, if converted to Unix time in nanoseconds).
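The promotion step can be checked in isolation with np.result_type, which the traceback in the question suggests np.select relies on. This is only a sketch for a reasonably recent NumPy (older releases may behave differently), and the variable names are made up for illustration.

```python
import numpy as np

dt = np.array(["2022-09-24T22:00:00"], dtype="datetime64[ns]")
obj = np.array([None], dtype=object)  # stand-in for an object array of Timestamps
nan_default = np.asarray(np.nan)      # float64, as with default=np.nan

# datetime64 and float64 have no common dtype, so promotion raises TypeError.
try:
    np.result_type(dt, nan_default)
    promotion_failed = False
except TypeError:
    promotion_failed = True

# object absorbs anything, so an object array plus a float default is fine.
with_object = np.result_type(obj, nan_default)

# A datetime64 NaT default promotes cleanly with datetime64[ns].
with_nat = np.result_type(dt, np.asarray(np.datetime64("NaT")))

print(promotion_failed, with_object, with_nat)
```

This matches the two cases above: the timezone-aware series reaches np.select as an object array, so any default is accepted, while the naive series arrives as datetime64[ns] and needs a default that is itself datetime-like.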
