Using numpy digitize with datetime data

Time:11-04

I am trying to bin a long time series of measurements into a number of discrete bins. The measurement times are held in a NumPy array of datetime objects, t_data.

I also generate the bin edges as an array of datetime objects, t_edges.

When I print out both arrays their contents display as a series of datetime.datetime(...) items.

I then try to assign each measurement in t_data to the relevant bin using:

t_bin = np.digitize(t_data, t_edges)

However, this results in the following error:

  File "<__array_function__ internals>", line 5, in digitize
  File "python3.9/site-packages/numpy/lib/function_base.py", line 4922, in digitize
    mono = _monotonicity(bins)
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

It seems to be an issue with the datatypes, but I have done some searching and am not sure how to correct it. It seems that one is being classed as an 'object', 'O', whilst the other is a float? Reading the error message, I note that the data series and bins should all be increasing monotonically, but perhaps being a datetime confuses this? I am aware of this question, which seems to describe a similar issue with datetime64, but it did not receive an answer.

If anyone can give me something to try and resolve this to make it work (or tell me if it is impossible to use np.digitize() with datetime series) I would be grateful.

Minimal working example:

from datetime import datetime, timedelta
import numpy as np

sdate = datetime.strptime('2017-01-01 18:00:00', "%Y-%m-%d %H:%M:%S")
edate = datetime.strptime('2017-01-01 18:00:30', "%Y-%m-%d %H:%M:%S")

t_data = np.array([sdate + timedelta(minutes=x) for x in range((edate - sdate).seconds)])

t_edges = np.array([datetime.strptime('2017-01-01 18:00:00', "%Y-%m-%d %H:%M:%S"),
                   datetime.strptime('2017-01-01 18:00:10',"%Y-%m-%d %H:%M:%S"),
                   datetime.strptime('2017-01-01 18:00:20', "%Y-%m-%d %H:%M:%S")])

t_bin = np.digitize(t_data, t_edges)

I'd be expecting the result to be of the form [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, ....., 3, 3, 3, 3, 3]

Answer:

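It seems NumPy is storing your datetime objects in arrays of dtype object ('O'), and np.digitize cannot safely cast that dtype to float64, which is what the error message is reporting.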
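A quick way to confirm this diagnosis is to inspect the dtype directly. The snippet below is only a minimal sketch; the array name t and its two sample values are made up for illustration.

```python
from datetime import datetime
import numpy as np

# Hypothetical two-element array, built the same way as t_data / t_edges.
t = np.array([datetime(2017, 1, 1, 18, 0, 0), datetime(2017, 1, 1, 18, 0, 10)])

# An array of datetime.datetime items gets the generic object dtype.
print(t.dtype)

# Object arrays cannot be cast to float64 under the 'safe' rule,
# which is the cast the traceback complains about.
print(np.can_cast(t.dtype, np.float64, casting="safe"))
```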

I would suggest casting your datetime objects to timestamps before applying np.digitize.

Example:

from datetime import datetime, timedelta
import numpy as np

sdate = datetime.strptime('2017-01-01 18:00:00', "%Y-%m-%d %H:%M:%S")
edate = datetime.strptime('2017-01-01 18:00:30', "%Y-%m-%d %H:%M:%S")

t_data = np.array([(sdate + timedelta(seconds=x)) for x in range((edate - sdate).seconds)])

t_edges = np.array([datetime.strptime('2017-01-01 18:00:00', "%Y-%m-%d %H:%M:%S"),
                   datetime.strptime('2017-01-01 18:00:10',"%Y-%m-%d %H:%M:%S"),
                   datetime.strptime('2017-01-01 18:00:20', "%Y-%m-%d %H:%M:%S")])

t_data_ts = [datetime.timestamp(t) for t in t_data]
t_edges_ts = [datetime.timestamp(t) for t in t_edges]

t_bin = np.digitize(t_data_ts, t_edges_ts)

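Note that I also changed your code where it looked wrong: t_data now steps with timedelta(seconds=x) instead of timedelta(minutes=x). With minutes, every point after the first falls beyond the last bin edge.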
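One possible variation, offered as a sketch rather than the definitive approach: datetime.timestamp() interprets naive datetimes in the machine's local time zone, so an alternative is to build datetime64[s] arrays and view them as integer seconds before binning. The variable names t_data_s and t_edges_s are my own, and I build the arrays with the datetime constructor for brevity.

```python
from datetime import datetime, timedelta
import numpy as np

sdate = datetime(2017, 1, 1, 18, 0, 0)

# 30 samples, one per second, stored as datetime64 with second resolution.
t_data = np.array([sdate + timedelta(seconds=x) for x in range(30)],
                  dtype="datetime64[s]")

# Bin edges at 18:00:00, 18:00:10 and 18:00:20.
t_edges = np.array([sdate + timedelta(seconds=s) for s in (0, 10, 20)],
                   dtype="datetime64[s]")

# Convert to plain int64 seconds since the epoch; int64 casts safely to
# float64, so np.digitize accepts these arrays.
t_data_s = t_data.astype("int64")
t_edges_s = t_edges.astype("int64")

t_bin = np.digitize(t_data_s, t_edges_s)
print(t_bin)
```

With these inputs the first ten samples land in bin 1, the next ten in bin 2, and the last ten in bin 3, which matches the form of result the question asks for.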
