I am having huge problems with the time format in pandas and python. It seems that at some level there is some incompatibility that I cannot solve...
I am trying to select lines, from a panda dataframe that overlap with a given date.
Let's initialize a random data frame:
import random
import pandas as pd
import numpy as np
from datetime import timedelta
def random_dates(start, end, n, unit='D', seed=None):
if not seed: # from piR's answer
np.random.seed(0)
ndays = (end - start).days 1
return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit) start
np.random.seed(0)
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
rd=random_dates(start, end, 1000)
end_date = (rd timedelta(days=4))
a_pd=pd.DataFrame({"idx":np.random.choice(50, 1000), "start":rd, "end":end_date})
Now I create a single interval date using the same function I used to "rd".
start = pd.to_datetime('2016-01-01')
end = pd.to_datetime('2016-01-02')
rd_target=random_dates(start, end, 1)
To see the overlap of the dates of my list with the target date "rd_target" I simply compare the "end" of my dataframe with the "start" of my rd_target, then the "start" of my dataframe with the "end" of the rd_target.
a_pd[a_pd["end"] > rd_target[0] & a_pd["start"]<rd_target[1] ]
But python doesn't like it. Here the error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-63-dab78e1a862e> in <module>
----> 1 a_pd[a_pd["end"] > rd[0] & a_pd["start"]<rd[1] ]
~/venv/lib/python3.8/site-packages/pandas/core/ops/common.py in new_method(self, other)
63 other = item_from_zerodim(other)
64
---> 65 return method(self, other)
66
67 return new_method
~/venv/lib/python3.8/site-packages/pandas/core/ops/__init__.py in wrapper(self, other)
392 rvalues = extract_array(other, extract_numpy=True)
393
--> 394 res_values = logical_op(lvalues, rvalues, op)
395 return self._construct_result(res_values, name=res_name)
396
~/venv/lib/python3.8/site-packages/pandas/core/ops/array_ops.py in logical_op(left, right, op)
339 if should_extension_dispatch(lvalues, rvalues):
340 # Call the method on lvalues
--> 341 res_values = op(lvalues, rvalues)
342
343 else:
~/venv/lib/python3.8/site-packages/pandas/core/ops/roperator.py in rand_(left, right)
50
51 def rand_(left, right):
---> 52 return operator.and_(right, left)
53
54
TypeError: unsupported operand type(s) for &: 'Timestamp' and 'DatetimeArray'
Any help please?
CodePudding user response:
If you want to have 2 elements in your rd_target
time range, you need to change the line where you initialise it to:
rd_target=random_dates(start, end, 2)
The way wrote it, rd_target
has only one element, you won't be able to access the index 1.
Moreover, you need brackets on the dataframe filter expression, like so:
a_pd[(a_pd["end"] > rd_target[0]) & (a_pd["start"] < rd_target[1])]
For more info, regarding opearator precedence, please visit the official docs.