Home > Mobile >  pandas select rows on date
pandas select rows on date

Time:10-24

I am having huge problems with the time format in pandas and python. It seems that at some level there is some incompatibility that I cannot solve...

I am trying to select lines, from a panda dataframe that overlap with a given date.

Let's initialize a random data frame:

import random
import pandas as pd
import numpy as np
from datetime import timedelta

def random_dates(start, end, n, unit='D', seed=None):
    if not seed:  # from piR's answer
        np.random.seed(0)

    ndays = (end - start).days   1
    return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit)   start

np.random.seed(0)
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
rd=random_dates(start, end, 1000)
end_date = (rd timedelta(days=4))
a_pd=pd.DataFrame({"idx":np.random.choice(50, 1000), "start":rd, "end":end_date})

Now I create a single interval date using the same function I used to "rd".

start = pd.to_datetime('2016-01-01')
end = pd.to_datetime('2016-01-02')
rd_target=random_dates(start, end, 1)

To see the overlap of the dates of my list with the target date "rd_target" I simply compare the "end" of my dataframe with the "start" of my rd_target, then the "start" of my dataframe with the "end" of the rd_target.

a_pd[a_pd["end"] > rd_target[0]  & a_pd["start"]<rd_target[1] ]

But python doesn't like it. Here the error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-63-dab78e1a862e> in <module>
----> 1 a_pd[a_pd["end"] > rd[0]  & a_pd["start"]<rd[1] ]

~/venv/lib/python3.8/site-packages/pandas/core/ops/common.py in new_method(self, other)
     63         other = item_from_zerodim(other)
     64 
---> 65         return method(self, other)
     66 
     67     return new_method

~/venv/lib/python3.8/site-packages/pandas/core/ops/__init__.py in wrapper(self, other)
    392         rvalues = extract_array(other, extract_numpy=True)
    393 
--> 394         res_values = logical_op(lvalues, rvalues, op)
    395         return self._construct_result(res_values, name=res_name)
    396 

~/venv/lib/python3.8/site-packages/pandas/core/ops/array_ops.py in logical_op(left, right, op)
    339     if should_extension_dispatch(lvalues, rvalues):
    340         # Call the method on lvalues
--> 341         res_values = op(lvalues, rvalues)
    342 
    343     else:

~/venv/lib/python3.8/site-packages/pandas/core/ops/roperator.py in rand_(left, right)
     50 
     51 def rand_(left, right):
---> 52     return operator.and_(right, left)
     53 
     54 

TypeError: unsupported operand type(s) for &: 'Timestamp' and 'DatetimeArray'

Any help please?

CodePudding user response:

If you want to have 2 elements in your rd_target time range, you need to change the line where you initialise it to:

rd_target=random_dates(start, end, 2)

The way wrote it, rd_target has only one element, you won't be able to access the index 1.

Moreover, you need brackets on the dataframe filter expression, like so:

a_pd[(a_pd["end"] > rd_target[0]) & (a_pd["start"] < rd_target[1])]

For more info, regarding opearator precedence, please visit the official docs.

  • Related