Comparing and dropping columns based on greater or smaller timestamp-CodePudding

I have this df:

id        started                    completed
1         2022-02-20 15:00:10.157    2022-02-20 15:05:10.044

and I have this other one data:

        timestamp                   x       y
    2022-02-20 14:59:47.329     16      0.0
    2022-02-20 15:01:10.347     16      0.2
    2022-02-20 15:06:35.362     16      0.3

what I wanna do is filter the rows in data where timestamp > started and timestamp < completed (which will leave me with the middle row only)

I tried to do it like this:

res = data[(data['timestamp'] > '2022-02-20 15:00:10.157')]
res = res[(res['timestamp'] > '2022-02-20 15:05:10.044')]

and it works.

But when I wanted to combine the two like this:

res = data[(data['timestamp'] > df['started']) and (data['timestamp'] < df['completed'])]

I get ValueError: Can only compare identically-labeled Series objects

Can anyone please explain why and where am I doing the mistake? Do I have to convert to string the df['started'] or something?

CodePudding user response：

To make the comparisons, Pandas need to have the same rows count in both the dataframes, that's because a comparison is made between the first row of the data['timestamp'] series and the first row of the df['started'] series, and so on.

The error is due to the second row of the data['timestamp'] series not having anything to compare with.

In order to make the code work, you can add for any row of data, a row in df to match against. In this way, Pandas will return a Boolean result for every row, and you can use the AND logical operator to get the results that are both True.

Pandas doesn't want Python's and operator, so you need to use the & operator, so your code will look like this:

data[(data['timestamp'] > df['started']) & (data['timestamp'] < df['completed'])]

CodePudding user response：

You have two issues here.

The first is the use of and. If you want to combine multiple masks (boolean array) with a "and" logic element-wise, you want to use & instead of and.

Then, the use of df['started'] and df['completed'] for comparing. If you use a debugger, you can see that df['started'] is a dataframe with its own indexes, the same for data['timestamp']. The rule for comparing, two dataframes are described here. Essentially, you can compare only two dataframes with the same indexing. But here df has only one row, data multiple. Try convert your element from df as a non dataframe format. Using loc for instance.

For instance :

Using masks

n = 10
np.random.seed(0)
df = pd.DataFrame(
    {
        "x": np.random.choice(np.array([*ascii_lowercase]), size=n),
        "y": np.random.normal(size=n),
    }
)
df2 = pd.DataFrame(
    {
        "max_val" : [0],
        "min_val" : [-0.5]
    }
)

df[(df.y < df2.loc[0, 'max_val']) & (df.y > df2.loc[0, 'min_val'])]
Out[95]: 
   x         y
2  v -0.055035
3  a -0.107310
5  d -0.097696
7  j -0.453056
8  t -0.470771

Using query

df.query("y < @df2.max_val.to_numpy() and y > @df2.min_val.to_numpy()")
Out[94]: 
   x         y
2  v -0.055035
3  a -0.107310
5  d -0.097696
7  j -0.453056
8  t -0.470771