I have this df
:
id started completed
1 2022-02-20 15:00:10.157 2022-02-20 15:05:10.044
and I have this other one data
:
timestamp x y
2022-02-20 14:59:47.329 16 0.0
2022-02-20 15:01:10.347 16 0.2
2022-02-20 15:06:35.362 16 0.3
what I wanna do is filter the rows in data
where timestamp > started and timestamp < completed
(which will leave me with the middle row only)
I tried to do it like this:
res = data[(data['timestamp'] > '2022-02-20 15:00:10.157')]
res = res[(res['timestamp'] > '2022-02-20 15:05:10.044')]
and it works.
But when I wanted to combine the two like this:
res = data[(data['timestamp'] > df['started']) and (data['timestamp'] < df['completed'])]
I get ValueError: Can only compare identically-labeled Series objects
Can anyone please explain why and where am I doing the mistake? Do I have to convert to string the df['started']
or something?
CodePudding user response:
To make the comparisons, Pandas need to have the same rows count in both the dataframes, that's because a comparison is made between the first row of the data['timestamp']
series and the first row of the df['started']
series, and so on.
The error is due to the second row of the data['timestamp']
series not having anything to compare with.
In order to make the code work, you can add for any row of data
, a row in df
to match against. In this way, Pandas will return a Boolean result for every row, and you can use the AND logical operator to get the results that are both True.
Pandas doesn't want Python's and
operator, so you need to use the &
operator, so your code will look like this:
data[(data['timestamp'] > df['started']) & (data['timestamp'] < df['completed'])]
CodePudding user response:
You have two issues here.
The first is the use of and
. If you want to combine multiple masks (boolean array) with a "and" logic element-wise, you want to use &
instead of and
.
Then, the use of df['started']
and df['completed']
for comparing. If you use a debugger, you can see that
df['started']
is a dataframe with its own indexes, the same for data['timestamp']
. The rule for comparing, two dataframes are described here. Essentially, you can compare only two dataframes with the same indexing. But here df
has only one row, data
multiple. Try convert your element from df
as a non dataframe format. Using loc
for instance.
For instance :
Using masks
n = 10
np.random.seed(0)
df = pd.DataFrame(
{
"x": np.random.choice(np.array([*ascii_lowercase]), size=n),
"y": np.random.normal(size=n),
}
)
df2 = pd.DataFrame(
{
"max_val" : [0],
"min_val" : [-0.5]
}
)
df[(df.y < df2.loc[0, 'max_val']) & (df.y > df2.loc[0, 'min_val'])]
Out[95]:
x y
2 v -0.055035
3 a -0.107310
5 d -0.097696
7 j -0.453056
8 t -0.470771
Using query
df.query("y < @df2.max_val.to_numpy() and y > @df2.min_val.to_numpy()")
Out[94]:
x y
2 v -0.055035
3 a -0.107310
5 d -0.097696
7 j -0.453056
8 t -0.470771