I have a dataframe `df` and a list of floats `T`. `df.B` is a time series of values ordered so that the 0th index is the most recent timestamp and the last index is the oldest timestamp.
df = pd.DataFrame({'A': [1.1, 2.2, 3.3, 4.4], 'B': [5.5, 6.6, 7.7, 8.8]})
T = [t1, t2, ..., tn] # floats
What I am looking to do:
I would like to compare the values of column `B` against the list of values `T`, one `t` at a time, and return the first row of `df` that satisfies the condition against `t`. By "first row of `df`" I mean walking through the time series (from the oldest entry, i.e. the last index, towards the most recent) and finding the first instance in time where the value in `df.B` becomes larger than `t`, for each `t` in `T`.
What I've attempted:
df.loc[df.apply(lambda x: x.B >= T, axis=1)]
# => TypeError: unhashable type: 'numpy.ndarray'
df2 = df.query('B >= @T')
# => 'Lengths must match to compare'
[ df[df['B'] >= t] for t in T ]
# => Technically this works and then I can iterate again to retrieve the first row, but I get the warning -- pydevd warning: Computing repr of a (list) was slow
EDIT, an example:
T = [3.5, 4.5, 8.0, 8.5, 10.0, 11.0]
df.B = [5.5, 8.8, 6.6, 7.7]
# I expect the output to contain the rows corresponding to the following values in `df.B`:
[7.7, 7.7, 8.8, 8.8, None, None]
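To make the intent concrete, here is roughly the brute-force logic I have in mind (just a sketch; `first_match` is only an illustrative name):
import pandas as pd

df = pd.DataFrame({'A': [1.1, 2.2, 3.3, 4.4], 'B': [5.5, 8.8, 6.6, 7.7]})
T = [3.5, 4.5, 8.0, 8.5, 10.0, 11.0]

def first_match(df, t):
    # walk from the oldest row (last index) towards the most recent (index 0)
    for i in df.index[::-1]:
        if df.loc[i, 'B'] > t:
            return df.loc[i]
    return None

rows = [first_match(df, t) for t in T]
[r.B if r is not None else None for r in rows]
# [7.7, 7.7, 8.8, 8.8, None, None]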
CodePudding user response:
You can sort the values in column `B` and then use `numpy.searchsorted`. For each `t` this picks out the smallest value of `B` that is at least `t`; appending `np.nan` to the sorted values provides a fallback for any `t` larger than every value in `B`:
import numpy as np
sorted_values = np.sort(df.B)
np.append(sorted_values, np.nan)[np.searchsorted(sorted_values, T)]
# with df.B = [5.5, 6.6, 7.7, 8.8] and T = [3.5, 4.5, 8.0, 8.5, 10.0, 11.0]:
# array([5.5, 5.5, 8.8, 8.8, nan, nan])
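If the comparison should be strictly greater than `t` (the question says "larger than") rather than greater-or-equal, `searchsorted` also accepts `side='right'`, which skips exact ties; with these example values the result is identical because no `t` equals a value in `B`:
np.append(sorted_values, np.nan)[np.searchsorted(sorted_values, T, side='right')]
# array([5.5, 5.5, 8.8, 8.8, nan, nan])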
To get DataFrame rows, first find the indices (using `df.B` directly works here because the original `df.B` is already in ascending order):
indices = np.searchsorted(df.B, T)
indices
# array([0, 0, 3, 3, 4, 4])
Then retrieve the corresponding rows, returning None for any index that falls past the end of the frame:
[df.iloc[i] if i < len(df) else None for i in indices]
[A    1.1
B    5.5
Name: 0, dtype: float64, A    1.1
B    5.5
Name: 0, dtype: float64, A    4.4
B    8.8
Name: 3, dtype: float64, A    4.4
B    8.8
Name: 3, dtype: float64, None, None]
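Note that `searchsorted` requires the array it searches to be sorted; `df.B` in the original frame happens to be ascending, but the edited example ([5.5, 8.8, 6.6, 7.7]) is not. A sketch for that case, under the same smallest-value-at-least-`t` interpretation, is to search the sorted values and map the positions back to the original rows with `argsort`:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.1, 2.2, 3.3, 4.4], 'B': [5.5, 8.8, 6.6, 7.7]})
T = [3.5, 4.5, 8.0, 8.5, 10.0, 11.0]

order = np.argsort(df.B.values)   # positions of B in ascending order
sorted_b = df.B.values[order]     # array([5.5, 6.6, 7.7, 8.8])
pos = np.searchsorted(sorted_b, T)
[df.iloc[order[p]] if p < len(df) else None for p in pos]
# rows with B = 5.5, 5.5, 8.8, 8.8, then None, None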