I was looking through the pandas.query documentation but couldn't find anything specific about this. Is it possible to perform a query on a date based on the closest date to the one given, instead of a specific date?
For example lets say we use the wine dataset and creates some random dates.
import pandas as pd
import numpy as np
from sklearn import datasets
dir(datasets)
df = pd.DataFrame(datasets.load_wine().data)
df.columns = datasets.load_wine().feature_names
df.columns=df.columns.str.strip()
def random_dates(start, end, n, unit='D'):
ndays = (end - start).days 1
return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit) start
np.random.seed(0)
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2022-01-01')
datelist=random_dates(start, end, 178)
df['Dates'] = datelist
if you perform a simple query on hue
df.query('hue == 0.6')
you'll receive three rows with three random dates. Is it possible to pick the query result that's closest to let's say 2017-1-1?
so something like
df.query('hue==0.6').query('Date ~2017-1-1')
I hope this makes sense!
CodePudding user response:
Given a series, find the entries closest to a given date:
def closest_to_date(series, date, n=5):
date = pd.to_datetime(date)
return abs(series - date).nsmallest(n)
Then we can use the index of the returned series to select further rows (or you change the api to suit you):
(df.loc[df.hue == 0.6]
.loc[lambda df_: closest_to_date(df_.Dates, "2017-1-1", n=1).index]
)
CodePudding user response:
I'm not sure if you have to use query
, but this will give you the results you are looking for
df['Count'] = (df[df['hue'] == .6].sort_values(['Dates'], ascending=True)).groupby(['hue']).cumcount() 1
df.loc[df['Count'] == 1]
CodePudding user response:
You can use something like:
df.query("('2018-01-01' < Dates) & (Dates < '2018-01-31')")
# Output
alcohol malic_acid ... proline Dates
6 14.39 1.87 ... 1290.0 2018-01-24 08:21:14.665824000
41 13.41 3.84 ... 1035.0 2018-01-22 22:15:56.547561600
51 13.83 1.65 ... 1265.0 2018-01-26 22:37:26.812156800
131 12.88 2.99 ... 530.0 2018-01-01 18:58:05.118441600
139 12.84 2.96 ... 590.0 2018-01-08 13:38:26.117376000
142 13.52 3.17 ... 520.0 2018-01-19 22:37:10.170825600
[6 rows x 14 columns]
Or using @variables:
date = pd.to_datetime('2018-01-01')
offset = pd.DateOffset(days=10)
start = date - offset
end = date offset
df.query("Dates.between(@start, @end)")
# Output
alcohol malic_acid ... proline Dates
131 12.88 2.99 ... 530.0 2018-01-01 18:58:05.118441600
139 12.84 2.96 ... 590.0 2018-01-08 13:38:26.117376000