Pandas Query for date-CodePudding

I was looking through the pandas.query documentation but couldn't find anything specific about this. Is it possible to perform a query on a date based on the closest date to the one given, instead of a specific date?

For example lets say we use the wine dataset and creates some random dates.

    import pandas as pd
    import numpy as np
    from sklearn import datasets
    dir(datasets)
    df = pd.DataFrame(datasets.load_wine().data)
    df.columns = datasets.load_wine().feature_names
    df.columns=df.columns.str.strip()
    
    
    
    
    def random_dates(start, end, n, unit='D'):
        ndays = (end - start).days   1
        return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit)   start
    
    
    
    np.random.seed(0)
    start = pd.to_datetime('2015-01-01')
    end = pd.to_datetime('2022-01-01')
    datelist=random_dates(start, end, 178)

    df['Dates'] = datelist

if you perform a simple query on hue

df.query('hue == 0.6')

you'll receive three rows with three random dates. Is it possible to pick the query result that's closest to let's say 2017-1-1?

so something like

df.query('hue==0.6').query('Date ~2017-1-1')

I hope this makes sense!

CodePudding user response：

Given a series, find the entries closest to a given date:

def closest_to_date(series, date, n=5):
    date = pd.to_datetime(date)
    return abs(series - date).nsmallest(n)

Then we can use the index of the returned series to select further rows (or you change the api to suit you):

(df.loc[df.hue == 0.6]
 .loc[lambda df_: closest_to_date(df_.Dates, "2017-1-1", n=1).index]
)

CodePudding user response：

I'm not sure if you have to use query, but this will give you the results you are looking for

df['Count'] = (df[df['hue'] == .6].sort_values(['Dates'], ascending=True)).groupby(['hue']).cumcount()   1
df.loc[df['Count'] == 1]

CodePudding user response：

You can use something like:

df.query("('2018-01-01' < Dates) & (Dates < '2018-01-31')")

# Output
     alcohol  malic_acid  ...  proline                         Dates
6      14.39        1.87  ...   1290.0 2018-01-24 08:21:14.665824000
41     13.41        3.84  ...   1035.0 2018-01-22 22:15:56.547561600
51     13.83        1.65  ...   1265.0 2018-01-26 22:37:26.812156800
131    12.88        2.99  ...    530.0 2018-01-01 18:58:05.118441600
139    12.84        2.96  ...    590.0 2018-01-08 13:38:26.117376000
142    13.52        3.17  ...    520.0 2018-01-19 22:37:10.170825600

[6 rows x 14 columns]

Or using @variables:

date = pd.to_datetime('2018-01-01')
offset = pd.DateOffset(days=10)
start = date - offset
end = date   offset
df.query("Dates.between(@start, @end)")

# Output
     alcohol  malic_acid  ...  proline                         Dates
131    12.88        2.99  ...    530.0 2018-01-01 18:58:05.118441600
139    12.84        2.96  ...    590.0 2018-01-08 13:38:26.117376000