Number of rows between two dates

Let's say I have a pandas df with a Date column (datetime64[ns]):

          Date  rows_num
0   2020-01-01       NaN
1   2020-02-25       NaN
2   2020-04-23       NaN
3   2020-06-28       NaN
4   2020-08-17       NaN
5   2020-10-11       NaN
6   2020-12-06       NaN
7   2021-01-26       7.0
8   2021-03-17       7.0

I want to get a column (rows_num in the example above) holding the number of rows I need to go back to reach the current row's date minus 365 days (one year earlier).

So, in the example above, for index 7 (date 2021-01-26) I want to know how many rows back the date 2020-01-26 is.

If an exact match is not available (as in the example df), I want to use the closest available date (the closest earlier or later date is also fine: it doesn't really matter in my case).

Any idea? Thanks

CodePudding user response：

Edited to reflect OP's original question. I created a demo dataframe and added a row_count column to hold the count. Then, for each row, a filter selects all rows from that row's date up to (but not including) the date 365 days later. The shape[0] of the filtered dataframe is the number of rows in that window, and it is stored in the row_count field for that row.

# Import Pandas package
import pandas as pd
from datetime import datetime, timedelta

# Create a sample dataframe
df = pd.DataFrame({'num_posts': [4, 6, 3, 9, 1, 14, 2, 5, 7, 2],
                   'date' : ['2020-08-09', '2020-08-25', '2020-09-05',
                            '2020-09-12', '2020-09-29', '2020-10-15',
                            '2020-11-21', '2020-12-02', '2020-12-10',
                            '2020-12-18']})

#create the column for the row count:
df.insert(2, "row_count", '')

# Convert the date to datetime64
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

for row in range(len(df)):
    start_date = df['date'].iloc[row]
    end_date = start_date + timedelta(days=365)  # set the end date for the filter

    # Filter data between the two dates (start inclusive, end exclusive)
    filtered_df = df.loc[(df['date'] >= start_date) & (df['date'] < end_date)]

    # fill in the row_count column with the number of rows returned by the filter;
    # .loc avoids the chained assignment df['row_count'][row] = ...
    df.loc[row, 'row_count'] = filtered_df.shape[0]
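
If the loop gets slow on a large frame, the same window count can probably be done without iterating. The following is only a sketch, assuming the dates are sorted in ascending order; it swaps the per-row filter for np.searchsorted, and the names values, window_end, start_pos and end_pos are mine, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Same sample dates as the demo dataframe (assumed sorted ascending)
dates = pd.Series(pd.to_datetime([
    "2020-08-09", "2020-08-25", "2020-09-05", "2020-09-12", "2020-09-29",
    "2020-10-15", "2020-11-21", "2020-12-02", "2020-12-10", "2020-12-18",
]))

values = dates.to_numpy()
window_end = (dates + pd.Timedelta(days=365)).to_numpy()

# Position of the first row with date >= start of each window
start_pos = np.searchsorted(values, values, side="left")
# Position of the first row with date >= end of each window (end is exclusive)
end_pos = np.searchsorted(values, window_end, side="left")

# Number of rows with start <= date < start + 365 days, one value per row
row_count = end_pos - start_pos
```

Since every sample date falls within 365 days of the first one, each window runs to the end of the frame, so the count simply decreases by one per row.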


CodePudding user response：

You can use pd.merge_asof, which performs the exact nearest-match lookup you describe. You can even choose to use backward (smaller), forward (larger), or nearest search types.

# setup
from io import StringIO

import pandas as pd

text = StringIO(
    """
      Date
2020-01-01
2020-02-25
2020-04-23
2020-06-28
2020-08-17
2020-10-11
2020-12-06
2021-01-26
2021-03-17
"""
)

# sep=r"\s+" splits on whitespace; delim_whitespace is deprecated in recent pandas
data = pd.read_csv(text, sep=r"\s+", parse_dates=["Date"])

# calculate the reference date from 1 year (365 days) ago
one_year_ago = data["Date"] - pd.Timedelta("365D")

# we only care about the index values for the original and matched dates
merged = pd.merge_asof(
    one_year_ago.reset_index(),
    data.reset_index(),
    on="Date",
    suffixes=("_original", "_matched"),
    direction="backward",
)

data["rows_num"] = merged["index_original"] - merged["index_matched"]

Result:

          Date  rows_num
0   2020-01-01       NaN
1   2020-02-25       NaN
2   2020-04-23       NaN
3   2020-06-28       NaN
4   2020-08-17       NaN
5   2020-10-11       NaN
6   2020-12-06       NaN
7   2021-01-26       7.0
8   2021-03-17       7.0
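
To see how the direction choice changes the answer, here is a small sketch that wraps the same merge in a function and runs it once per direction. The helper name rows_back is hypothetical (it is not a pandas function), and it assumes the dates are sorted ascending, which merge_asof requires anyway.

```python
import pandas as pd


def rows_back(dates: pd.Series, direction: str = "backward") -> pd.Series:
    """Hypothetical helper: rows between each date and its match 365 days earlier."""
    frame = dates.rename("Date").reset_index(drop=True).to_frame()
    # Reference date for every row, keeping the original position as a column
    target = (frame["Date"] - pd.Timedelta(days=365)).reset_index()
    merged = pd.merge_asof(
        target,
        frame.reset_index(),
        on="Date",
        suffixes=("_original", "_matched"),
        direction=direction,
    )
    return merged["index_original"] - merged["index_matched"]


dates = pd.Series(pd.to_datetime([
    "2020-01-01", "2020-02-25", "2020-04-23", "2020-06-28", "2020-08-17",
    "2020-10-11", "2020-12-06", "2021-01-26", "2021-03-17",
]))

backward = rows_back(dates, "backward")
forward = rows_back(dates, "forward")
nearest = rows_back(dates, "nearest")
```

With "backward", the first seven rows stay NaN because nothing in the frame is old enough. With "forward" or "nearest", those rows match the very first date instead, so they get a number even though that date is far from one year earlier. If that is not wanted, merge_asof also accepts a tolerance argument (a pd.Timedelta) that limits how far away a match may be.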