Let's say I have a pandas df with a Date column (datetime64[ns]):
        Date  rows_num
0 2020-01-01       NaN
1 2020-02-25       NaN
2 2020-04-23       NaN
3 2020-06-28       NaN
4 2020-08-17       NaN
5 2020-10-11       NaN
6 2020-12-06       NaN
7 2021-01-26       7.0
8 2021-03-17       7.0
I want to get a column (rows_num in the above example) with the number of rows I need to go back to find the current row's date minus 365 days (1 year before).
So, in the above example, for index 7 (date 2021-01-26) I want to know how many rows back I can find the date 2020-01-26.
If an exact match is not available (as in the example df), I should reference the closest available date (or the closest smaller/larger date; it doesn't really matter in my case).
Any idea? Thanks
CodePudding user response:
Edited to reflect OP's original question. I created a demo dataframe and added a row_count column to hold the result. Then, for each row, a filter grabs all rows from that row's date up to (but not including) 365 days later. The shape[0] of the filtered dataframe is the number of rows in that window, and it is written into row_count for that row.
# Import Pandas package
import pandas as pd
from datetime import datetime, timedelta
# Create a sample dataframe
df = pd.DataFrame({'num_posts': [4, 6, 3, 9, 1, 14, 2, 5, 7, 2],
                   'date': ['2020-08-09', '2020-08-25', '2020-09-05',
                            '2020-09-12', '2020-09-29', '2020-10-15',
                            '2020-11-21', '2020-12-02', '2020-12-10',
                            '2020-12-18']})
#create the column for the row count:
df.insert(2, "row_count", '')
# Convert the date to datetime64
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
for row in range(len(df['date'])):
    start_date = df['date'].iloc[row]
    end_date = start_date + timedelta(days=365)  # set the end date for the filter
    # Filter data between the two dates
    filtered_df = df.loc[(df['date'] >= start_date) & (df['date'] < end_date)]
    # Fill in the row_count column with the number of rows returned by the filter
    # (.loc avoids the chained assignment df['row_count'][row] = ...)
    df.loc[df.index[row], 'row_count'] = filtered_df.shape[0]
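For what it's worth, the same per-row count can be obtained without a Python loop. This is only a minimal sketch, assuming the dates are already sorted ascending (as in the demo data): numpy.searchsorted locates where each 365-day window starts and ends, and the difference is the number of rows in the window. The names dates, window_start, window_end and row_count are illustrative, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Same demo dates as above; assumed to be sorted ascending.
dates = pd.Series(pd.to_datetime([
    "2020-08-09", "2020-08-25", "2020-09-05", "2020-09-12", "2020-09-29",
    "2020-10-15", "2020-11-21", "2020-12-02", "2020-12-10", "2020-12-18",
]))
values = dates.to_numpy()

# Position of the first row with date >= window start (the row's own date).
window_start = np.searchsorted(values, values, side="left")
# Position of the first row with date >= window end (date + 365 days),
# which is excluded from the window, matching the `< end_date` filter.
window_end = np.searchsorted(values, values + np.timedelta64(365, "D"), side="left")

# Number of rows in [date, date + 365 days) for every row.
row_count = pd.Series(window_end - window_start, index=dates.index)
print(row_count)
```

Because all demo dates fall within one year of each other, every window runs to the end of the frame, so the counts simply decrease by one per row.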
CodePudding user response:
You can use pd.merge_asof, which performs the kind of nearest-match lookup you describe. Its direction parameter lets you choose a backward (closest smaller or equal date), forward (closest larger or equal date), or nearest search.
# setup
from io import StringIO
import pandas as pd
text = StringIO(
"""
Date
2020-01-01
2020-02-25
2020-04-23
2020-06-28
2020-08-17
2020-10-11
2020-12-06
2021-01-26
2021-03-17
"""
)
# sep=r"\s+" splits on whitespace (delim_whitespace is deprecated in recent pandas versions)
data = pd.read_csv(text, sep=r"\s+", parse_dates=["Date"])
# calculate the reference date from 1 year (365 days) ago
one_year_ago = data["Date"] - pd.Timedelta("365D")
# we only care about the index values for the original and matched dates
merged = pd.merge_asof(
    one_year_ago.reset_index(),
    data.reset_index(),
    on="Date",
    suffixes=("_original", "_matched"),
    direction="backward",
)
data["rows_num"] = merged["index_original"] - merged["index_matched"]
Result:
        Date  rows_num
0 2020-01-01       NaN
1 2020-02-25       NaN
2 2020-04-23       NaN
3 2020-06-28       NaN
4 2020-08-17       NaN
5 2020-10-11       NaN
6 2020-12-06       NaN
7 2021-01-26       7.0
8 2021-03-17       7.0
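Since the question says the search direction doesn't matter much, it may help to see the three options side by side. The following is a rough sketch on the same dates; the helper rows_back and the comparison frame are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Same dates as in the question, built directly instead of via read_csv.
data = pd.DataFrame({"Date": pd.to_datetime([
    "2020-01-01", "2020-02-25", "2020-04-23", "2020-06-28", "2020-08-17",
    "2020-10-11", "2020-12-06", "2021-01-26", "2021-03-17",
])})

# Reference dates 365 days earlier, keeping the original row positions.
target = (data["Date"] - pd.Timedelta("365D")).reset_index()


def rows_back(direction):
    """Rows between each row and its matched date (hypothetical helper)."""
    merged = pd.merge_asof(
        target,
        data.reset_index(),
        on="Date",
        suffixes=("_original", "_matched"),
        direction=direction,
    )
    return merged["index_original"] - merged["index_matched"]


# One column per search direction, for comparison.
comparison = pd.DataFrame(
    {d: rows_back(d) for d in ("backward", "forward", "nearest")}
)
print(comparison)
```

With these dates, backward leaves the first seven rows as NaN because no earlier date exists, while forward and nearest fall back to the first row of the frame for those rows. Which behaviour is preferable depends on whether a missing year of history should be flagged or silently approximated.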