This is my first time posting here and I hope I am doing it right. I have searched the forum for my question, and even though there are related topics, none have addressed my specific question, so I hope you can help me.
I have a data frame with two columns that simplified looks like this:
Date Value
0 20010425 1
1 20010112 4
2 20010308 3
3 20010527 5
4 20010620 2
The first column represents dates in the yyyymmdd format, but is int64. Now I want to create an additional column, that takes the mean of the value of the previous 3 months individually for each row:
Date Value Mean_Value_past_3_months
0 20010425 1 3.5 #(4 3)/2
1 20010130 4 NaN # since no date earlier than 20010130 is available
2 20010308 3 4
3 20010527 5 2 # (1 3)/2
4 20010620 2 3 # (1 5)/2
And then I would like to drop NaN, so it does not count them moving forward.
I have tried to transpose the dates and then do, which is something I have found on here:
s = subset.columns[0:].values < df.values[:,None]
df['mean'] = (subset.iloc[:,0:]*t).mean(1)
but it a) it does not work yet, and b) does not include the 3 months period.
Thanks for your help!!
CodePudding user response:
It's a bit ambiguous how you want to define the "previous three months" but what you can do is the following:
import pandas as pd
df = pd.DataFrame({"Date": [20010425, 20010112, 20010308, 20010527, 20010620],
"Value": [1, 4, 3, 5, 2]
})
# define a custom function that computes the mean of the last three months
def get_mean(row):
# choose the correct dates here using the passed row parameter:
# e.g.
yyyymmdd_start = row["Date"] - 300
yyyymmdd_end = row["Date"]
selected_dates = df[(df['Date'] > yyyymmdd_start) & (df['Date'] < yyyymmdd_end)]
return selected_dates["Value"].mean()
df["Mean_Value_past_3_months"] = df.apply(get_mean, axis=1)
print(df)
Out[115]:
Date Value Mean_Value_past_3_months
0 20010425 1 3.0
1 20010112 4 NaN
2 20010308 3 4.0
3 20010527 5 2.0
4 20010620 2 3.0
CodePudding user response:
Another approach could be to convert the integer dates to datetime and use pandas.DataFrame.rolling capabilities like so:
import pandas as pd
df['Date'] = pd.to_datetime(dates, format='%Y%m%d')
# Sorting by Date, because rolling() need monotonic dates
df = df.sort_values('Date')
# Using approximately 30 days per month -> 90 day for 3 months, hence '90D'
df['Means'] = df.rolling('90D', on='Date', closed='left').mean()['Value']
print(df)
This will result in:
Date Value Means
1 2001-01-30 4 NaN
2 2001-03-08 3 4.0
0 2001-04-25 1 3.5
3 2001-05-27 5 2.0
4 2001-06-20 2 3.0