Home > Mobile >  Pandas find mean of values for specific time period
Pandas find mean of values for specific time period

Time:12-02

This is my first time posting here and I hope I am doing it right. I have searched the forum for my question, and even though there are related topics, none have addressed my specific question, so I hope you can help me.

I have a data frame with two columns that simplified looks like this:

    Date         Value
0   20010425     1
1   20010112     4
2   20010308     3
3   20010527     5
4   20010620     2

The first column represents dates in the yyyymmdd format, but is int64. Now I want to create an additional column, that takes the mean of the value of the previous 3 months individually for each row:

      Date       Value     Mean_Value_past_3_months
0   20010425     1         3.5                      #(4 3)/2
1   20010130     4         NaN                      # since no date earlier than 20010130 is available 
2   20010308     3         4                        
3   20010527     5         2                        # (1 3)/2
4   20010620     2         3                        # (1 5)/2

And then I would like to drop NaN, so it does not count them moving forward.

I have tried to transpose the dates and then do, which is something I have found on here:

s = subset.columns[0:].values < df.values[:,None]
df['mean'] = (subset.iloc[:,0:]*t).mean(1)

but it a) it does not work yet, and b) does not include the 3 months period.

Thanks for your help!!

CodePudding user response:

It's a bit ambiguous how you want to define the "previous three months" but what you can do is the following:

import pandas as pd

df = pd.DataFrame({"Date": [20010425, 20010112, 20010308, 20010527, 20010620],
                   "Value": [1, 4, 3, 5, 2]
                   })
# define a custom function that computes the mean of the last three months
def get_mean(row):
    # choose the correct dates here using the passed row parameter:
    # e.g.
    yyyymmdd_start = row["Date"] - 300
    yyyymmdd_end = row["Date"]

    selected_dates = df[(df['Date'] > yyyymmdd_start) & (df['Date'] < yyyymmdd_end)]

    return selected_dates["Value"].mean()

df["Mean_Value_past_3_months"] = df.apply(get_mean, axis=1)

print(df)
Out[115]: 
       Date  Value  Mean_Value_past_3_months
0  20010425      1                       3.0
1  20010112      4                       NaN
2  20010308      3                       4.0
3  20010527      5                       2.0
4  20010620      2                       3.0

CodePudding user response:

Another approach could be to convert the integer dates to datetime and use pandas.DataFrame.rolling capabilities like so:

import pandas as pd

df['Date'] = pd.to_datetime(dates, format='%Y%m%d')

# Sorting by Date, because rolling() need monotonic dates
df = df.sort_values('Date')

# Using approximately 30 days per month -> 90 day for 3 months, hence '90D'
df['Means'] = df.rolling('90D', on='Date', closed='left').mean()['Value']
print(df)

This will result in:

        Date  Value  Means
1 2001-01-30      4    NaN
2 2001-03-08      3    4.0
0 2001-04-25      1    3.5
3 2001-05-27      5    2.0
4 2001-06-20      2    3.0
  • Related