Pandas get mean value within relative time range by another value-CodePudding

I am working with a DataFrame with a DatetimeIndex and two additional columns, A and B, and trying to provide an output DataFrame to answer a question like:

Determine the average value B for each A in months 6-12 after earliest occurrence of A

I have been working with pd.Grouper and understand how to group DateTime index in buckets (e.g. df.groupby(pd.Grouper(freq='M')).mean()), but am unclear as to how to compute the average over some period since the earliest observation for each value A in the dataset.

The input DataFrame looks something like:

data = {
    'A': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'B': [10, 32, 12, 13, 24, 32, 12, 72, 90],
    'created_on': [
        '2018-01-31', 
        '2019-02-25', 
        '2018-02-12', 
        '2019-05-31', 
        '2021-03-12', 
        '2020-04-23', 
        '2016-01-11', 
        '2016-05-02', 
        '2018-12-31',
    ]
}

df = pd.DataFrame(data)
df = df.set_index(pd.to_datetime(df['created_on']))
df.drop(['created_on'], axis=1, inplace=True)

This generates a DataFrame like the following:


 ------------ --- ---- 
| created_on | A | B  |
 ------------ --- ---- 
| 2018-01-31 | x | 10 |
| 2019-02-25 | x | 32 |
| 2019-05-31 | x | 13 |
| 2021-03-12 | y | 24 |
| 2016-05-02 | y | 72 |
| ...        | . | .. |
 ------------ --- ----

The target is a desired output shaped like:

 --- ---------------------------------------------- 
| A | avg_B_6_12_months_after_earliest_observation |
 --- ---------------------------------------------- 
| x |                                         12.2 |
| y |                                         18.1 |
 --- ----------------------------------------------

The values in the avg_B_6_12_months_after_earliest_observation column above are for example only, they do not correlate to the values provided in the example input DataFrame.

CodePudding user response：

One idea is to use groupby.transform with idxmin to align per row the date of the first occurrence of each element in A. Then you can compare the index with the value of the first occurrence adding 6 or 12 months. Use this in a loc to select the wanted rows, groupby and mean.

# working dummy example
df = pd.DataFrame(
    {'A': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
     'B': [10, 32, 12, 13, 24, 32, 12, 72, 90],},
    index = pd.to_datetime(
        ['2018-01-31', '2018-02-25', '2018-08-12', 
         '2018-10-31', '2021-03-12', '2016-10-23', 
         '2016-01-11', '2016-05-02', '2016-12-31'])
)
print(df)

# helper series with aligned index and idxmin per A
s = df.groupby(['A'])['B'].transform('idxmin')
print(s)
# 2018-01-31   2018-01-31
# 2018-02-25   2018-01-31 # here first date if x align with this row
# 2018-08-12   2018-01-31
# 2018-10-31   2018-01-31
# 2021-03-12   2016-01-11
# 2016-10-23   2016-01-11
# 2016-01-11   2016-01-11
# 2016-05-02   2016-01-11
# 2016-12-31   2016-01-11
# Name: B, dtype: datetime64[ns]

Now you can get the result

res = (
    # select rows with date in the 6-12 months after 1rst occurrence
    df.loc[(s.index>=s pd.DateOffset(months=6)) 
          & (s.index<=s pd.DateOffset(months=12))]
       # groupby and mean
      .groupby('A')['B'].mean()
      # cosmetic to fit expected output
      .rename('avg_B_6_12_months_after_earliest_observation')
      .reset_index()
)
print(res)
#    A  avg_B_6_12_months_after_earliest_observation
# 0  x                                          12.5
# 1  y                                          61.0

CodePudding user response：

IIUC, you can define a custom function and apply it to pandas.DataFrame.groupby:

def filtersum(data):
    data = data.iloc[1:]
    ind = data.index[data.index.month.to_series().between(6, 12)]
    return data.loc[ind, "B"].mean()

new_df = df.sort_index().groupby("A", as_index=False).apply(filtersum)
print(new_df)

Output:

   A   NaN
0  x   NaN
1  y  90.0

Logic:

data.iloc[1:]: Excludes the first observation from the calculation
data.index[data.index.to_series().between(6, 12)]: filters the indices if their months are between 6 and 12 (inclusive).
df.sort_index().groupby: data must be sorted by their indices so that excluded first observation is indeed the chronological first.

Note (based on the sample data):

Customer x didn't spend anything between June and December:

            A   B
created_on       
2018-01-31  x  10
2018-02-12  x  12
2019-02-25  x  32
2019-05-31  x  13

Customer y only spent once in 2018-12-31:

            A   B
created_on       
2016-01-11  y  12
2016-05-02  y  72
2018-12-31  y  90 <<<
2020-04-23  y  32
2021-03-12  y  24