Home > database >  How to get average between first row and current row per each group in data frame?
How to get average between first row and current row per each group in data frame?

Time:09-05

i have data frame like this,

id value
a 2
a 4
a 3
a 5
b 1
b 4
b 3
c 1
c nan
c 5

the resulted data frame contain new column ['average'] and to get its values will be:

  • make group-by(id)
  • first row in 'average' column per each group is equal to its corresponding value in 'value'
  • other rows in ' average' in group is equal to mean for all previous rows in 'value'(except current value)
  • for more explain in group of (a) the "average' of index

the resulted data frame must be :

id value average
a 2 2
a 4 2
a 3 3
a 5 3
b 1 1
b 4 1
b 3 2.5
c 1 1
c nan 1
c 5 1

CodePudding user response:

You can group the dataframe by id, then calculate the expanding mean for value column for each groups, then shift the expanding mean and get it back to the original dataframe, once you have it, you just need to ffill on axis=1 on for the value and average columns to get the first value for the categories:

out = (df
       .assign(average=df
               .groupby(['id'])['value']
               .transform(lambda x: x.expanding().mean().shift(1))
               )
       )
out[['value', 'average']] = out[['value', 'average']].ffill(axis=1)

OUTPUT:

  id  value  average
0  a    2.0      2.0
1  a    4.0      2.0
2  a    3.0      3.0
3  a    5.0      3.0
4  b    1.0      1.0
5  b    4.0      1.0
6  b    3.0      2.5
7  c    1.0      1.0
8  c    NaN      1.0
9  c    5.0      1.0

CodePudding user response:

Here is a solution which, I think, satisfies the requirements. Here, the first row in a group of ids is simply passing its value to the average column. For every other row, we take the average where the index is smaller than the current index.

You may want to specify how you want to handle the NaN values. In the below, I set them to None so that they are ignored.

import numpy as np
from numpy import average
import pandas as pd

df = pd.DataFrame([
    ['a', 2],
    ['a', 4],
    ['a', 3],
    ['a', 5],
    ['b', 1],
    ['b', 4],
    ['b', 3],
    ['c', 1],
    ['c', np.NAN],
    ['c', 5]
], columns=['id', 'value'])

# Replace the NaN value with None
df['value'] = df['value'].replace(np.nan, None)

id_groups = df.groupby(['id'])

id_level_frames = []

for group, frame in id_groups:
    print(group)

    # Resets the index for each id-level frame
    frame = frame.reset_index()

    for index, row in frame.iterrows():
        # If this is the first row:
        if index== 0:
            frame.at[index, 'average'] = row['value']

        else:
            current_index = index
            earlier_rows = frame[frame.index < index]
            frame.at[index, 'average'] = average(earlier_rows['value'])

    id_level_frames.append(frame)

    final_df = pd.concat(id_level_frames)
  • Related