Series of operations on each group of a DataFrame

Time:07-27

I have the following dataframe:

import numpy as np
import pandas as pd

group = np.repeat(['A','B'],10)
number = np.tile(range(0,10),2)

df = pd.DataFrame({
    'group': group,
    'number' : number,
    'value' : np.random.rand(len(number))
})

and I want to create a new column where I perform a series of operations for each group, but I'm running into all sorts of problems and my code is looking very clumsy.

The end goal is the following:

  • For each group and number 0, df['New'] = 1, or any other constant number K
  • For each group and number 1 to 9, df['New'] = df['New' - 1] * ( 1 - df['value' - 1] ), where the value is taken from the row above, which is what I mean by the "- 1" inside the brackets.
  • For each group a new row is added, in this case corresponding to number 9 + 1 = 10, so that the operation above can be applied to the last value as well.

So far what I've managed is the following:

df = df.set_index(['group', 'number'])

df['Constant'] = 1
df['New'] = df['Constant'] * (1 - df['value'])

def f(x):
    x.loc[('', 10), :] = ''
    return x

df = df.groupby(level=0, group_keys=False).apply(f)

df['New'] = df.groupby('group').New.shift(1)
            

But here the shift operation is not working for me, and I still need to preserve the value of the constant in the first position for df['New'] instead of NaN from shifting.

Any pointers and ways to clean up this code are greatly appreciated.

Edit: A simpler example would be like the following: (image not available)

CodePudding user response:

For each group, you can iterate through its rows and compute each row's value from the previous row.

In the code below, i is the position within each group and group.iloc[i].name gives the index label of that row in the original dataframe.

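K = 1.0  # YOUR CONSTANT (a float, so that 'new' is a float column)
df['new'] = K

def func(group):
    for i in range(1, len(group)):
        prev = group.iloc[i-1].name  # index label of the previous row in df
        # look rows up by label (loc), so this also works without a default 0..n-1 index
        df.loc[group.iloc[i].name, 'new'] = df.loc[prev, 'new'] * (1 - df.loc[prev, 'value'])

# func modifies df in place and returns nothing, so loop over the groups directly
for _, group in df.groupby('group'):
    func(group)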

For the following input, this gives the expected output:

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A'],
    'number' : [0, 1, 2, 3],
    'value' : [0.5, 0.4, 0.3, 0]
})

  group  number  value   new
0     A       0    0.5  1.00
1     A       1    0.4  0.50
2     A       2    0.3  0.30
3     A       3    0.0  0.21
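
The same recurrence can also be written without an explicit loop, since each value of new is just K times the running product of (1 - value) over the rows above it in the same group. The following is only a sketch under that reading of the question, using the small example above; the name factor is introduced here for illustration.

```python
import pandas as pd

K = 1.0  # assumed constant for the first row of every group

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A'],
    'number': [0, 1, 2, 3],
    'value': [0.5, 0.4, 0.3, 0.0],
})

# Running product of (1 - value) within each group
factor = (1 - df['value']).groupby(df['group']).cumprod()

# Shift by one row within each group so that every row only uses the rows
# above it; the first row of a group has nothing above it, so its factor is 1
df['new'] = K * factor.groupby(df['group']).shift(1).fillna(1.0)

print(df)
```

For large dataframes this should avoid the row-by-row assignments of the loop version, at the cost of being a little less explicit about the recurrence.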

Similarly, for the values of group, number and value below, the resulting dataframe would be:

   group  number     value       new
0      A       0  0.311951  1.000000
1      A       1  0.022941  0.688049
2      A       2  0.174398  0.672264
3      A       3  0.299853  0.555022
4      A       4  0.725469  0.388597
5      A       5  0.730307  0.106682
6      A       6  0.554905  0.028771
7      A       7  0.815290  0.012806
8      A       8  0.816718  0.002365
9      A       9  0.011935  0.000434
10     B       0  0.153680  1.000000
11     B       1  0.229228  0.846320
12     B       2  0.542225  0.652320
13     B       3  0.219170  0.298616
14     B       4  0.628088  0.233168
15     B       5  0.396675  0.086718
16     B       6  0.646968  0.052319
17     B       7  0.380830  0.018470
18     B       8  0.837341  0.011436
19     B       9  0.531990  0.001860
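
The code above does not add the extra row per group that the question asks for (number 10 in the original example). One possible way to do it is sketched below on a smaller, made-up dataframe. It assumes the extra row has no value of its own (NaN) and that number increases by 1 within each group; the names extra and out are introduced here for illustration.

```python
import numpy as np
import pandas as pd

K = 1.0  # assumed constant for the first row of every group

# Small made-up input: two groups with two rows each
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'number': [0, 1, 0, 1],
    'value': [0.5, 0.4, 0.2, 0.1],
})

# One extra row per group, numbered max(number) + 1, with no value of its own
extra = df.groupby('group', as_index=False)['number'].max()
extra['number'] += 1
extra['value'] = np.nan

out = (pd.concat([df, extra], ignore_index=True)
         .sort_values(['group', 'number'])
         .reset_index(drop=True))

# The NaN of the extra row only affects the last entry of each group's running
# product, and that entry is dropped by the shift
factor = (1 - out['value']).groupby(out['group']).cumprod()
out['new'] = K * factor.groupby(out['group']).shift(1).fillna(1.0)

print(out)
```

Each group then ends with a row whose new is computed from the last original value, which is what the third bullet of the question describes.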