I have a DataFrame; the code to create it is:
import pandas as pd
from io import StringIO

data = """
ValOption RB test
0 SLA 4 3
1 AC 5 4
2 SLA 5 5
3 AC 2 4
4 SLA 5 5
5 AC 5 4
6 SLA 4 3
"""
df = pd.read_csv(StringIO(data.strip()), sep=r'\s+')
Output:
ValOption RB test
0 SLA 4 3
1 AC 5 4
2 SLA 5 5
3 AC 2 4
4 SLA 5 5
5 AC 5 4
6 SLA 4 3
Now I want to group the rows that have the same 'ValOption' and 'RB' together:
df.sort_values(['ValOption', 'RB']).set_index(['ValOption', 'RB'])
Output:
              test
ValOption RB      
AC        2      4
          5      4
          5      4
SLA       4      3
          4      3
          5      5
          5      5
Note: since there are many, many rows and the values of RB are dynamic, I can't hardcode a filter like:
df.loc[(df['ValOption'] == 'AC') & (df['RB'] == 5)]
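(For context, a minimal sketch of what "dynamic" means here: groupby enumerates the (ValOption, RB) pairs by itself, so no values need to be hardcoded:)
for (val_option, rb), group in df.groupby(['ValOption', 'RB']):
    # each `group` is a sub-DataFrame holding all rows with this pair of values
    print(val_option, rb, len(group))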
Now I want to apply a function to each row:
def func(row):
    v1 = row['RB'] * 3
    v2 = row['test'] - 1
    return v1 + v2

df['new_col'] = df.apply(func, axis=1)
Output:
ValOption RB test new_col
0 SLA 4 3 14
1 AC 5 4 18
2 SLA 5 5 19
3 AC 2 4 9
4 SLA 5 5 19
5 AC 5 4 18
6 SLA 4 3 14
However, in my real use case the data is very large and the function is extremely complicated; that's the reason I can only use pd.apply to run the function row by row. I know this is a very slow method compared to using numpy:
def func(df):
    v1 = df['RB'].values * 3
    v2 = df['test'].values - 1
    return v1 + v2

df['new_col'] = func(df)
This code gets the same result and runs faster, but my real function is too complicated to vectorize; I have tried for 2 weeks, so pandas.apply is my only option.
So my question is: after grouping the rows that share the same column values, how can I apply the function to each group but only to its first row? The variables are identical within a group, so instead of repeating the calculation I can directly assign the first row's result to the other rows of the same group.
The final goal is to save time when running the program on large files.
The code is something like (pseudocode):
new_value_of_first_row = each_group(['ValOption', 'RB']).first_row.apply(func)
new_value_of_other_rows = new_value_of_first_row
CodePudding user response:
IIUC, you want to apply a function only to the first row of each group.
You could create a mask (with groupby.cumcount) and then use where (or mask) to assign the output of your function on the selected rows:
df['RB_new'] = df['RB'].where(df.groupby(['ValOption', 'RB']).cumcount().ne(0),
                              df['RB'] * 3  # to replace with your (vector) function
                              )
output (in a new column for clarity):
  ValOption  RB  test  RB_new
0       SLA   4     3      12
1        AC   5     4      15
2       SLA   5     5      15
3        AC   2     4       6
4       SLA   5     5       5
5        AC   5     4       5
6       SLA   4     3       4
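If the replacement can't be vectorized, the same cumcount mask also works with a row-wise apply: run the function on the first rows only, then broadcast the result within each group (a sketch, reusing func(row) from the question):
first = df.groupby(['ValOption', 'RB']).cumcount().eq(0)
df.loc[first, 'new_col'] = df.loc[first].apply(func, axis=1)   # slow function on first rows only
df['new_col'] = df.groupby(['ValOption', 'RB'])['new_col'].transform('first')  # broadcast to the rest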
Alternative (useful for non-vectorial functions):
def func(s):
    s = s.copy()
    s.iloc[0] *= 3
    return s

df['RB_new'] = (df.groupby(['ValOption', 'RB'],
                           as_index=False, sort=False)['RB']
                  .transform(func)
               )
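Note that transform feeds the function one column at a time, so this variant suits single-column functions. A quick check that both approaches agree on the sample frame (assuming the RB_new column and the func defined just above):
assert df['RB_new'].equals(
    df.groupby(['ValOption', 'RB'], sort=False)['RB'].transform(func)
)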
CodePudding user response:
Here is a way to remove duplicates on a dataframe, perform a computation, and then restore the duplicates to the original shape.
For this, it saves the index of the first row per group, then drop_duplicates, and after the computation, reindex:
# save indexer
idx = df.groupby(['ValOption', 'RB', 'test'])['ValOption'].transform(lambda s: s.index[0])

# drop duplicates
df2 = df.drop_duplicates().copy()

# perform computation (to replace with the actual function, eventually with apply)
df2['newcol'] = df2['RB']*3 + df2['test'] - 1

# reindex to original shape
df2.reindex(idx).reset_index(drop=True)
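If the computation must stay row-wise, the same pattern works with apply on the deduplicated frame, then writing the result back to the original (a sketch, reusing the row-wise func(row) from the question):
df2['newcol'] = df2.apply(func, axis=1)                 # slow function on unique rows only
df['new_col'] = df2['newcol'].reindex(idx).to_numpy()   # broadcast back to all rows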