I have a DataFrame; the code to create it is:
import pandas as pd
from io import StringIO

data = """
ValOption RB test
0 SLA 4 3
1 AC 5 4
2 SLA 5 5
3 AC 2 4
4 SLA 5 5
5 AC 5 4
6 SLA 4 3
"""
df = pd.read_csv(StringIO(data.strip()), sep=r'\s+')
Output:
ValOption RB test
0 SLA 4 3
1 AC 5 4
2 SLA 5 5
3 AC 2 4
4 SLA 5 5
5 AC 5 4
6 SLA 4 3
Now I want to group the rows that have the same 'ValOption' and 'RB' together:
df.sort_values(['ValOption', 'RB']).set_index(['ValOption', 'RB'])
Output:
              test
ValOption RB      
AC        2      4
          5      4
          5      4
SLA       4      3
          4      3
          5      5
          5      5
Note: since there are many, many rows and the values of RB are dynamic, I can't hardcode a filter like:
df.loc[(df['ValOption'] == 'AC') & (df['RB'] == 5)]
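(For context, a minimal sketch of what "dynamic" means here: groupby enumerates the (ValOption, RB) pairs by itself, so no values need to be hardcoded:)
for (val_option, rb), group in df.groupby(['ValOption', 'RB']):
    # each `group` is a sub-DataFrame holding all rows with this pair of values
    print(val_option, rb, len(group))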
Now I want to apply a function to each row:
def func(row):
    v1 = row['RB'] * 3
    v2 = row['test'] - 1
    return v1 + v2

df['new_col'] = df.apply(func, axis=1)
Output:
ValOption RB test new_col
0 SLA 4 3 14
1 AC 5 4 18
2 SLA 5 5 19
3 AC 2 4 9
4 SLA 5 5 19
5 AC 5 4 18
6 SLA 4 3 14
However, in my real use case the data is very large and the function is extremely complicated; that's the reason I can only use pd.apply to run the function row by row. I know this is a very slow method compared to using numpy:
def func(df):
    v1 = df['RB'].values * 3
    v2 = df['test'].values - 1
    return v1 + v2

df['new_col'] = func(df)
This code gets the same result and runs faster, but my real function is too complicated to vectorize; I have tried for 2 weeks, so pandas.apply is my only option.
So my question is: after grouping the rows that share the same column values, how can I apply the function to each group but only to its first row? The variables are identical within a group, so instead of repeating the calculation I can directly assign the first row's result to the other rows of the same group.
The final goal is to save time when running the program on large files.
The code is something like (pseudocode):
new_value_of_first_row = each_group(['ValOption', 'RB']).first_row.apply(func)
new_value_of_other_rows = new_value_of_first_row
CodePudding user response:
IIUC, you want to apply a function only to the first row of each group.
You could create a mask (with groupby.cumcount) and then use where (or mask) to assign the output of your function on the selected rows:
df['RB_new'] = df['RB'].where(df.groupby(['ValOption', 'RB']).cumcount().ne(0),
                              df['RB'] * 3  # to replace with your (vector) function
                              )
output (in a new column for clarity):
  ValOption  RB  test  RB_new
0       SLA   4     3      12
1        AC   5     4      15
2       SLA   5     5      15
3        AC   2     4       6
4       SLA   5     5       5
5        AC   5     4       5
6       SLA   4     3       4
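If the replacement can't be vectorized, the same cumcount mask also works with a row-wise apply: run the function on the first rows only, then broadcast the result within each group (a sketch, reusing func(row) from the question):
first = df.groupby(['ValOption', 'RB']).cumcount().eq(0)
df.loc[first, 'new_col'] = df.loc[first].apply(func, axis=1)   # slow function on first rows only
df['new_col'] = df.groupby(['ValOption', 'RB'])['new_col'].transform('first')  # broadcast to the rest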
Alternative (useful for non-vectorial functions):
def func(s):
    s = s.copy()
    s.iloc[0] *= 3
    return s

df['RB_new'] = (df.groupby(['ValOption', 'RB'],
                           as_index=False, sort=False)['RB']
                  .transform(func)
               )
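Note that transform feeds the function one column at a time, so this variant suits single-column functions. A quick check that both approaches agree on the sample frame (assuming the RB_new column and the func defined just above):
assert df['RB_new'].equals(
    df.groupby(['ValOption', 'RB'], sort=False)['RB'].transform(func)
)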
CodePudding user response:
Here is a way to remove duplicates on a dataframe, perform a computation, and then restore the duplicates to the original shape.
For this, it saves the index of the first row per group, then drop_duplicates, and after the computation, reindex:
# save indexer
idx = df.groupby(['ValOption', 'RB', 'test'])['ValOption'].transform(lambda s: s.index[0])

# drop duplicates
df2 = df.drop_duplicates().copy()

# perform computation (to replace with the actual function, eventually with apply)
df2['newcol'] = df2['RB']*3 + df2['test'] - 1

# reindex to original shape
df2.reindex(idx).reset_index(drop=True)
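If the computation must stay row-wise, the same pattern works with apply on the deduplicated frame, then writing the result back to the original (a sketch, reusing the row-wise func(row) from the question):
df2['newcol'] = df2.apply(func, axis=1)                 # slow function on unique rows only
df['new_col'] = df2['newcol'].reindex(idx).to_numpy()   # broadcast back to all rows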