Pandas apply custom function to each dataframe row and append results-CodePudding

How can I apply a custom function to each row of a Pandas dataframe df1, where:

the function uses values from a column in df1
the function uses values from another dataframe df2
the results are appended to df1 column-wise

Example:

df1 = pd.DataFrame([1, 2, 3], columns=["x"])

df2 = pd.DataFrame({"set1": [0, 0, 0, 0], "set2": [100, 200, 300, 400]})

display(df1, df2)

And custom function

def myfunc(df2, x=df1["x"]):
    # Something simple but custom
    ans = df2["set1"]   df2["set2"] * x
    return ans

Desired output is

	x	run1	run2	run3	run4
0	1	100	200	300	400
1	2	200	400	600	800
2	3	300	600	900	1200

Here is an example function call; but how can I apply it with a oneliner to get the desired dataframe output?

test = myfunc(df2,x=3)
print(test)

CodePudding user response：

You can do

df1 = df1.join(df1.apply(lambda x  : myfunc(df2, x['x']),axis=1))
Out[152]: 
   x    0    1    2     3
0  1  100  200  300   400
1  2  200  400  600   800
2  3  300  600  900  1200

CodePudding user response：

This is specific to your example myfunc but it is possible to vectorize with dot

df1[['x']].dot(
    df2['set1'].add(df2['set2']).to_frame().T.values
).rename(
    columns={i:f'run{i 1}' for i in df2.index}
).assign(
    x = df1['x'],
)

CodePudding user response：

If you really need a custom function, you can use apply:

# Modified slightly to make using it easier~
def myfunc(x, df2):
    return df2["set1"]   df2["set2"] * x

df1 = df1.join(df1.x.apply(myfunc, args=(df2,)).add_prefix('run'))
print(df1)

# Output:

   x  run0  run1  run2  run3
0  1   100   200   300   400
1  2   200   400   600   800
2  3   300   600   900  1200

That said, there's often a way to do whatever you want to do using pandas methods:

df = df1.merge(df2, 'cross')
df['value'] = df.set1   df.set2 * df.x
df['run'] = df.groupby('x')['value'].cumcount()   1
df = df.pivot(index='x', columns='run', values='value')
df.columns = [f'{df.columns.name}{x}' for x in df.columns]
print(df.reset_index())

# Output:

   x  run1  run2  run3  run4
0  1   100   200   300   400
1  2   200   400   600   800
2  3   300   600   900  1200