Iterating row-wise over 2 pandas dataframes and passing these vectors as args to function


I'd like to iterate row-wise over two identically shaped DataFrames, passing the corresponding rows from each as vectors to a function, without using explicit loops. Essentially, I want something similar to R's mapply.

I've investigated a little, and the best approach I've seen uses map inside a list comprehension, but I'm not doing it correctly. Even if I get this to work, though, it seems a bit clunky. Is there a more elegant way to do this? It seems like this should be built into pandas.

import numpy as np
import pandas as pd
from scipy import stats

df1 = pd.DataFrame(np.random.randn(3,3))
df2 = pd.DataFrame(np.random.randn(3,3))

sd_array = np.array([0.02, 0.015, 0.2])

def helper_func(x, y):
    return stats.norm.pdf(x, loc=y, scale=sd_array).prod()

res_lst = []
row_cnt = df1.shape[0]

res = [list(map(helper_func, df1.iloc[i,:], df2.iloc[i,:])) for i in range(row_cnt)]
res_lst.append(res)

The way I currently have it written doesn't give an error but also doesn't return what I want. I should only have 3 values in the output, one for each row of the dataframe.

CodePudding user response:

You can just call helper_func(df1, df2) and, inside helper_func, return stats.norm.pdf(x, loc=y, scale=sd_array).prod(axis=1). Be aware that your scale is so small that the returned values are almost always 0. Using scale=100*sd_array in the PDF will at least show some non-zero values.

In fact, you don't need a dataframe in this example:

import numpy as np
from scipy import stats

np.random.seed(1)

data1 = np.random.randn(3,3)
data2 = np.random.randn(3,3)

sd_array = np.array([0.02, 0.015, 0.2])

C = 100  # for demonstration purposes
def helper_func(x, y):
    return stats.norm.pdf(x, loc=y, scale=C*sd_array).prod(axis=1)

res = helper_func(data1, data2)
print(res)

yields

array([0.0002616 , 0.00068695, 0.00035566])

But when you use DataFrames instead of data1 and data2, NumPy, pandas, and SciPy are flexible enough to recognize the 2D array of values and treat it as such.
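For reference, a minimal sketch of the DataFrame variant, using df1, df2, and sd_array as defined in the question and the same 100* scaling as above for demonstration:

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(1)

df1 = pd.DataFrame(np.random.randn(3, 3))
df2 = pd.DataFrame(np.random.randn(3, 3))
sd_array = np.array([0.02, 0.015, 0.2])

# SciPy converts the DataFrames to 2D arrays; .prod(axis=1) collapses each row
res = stats.norm.pdf(df1, loc=df2, scale=100 * sd_array).prod(axis=1)
print(res)  # one value per row, shape (3,)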

CodePudding user response:

The problem with your implementation is that you are iterating over the rows and then also applying helper_func to each individual element of each row via map. So the first call is helper_func(df1.iloc[i, 0], df2.iloc[i, 0]), on a single element rather than on the first row.

You can fix your implementation by removing the inner loop:

res = [helper_func(df1.iloc[i,:], df2.iloc[i,:]) for i in range(row_cnt)]
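Put together, a minimal runnable sketch (assuming the original helper_func, df1, df2, and sd_array from the question) that yields one scalar per row:

import numpy as np
import pandas as pd
from scipy import stats

df1 = pd.DataFrame(np.random.randn(3, 3))
df2 = pd.DataFrame(np.random.randn(3, 3))
sd_array = np.array([0.02, 0.015, 0.2])

def helper_func(x, y):
    # x and y are whole rows (Series); .prod() multiplies the per-column PDF values
    return stats.norm.pdf(x, loc=y, scale=sd_array).prod()

row_cnt = df1.shape[0]
res = [helper_func(df1.iloc[i, :], df2.iloc[i, :]) for i in range(row_cnt)]
print(len(res))  # 3 values, one per row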

CodePudding user response:

I prefer the method in my other answer, using vectorized NumPy calculations with .prod(axis=1), but to answer the question in the title: you can use zip and .iterrows():

assert len(df1) == len(df2)  # just to check
res = [helper_func(row1, row2) for (_, row1), (_, row2) in 
       zip(df1.iterrows(), df2.iterrows())]

(This requires the original helper_func, without axis=1 in the .prod() call.)

You need the underscores to discard the index labels that .iterrows() yields alongside each row (it works much like enumerate() in plain Python, producing (index, row) pairs).
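A quick illustration of what .iterrows() yields, as a minimal sketch using df1 from the question:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(3, 3))

for idx, row in df1.iterrows():
    # idx is the index label, row is the row as a pandas Series
    print(idx, type(row))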
