Iterating row-wise over 2 pandas dataframes and passing these vectors as args to function


I'd like to iterate row-wise over two identically shaped DataFrames, passing the corresponding rows from each as vectors to a function, without using explicit loops. Essentially, I want something similar to R's mapply.

I've investigated a little, and the best approach I've seen uses map inside a list comprehension, but I'm not doing it correctly. Even if I get this to work, though, it seems a bit clunky. Is there a more elegant way to do this? It seems like this should be built into pandas.

import numpy as np
import pandas as pd
from scipy import stats

df1 = pd.DataFrame(np.random.randn(3,3))
df2 = pd.DataFrame(np.random.randn(3,3))

sd_array = np.array([0.02, 0.015, 0.2])

def helper_func(x, y):
    return stats.norm.pdf(x, loc=y, scale=sd_array).prod()

res_lst = []
row_cnt = df1.shape[0]

res = [list(map(helper_func, df1.iloc[i,:], df2.iloc[i,:])) for i in range(row_cnt)]
res_lst.append(res)

The way I currently have it written doesn't give an error but also doesn't return what I want. I should only have 3 values in the output, one for each row of the dataframe.

CodePudding user response:

You can just call helper_func(df1, df2) and, inside helper_func, return stats.norm.pdf(x, loc=y, scale=sd_array).prod(axis=1). Be aware that your scale is so small that the returned values are almost always 0. Using scale=100*sd_array in the PDF will at least show some non-zero values.

In fact, you don't need a dataframe in this example:

import numpy as np
from scipy import stats

np.random.seed(1)

data1 = np.random.randn(3,3)
data2 = np.random.randn(3,3)

sd_array = np.array([0.02, 0.015, 0.2])

C = 100  # for demonstration purposes
def helper_func(x, y):
    return stats.norm.pdf(x, loc=y, scale=C*sd_array).prod(axis=1)

res = helper_func(data1, data2)
print(res)

yields

array([0.0002616 , 0.00068695, 0.00035566])

But when you use DataFrames instead of data1 and data2, NumPy, pandas, and SciPy are flexible enough to recognize the 2D array of values and treat it as such.
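For reference, a minimal sketch of the DataFrame variant, using df1, df2, and sd_array as defined in the question and the same 100* scaling as above for demonstration:

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(1)

df1 = pd.DataFrame(np.random.randn(3, 3))
df2 = pd.DataFrame(np.random.randn(3, 3))
sd_array = np.array([0.02, 0.015, 0.2])

# SciPy converts the DataFrames to 2D arrays; .prod(axis=1) collapses each row
res = stats.norm.pdf(df1, loc=df2, scale=100 * sd_array).prod(axis=1)
print(res)  # one value per row, shape (3,)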

CodePudding user response:

The problem with your implementation is that you are iterating over the rows and then also applying helper_func to each individual element of each row via map. So the first call is helper_func(df1.iloc[i, 0], df2.iloc[i, 0]), on a single element rather than on the first row.

You can fix your implementation by removing the inner loop:

res = [helper_func(df1.iloc[i,:], df2.iloc[i,:]) for i in range(row_cnt)]
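Put together, a minimal runnable sketch (assuming the original helper_func, df1, df2, and sd_array from the question) that yields one scalar per row:

import numpy as np
import pandas as pd
from scipy import stats

df1 = pd.DataFrame(np.random.randn(3, 3))
df2 = pd.DataFrame(np.random.randn(3, 3))
sd_array = np.array([0.02, 0.015, 0.2])

def helper_func(x, y):
    # x and y are whole rows (Series); .prod() multiplies the per-column PDF values
    return stats.norm.pdf(x, loc=y, scale=sd_array).prod()

row_cnt = df1.shape[0]
res = [helper_func(df1.iloc[i, :], df2.iloc[i, :]) for i in range(row_cnt)]
print(len(res))  # 3 values, one per row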

CodePudding user response:

I prefer the method in my other answer, using vectorized NumPy calculations with .prod(axis=1), but to answer the question in the title: you can use zip and .iterrows():

assert len(df1) == len(df2)  # just to check
res = [helper_func(row1, row2) for (_, row1), (_, row2) in 
       zip(df1.iterrows(), df2.iterrows())]

(This requires the original helper_func, without axis=1 in the .prod() call.)

You need the underscores to discard the index labels that .iterrows() yields alongside each row (it works much like enumerate() in plain Python, producing (index, row) pairs).
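A quick illustration of what .iterrows() yields, as a minimal sketch using df1 from the question:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(3, 3))

for idx, row in df1.iterrows():
    # idx is the index label, row is the row as a pandas Series
    print(idx, type(row))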
