Hopefully, this is not a duplicate. I have two data frames: The first data frame has size n x m, and each cell contains a list of numeric values of size k. The second data frame has size n x k, and each cell contains a single numeric value (essentially a weight). What I need to do is find an efficient way to do the following: For each cell in the first data frame, multiply the list of values in that cell by the series of values in the second data frame that corresponds to the same row. Then, sum all of the products together.
Here is a simple example:
df1
has size 2x3, and each cell contains a list of 3 values. df2
has size 2x3, and each cell contains a single value.
In [3]: df1
Out[3]:
x y z
0 [1, 2, 3] [2, 3, 4] [3, 4, 5]
1 [4, 5, 6] [5, 6, 7] [6, 7, 8]
In [5]: df2
Out[5]:
first second third
0 1 1 1
1 2 2 2
The intermediate result after multiplying the lists in df1
by the corresponding row series from df2
is:
x y z
0 [1, 2, 3] [2, 3, 4] [3, 4, 5]
1 [8, 10, 12] [10, 12, 14] [12, 14, 16]
And the final result should be:
x y z
0 6 9 12
1 30 36 42
Right now I'm just using a nested for loop, which works but is incredibly inefficient (of course). I tried using itertuples()
, but I couldn't get that to work properly. Any help here is much appreciated!
CodePudding user response:
Try:
# Convert each list to numpy array if it's not already the case
df1 = df1.applymap(np.array)
vals = np.sum((df1.values * df2.values), axis=1)
out = pd.DataFrame.from_records(vals, index=df1.index, columns=df1.columns)
Output:
>>> out
x y z
0 6 9 12
1 30 36 42
# Intermediate result
>>> df1.values * df2.values
[[array([1, 2, 3]) array([2, 3, 4]) array([3, 4, 5])]
[array([ 8, 10, 12]) array([10, 12, 14]) array([12, 14, 16])]]