The problem is as follows.
I have a pandas dataFrame looking something like
row_idx clm_idx value
0 0 1 a1
1 0 2 b1
2 1 3 c1
3 2 3 d1
This dataFrame can have up to m lines (probably quite a lot).
Secondly I have a nxn numpy array A where I need to add those values to depending on the columns 'row_idx' and 'clm_idx' which describe the row and column index in the numpy array. The mathematical function can even be more complex than simply adding. But I set that aside for now.
A = [[a b c d],
[e f g h],
[i k l m],
[n o p q]]
Here as an example its a 4x4 matrix. So I would like to get the following in the end:
A_new = [[a b a1 c b1 d],
[e f g h c1],
[i k l m d1],
[n o p q]]
I assume I can iterate over all rows of the dataframe somehow extract the indices and then add the values at the corresponding index to the nxn array. But that seems somewhat inefficient.
I tried as well:
df.apply(lambda x: A[x['row_idx'], x['clm_idx']] = x['value'])
But this throws a SyntaxError that one cannot contain an assignment in this context.
Is there an efficient way to solve the problem, assuming that there m and n are quite big?
As a basic code block this should do. The problem is the last line:
import numpy as np
import pandas as pd
data = {'row_idx': [0, 0, 1, 2], 'clm_idx': [1,2,3,3], 'value': [1,2,3,4]}
df = pd.DataFrame(data)
A = np.zeros((4,4))
df.apply(lambda x: A[x['row_idx'], x['clm_idx']] = x['value'], axis=1)
CodePudding user response:
Numpy is able to do indexing directly, you don't need any pandas pre-processing:
A = np.zeros((4, 4))
A[df['row_idx'], df['clm_idx']] = df['value']
Output:
array([[0., 1., 2., 0.],
[0., 0., 0., 3.],
[0., 0., 0., 4.],
[0., 0., 0., 0.]])
Other example:
A = np.ones((4, 4)) * 100
A[df['row_idx'], df['clm_idx']] = df['value']
array([[100., 101., 102., 100.],
[100., 100., 100., 103.],
[100., 100., 100., 104.],
[100., 100., 100., 100.]])
CodePudding user response:
example:
data = [[0, 1, 100], [0, 2, 200], [1, 3, 300], [2, 3, 400]]
df = pd.DataFrame(data, columns=['row_idx', 'clm_idx', 'value'])
df
row_idx clm_idx value
0 0 1 100
1 0 2 200
2 1 3 300
3 2 3 400
a1 = np.arange(0, 16).reshape(4,4)
a1
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
use apply
with axis=1
def func(x):
a1[x['row_idx'], x['clm_idx']] = x['value']
df.apply(func , axis=1)
output(a1):
array([[ 0, 101, 202, 3],
[ 4, 5, 6, 307],
[ 8, 9, 10, 411],
[ 12, 13, 14, 15]])
CodePudding user response:
First for performant solution never use apply
, it is really slow.
Use DataFrame.pivot
with DataFrame.reindex
for DataFrame with same shape like A
and then add values:
data = {'row_idx': [0, 0, 1, 2], 'clm_idx': [1,2,3,3], 'value': [1,2,3,4]}
df = pd.DataFrame(data)
A = np.zeros((4,4))
df = (df.pivot('row_idx','clm_idx','value')
.reindex(index=range(A.shape[0]), columns=range(A.shape[0])).fillna(0))
#if duplicated row_idx, clm_idx pairs
#df = (df.pivot_table(index='row_idx',columns='clm_idx',values='value', aggfunc='sum')
# .reindex(index=range(A.shape[0]), columns=range(A.shape[0])).fillna(0))
print (df)
clm_idx 0 1 2 3
row_idx
0 0.0 1.0 2.0 0.0
1 0.0 0.0 0.0 3.0
2 0.0 0.0 0.0 4.0
3 0.0 0.0 0.0 0.0
A = A df.to_numpy()
print (A)
[[0. 1. 2. 0.]
[0. 0. 0. 3.]
[0. 0. 0. 4.]
[0. 0. 0. 0.]]