Numpy: Perform an "Upsert" by adding columns based on others-CodePudding

I have two numpy arrays, x and y. They each have 3 columns and the first two columns are identifying in the sense of a relational database compound primary key. I want to merge these two arrays based on these compound primary keys, while adding the third column where the two arrays overlap.

>>> x = np.array([[1, 2, 1.5], [3, 4, 2.5], [5, 6, 3.5]])
>>> x
array([[1. , 2. , 1.5],
       [3. , 4. , 2.5],
       [5. , 6. , 3.5]])
>>> y = np.array([[1, 2, 4.5], [3, 4, 5.5], [7, 8, 6.5]])
>>> y
array([[1. , 2. , 4.5],
       [3. , 4. , 5.5],
       [7. , 8. , 6.5]])

I need a method to create array z, which combines x and y like an upsert:

>>> z = np.array([[1, 2, 6.0], [3, 4, 7.0], [5, 6, 3.5], [7, 8, 6.5]])
>>> z
array([[1. , 2. , 6. ],
       [3. , 4. , 7. ],
       [5. , 6. , 3.5],
       [7. , 8. , 6.5]])

Is it possible to do this in Numpy?

CodePudding user response：

This operation can be done in the following steps.

Get a mask whether the first two columns are equal.

mask = x[:, :2] == y[:, :2]

Update the existing values:

x[np.all(mask, axis=1), 2] = x[np.all(mask, axis=1), 2]   y[np.all(mask, axis=1), 2]

Inserting non existing rows:

x = np.append(x, y[np.any(~mask, axis=1)], axis=0)

Edit: For unsorted arrays of different size.

In this case I couldn't find a way as clean as the above case. But the following will perform the job for unsorted and nonequal size arrays.

import numpy as np

x = np.array([[1, 2, 1.5], [3, 4, 2.5], [5, 6, 3.5]])

y = np.array([[1, 2, 4.5], [7, 8, 6.5], [3, 4, 5.5], [8, 9, 7.5]])

print(x)
print(y)

non_existing_rows = []
# Update existing rows and store non existing rows.
for row in y:
    mask = np.all(x[:,:2] == row[:2], axis = 1)
    if np.all(~mask):
        # The row does not exist in x.
        non_existing_rows.append(row)
    else:
        # The row exists in x.
        x[mask,2] = x[mask,2]   row[2]
# Insert non existing rows.
x = np.append(x, non_existing_rows, axis=0)

print(x)

I assume that the size of x is bigger than y in general, so it is better to iterate over y. Also appending operation is faster for lists, so it is better to first aggregate non existing rows in a list and then append them to the numpy array as a whole.