Can I append to a numpy array based on the content of 2 columns?


I have this numpy array:

array = np.array(
        [[  1.,    1.,   82. , 177.,    0.,    0.,   -1. ],
         [  2.,    2.,   83. , 177.,    0.,    0.,    1. ],
         [  3.,    2.,   84. , 177.,    0.,    0.,    2. ],
         [  4.,    2.,   85. , 177.,    0.,    0.,    2. ],
         [  5.,    2.,   82.5, 177.,    0.,    0.,    2. ],
         [  6.,    2.,   83.5, 177.,    0.,    0.,    3. ]])

Then I have a list of new points to append:

new_points = np.array(
   [[  7.,    2.,   82.5, 177.,    0.,    0.,    2. ],
    [  8.,    2.,   83.5, 177.,    0.,    0.,    4. ],
    [  9.,    2.,   84.5, 177.,    0.,    0.,    4. ],
    [ 10.,    2.,   84. , 177.,    0.,    0.,    4. ]])

As you can see, some rows have the same values in the 3rd and 4th columns as rows already present in the array. So, I want to append only the points whose combination of values in the 3rd and 4th columns is not present in the original array.

The output that I expect is:

array = [[  1.    1.   82.  177.    0.    0.   -1. ]
         [  2.    2.   83.  177.    0.    0.    1. ]
         [  3.    2.   84.  177.    0.    0.    2. ]
         [  4.    2.   85.  177.    0.    0.    2. ]
         [  5.    2.   82.5 177.    0.    0.    2. ]
         [  6.    2.   83.5 177.    0.    0.    3. ]
         [  9.    2.   84.5 177.    0.    0.    4. ]]

CodePudding user response:

You can use the following approach to achieve that goal:

import numpy as np

array = np.array(
        [[  1.,    1.,   82. , 177.,    0.,    0.,   -1. ],
         [  2.,    2.,   83. , 177.,    0.,    0.,    1. ],
         [  3.,    2.,   84. , 177.,    0.,    0.,    2. ],
         [  4.,    2.,   85. , 177.,    0.,    0.,    2. ],
         [  5.,    2.,   82.5, 177.,    0.,    0.,    2. ],
         [  6.,    2.,   83.5, 177.,    0.,    0.,    3. ]])

new_points = np.array(
   [[  7.,    2.,   82.5, 177.,    0.,    0.,    2. ],
    [  8.,    2.,   83.5, 177.,    0.,    0.,    4. ],
    [  9.,    2.,   84.5, 177.,    0.,    0.,    4. ],
    [ 10.,    2.,   84. , 177.,    0.,    0.,    4. ]])

filtered_points = []
for point in new_points:
    # Keep the point only if its (3rd, 4th) column pair is not already in `array`
    if not np.any((array[:, 2] == point[2]) & (array[:, 3] == point[3])):
        filtered_points.append(point)

# Fall back to `array` if nothing new was found (np.concatenate rejects an empty list)
result = np.concatenate((array, filtered_points)) if filtered_points else array

print(result)

Output:

[[  1.    1.   82.  177.    0.    0.   -1. ]
 [  2.    2.   83.  177.    0.    0.    1. ]
 [  3.    2.   84.  177.    0.    0.    2. ]
 [  4.    2.   85.  177.    0.    0.    2. ]
 [  5.    2.   82.5 177.    0.    0.    2. ]
 [  6.    2.   83.5 177.    0.    0.    3. ]
 [  9.    2.   84.5 177.    0.    0.    4. ]]

Explanation:
1. Create an empty list called filtered_points.

2. Iterate over the elements in new_points. For each element, check if the combination of values in the 3rd and 4th columns is present in the array. If it is not present, append the element to filtered_points.

3. Use the np.concatenate function to concatenate array and filtered_points and assign the result to a new variable called result.

The resulting result array contains all rows of the original array plus only those rows from new_points whose combination of values in the 3rd and 4th columns was not already present in array.
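
As a variation (not part of the original answer), the same membership test can be done by building a Python set of (3rd column, 4th column) pairs once, so each new point is checked against the set instead of re-scanning the whole array. A sketch, assuming the same numpy array and new_points as above:

seen = {(row[2], row[3]) for row in array}
filtered_points = [p for p in new_points if (p[2], p[3]) not in seen]

# Fall back to the original array when nothing new was found, since
# np.concatenate cannot handle an empty list here.
result = np.concatenate((array, filtered_points)) if filtered_points else array
print(result)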

CodePudding user response:

A pure numpy vectorized solution

This is an interesting problem that shows how powerful numpy can be if you understand broadcasting. You can avoid for loops entirely here.

You can do this in a completely vectorized way: use broadcasting to compare the specific columns (3rd and 4th) for duplicates, reduce the result to a boolean mask over the new points, and stack the remaining rows onto the original array, as below. For more details, read the NumPy documentation on how broadcasting works.

cond = ~(array[:,None,2:4] == new_points[None,:,2:4]).all(-1).any(0)
updated_array = np.vstack([array,new_points[cond]])
updated_array
array([[  1. ,   1. ,  82. , 177. ,   0. ,   0. ,  -1. ],
       [  2. ,   2. ,  83. , 177. ,   0. ,   0. ,   1. ],
       [  3. ,   2. ,  84. , 177. ,   0. ,   0. ,   2. ],
       [  4. ,   2. ,  85. , 177. ,   0. ,   0. ,   2. ],
       [  5. ,   2. ,  82.5, 177. ,   0. ,   0. ,   2. ],
       [  6. ,   2. ,  83.5, 177. ,   0. ,   0. ,   3. ],
       [  9. ,   2. ,  84.5, 177. ,   0. ,   0. ,   4. ]])

EXPLANATION

Here is the flow of shapes for each step -

#Broadcasting rules

(6, 1, 2) # array[:,None,2:4]
(1, 4, 2) # new_points[None,:,2:4]
---------
(6, 4, 2) # == compare with broadcasting
---------
(6, 4)    # .all(-1)
(4,)      # .any(0)
(4,)      # ~ invert boolean
  1. array[:,None,2:4] and new_points[None,:,2:4] fetch the 3rd and 4th columns but also add a dummy dimension to each array. This makes their shapes (6, 1, 2) and (1, 4, 2) respectively. Why is this important?

  2. Because this allows us to use broadcasting to compare the 6 rows against the 4 rows. This is done with just the == step: array[:,None,2:4] == new_points[None,:,2:4] results in a (6, 4, 2) boolean array that compares every value of the 3rd and 4th columns across the 6 and 4 rows.

  3. Since you want both values to match exactly, you can use .all(-1), which reduces the last dimension and gives you a (6, 4) matrix of True and False values. (True means both values match, False means they don't.)

array([[False, False, False, False],
       [False, False, False, False],
       [False, False, False,  True],
       [False, False, False, False],
       [ True, False, False, False],
       [False,  True, False, False]])
  4. Since I want to flag the new points that are already present in the original matrix (based on the 3rd and 4th columns), I reduce the first axis (0) using .any(0) to get a (4,) boolean array
array([ True,  True, False,  True])
  5. Finally, we also want to flip True and False, since in boolean indexing True means keep and False means drop. This is done with the ~ at the start.
array([False, False,  True, False])  # this is the cond variable
  6. The last step is to filter the new_points array with this cond mask and stack the result onto array using np.vstack (see the shape-checking sketch below).
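
To make the shape flow above concrete, here is a small sketch (reusing the array and new_points defined earlier) that prints the shape produced at each step of the broadcasting chain:

# Sketch: reproduce the broadcasting chain step by step and print each shape.
# Assumes the same `array` (6 x 7) and `new_points` (4 x 7) as above.
a = array[:, None, 2:4]       # (6, 1, 2)
b = new_points[None, :, 2:4]  # (1, 4, 2)
eq = a == b                   # (6, 4, 2)  elementwise comparison via broadcasting
both = eq.all(-1)             # (6, 4)     True where both columns match
dup = both.any(0)             # (4,)       True for new points already in `array`
cond = ~dup                   # (4,)       True for genuinely new points

for name, step in [("a", a), ("b", b), ("eq", eq), ("both", both), ("dup", dup), ("cond", cond)]:
    print(name, step.shape)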

Benchmarks -

Just to show the power of vectorization, here are benchmarks comparing this solution with the loop-based answer above -

# Vectorized solution

%%timeit

cond = ~(array[:,None,2:4] == new_points[None,:,2:4]).all(-1).any(0)
updated_array = np.vstack([array,new_points[cond]])

# 9.77 µs ± 145 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

# For loop solution

%%timeit

filtered_points = []
for point in new_points:
    if not np.any((array[:,2] == point[2]) & (array[:,3] == point[3])):
        filtered_points.append(point)
result = np.concatenate((array, filtered_points))

# 28.8 µs ± 651 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

CodePudding user response:

Something like this should work:

import math

array = [
    [1., 1., 82. , 177., 0., 0., -1.],
    [2., 2., 83. , 177., 0., 0., 1.],
    [3., 2., 84. , 177., 0., 0., 2.],
    [4., 2., 85. , 177., 0., 0., 2.],
    [5., 2., 82.5, 177., 0., 0., 2.],
    [6., 2., 83.5, 177., 0., 0., 3.]
]

new_points = [
    [ 7., 2., 82.5, 177., 0., 0., 2.],
    [ 8., 2., 83.5, 177., 0., 0., 4.],
    [ 9., 2., 84.5, 177., 0., 0., 4.],
    [10., 2., 84. , 177., 0., 0., 4.]
]

for point in new_points:
    # Check if the combination of values in the 3rd and 4th columns is present in the array
    if not any(math.isclose(point[2], x[2]) and math.isclose(point[3], x[3]) for x in array):
        # If the combination is not present, append the point to the array
        array.append(point)

# The resulting array will contain the original points plus the new points that had a unique combination of values in the 3rd and 4th columns
print(array)

You shouldn't compare floats with the == operator, which is why math.isclose is used here.
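
If you keep array and new_points as numpy arrays (as in the earlier answers), np.isclose gives the same tolerance-based comparison in vectorized form. A minimal sketch of that variant:

import numpy as np

# Minimal sketch: the filtering loop from the first answer, but with a
# tolerance-based comparison (np.isclose) instead of exact float equality.
# Assumes `array` and `new_points` are the numpy arrays defined earlier.
filtered_points = [
    point for point in new_points
    if not np.any(np.isclose(array[:, 2], point[2]) & np.isclose(array[:, 3], point[3]))
]

# np.vstack accepts 1-D rows; fall back to `array` if nothing new was found
result = np.vstack([array] + filtered_points) if filtered_points else array
print(result)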
