Track changes in indices for numpy array-CodePudding

I have a simple problem with Python, but would need it to be solved in less than a minute.

I have two 1D numpy arrays, x1 and x2 which can have different sizes, but count around 300,000 -> 500,000 elements. There is no repetition, values are not necessarily sorted.

x2 represents a transformed x1 array (so x1 is before, x2 is after).

First, I can obtain which elements were "inserted" (I) or "discarded" (D), or even staying (S) between the two steps.

I = list(set(x2) - set(x1))
D = list(set(x1) - set(x2))
S = list(set(x1) & set(x2))

Now, I would like to get a "transition" (T) array which give me the change in index of the elements between the two steps.

Let me give an example: x1 = [3,5,2,6] x2 = [1,6,5,3,7] We have S=[3,5,6], I=[1,7], D=[2]

Here, we have for each x1 element:

for 3: it goes from index 0 in x1 to 3 in x2 -> 3
for 5: it goes from index 1 in x1 to 2 in x2 -> 2
for 2: it gets discarded -> nan
for 6: it goes from index 3 in x1 to 1 in x2 -> 1

So I would like T to be [3, 2, nan, 1] (associated with I = [1,7], I have all the information I actually need.

My first (simple) idea was to do:

T = []
for i in range(len(x1)):
    for j in range(len(x2)):
        if x1[i] == x2[j]:
            transition[i] = j
            break

However, the time to solve that is very large. So what I am looking for is a time-efficient way of creating this "transition array", and also to know how such thing is called in a more formal way.

Can anybody help me?

CodePudding user response：

There are array functions like np.unique, np.argsort, np.isin, np.in1d that can be used to test for set like operations. I believe there are some other related set functions.

But a basic way of comparing arrays is with an outer '==':

In [60]: x1 = [3,5,2,6]; x2 = [1,6,5,3,7]
In [61]: np.array(x1)[:,None]==np.array(x2)
Out[61]: 
array([[False, False, False,  True, False],
       [False, False,  True, False, False],
       [False, False, False, False, False],
       [False,  True, False, False, False]])

Going down the rows, 0,1,3 have one True, 2 has none. That looks like your T.

Going across columns, 0 and 4 are all False. That's your I.

In [62]: Out[61].any(axis=1)
Out[62]: array([ True,  True, False,  True])
In [63]: np.array(x1)[Out[61].any(axis=1)]
Out[63]: array([3, 5, 6])
In [64]: np.array(x2)[Out[61].any(axis=0)]
Out[64]: array([6, 5, 3])
In [65]: np.array(x2)[~Out[61].any(axis=0)]
Out[65]: array([1, 7])

CodePudding user response：

Thank you for your answer. Based on this, I wrote this script that gives T and I.

import numpy as np
import random
N = 350000
x1 = np.array(random.sample(range(N), int(N)))  
x2 = np.array(random.sample(range(N), int(N)))  
M = x1[:,None]==x2  
r = np.argwhere(M)  
T = np.nan * np.ones(x1.shape)  
T[r[:,0]] = r[:,1]
I = np.where(np.isnan(T))[0]

This is more efficient, but still takes around 6 minutes for this to run. Is there anything I should do to accelerate this?