I have a simple problem with Python, but would need it to be solved in less than a minute.
I have two 1D numpy arrays, x1 and x2 which can have different sizes, but count around 300,000 -> 500,000 elements. There is no repetition, values are not necessarily sorted.
x2 represents a transformed x1 array (so x1 is before, x2 is after).
First, I can obtain which elements were "inserted" (I) or "discarded" (D), or even staying (S) between the two steps.
I = list(set(x2) - set(x1))
D = list(set(x1) - set(x2))
S = list(set(x1) & set(x2))
Now, I would like to get a "transition" (T) array which give me the change in index of the elements between the two steps.
Let me give an example:
x1 = [3,5,2,6]
x2 = [1,6,5,3,7]
We have S=[3,5,6], I=[1,7], D=[2]
Here, we have for each x1 element:
- for 3: it goes from index 0 in x1 to 3 in x2 ->
3
- for 5: it goes from index 1 in x1 to 2 in x2 ->
2
- for 2: it gets discarded ->
nan
- for 6: it goes from index 3 in x1 to 1 in x2 ->
1
So I would like T to be [3, 2, nan, 1]
(associated with I = [1,7]
, I have all the information I actually need.
My first (simple) idea was to do:
T = []
for i in range(len(x1)):
for j in range(len(x2)):
if x1[i] == x2[j]:
transition[i] = j
break
However, the time to solve that is very large. So what I am looking for is a time-efficient way of creating this "transition array", and also to know how such thing is called in a more formal way.
Can anybody help me?
CodePudding user response:
There are array functions like np.unique
, np.argsort
, np.isin
, np.in1d
that can be used to test for set like operations. I believe there are some other related set
functions.
But a basic way of comparing arrays is with an outer '==':
In [60]: x1 = [3,5,2,6]; x2 = [1,6,5,3,7]
In [61]: np.array(x1)[:,None]==np.array(x2)
Out[61]:
array([[False, False, False, True, False],
[False, False, True, False, False],
[False, False, False, False, False],
[False, True, False, False, False]])
Going down the rows, 0,1,3 have one True, 2 has none. That looks like your T
.
Going across columns, 0 and 4 are all False. That's your I
.
In [62]: Out[61].any(axis=1)
Out[62]: array([ True, True, False, True])
In [63]: np.array(x1)[Out[61].any(axis=1)]
Out[63]: array([3, 5, 6])
In [64]: np.array(x2)[Out[61].any(axis=0)]
Out[64]: array([6, 5, 3])
In [65]: np.array(x2)[~Out[61].any(axis=0)]
Out[65]: array([1, 7])
CodePudding user response:
Thank you for your answer. Based on this, I wrote this script that gives T and I.
import numpy as np
import random
N = 350000
x1 = np.array(random.sample(range(N), int(N)))
x2 = np.array(random.sample(range(N), int(N)))
M = x1[:,None]==x2
r = np.argwhere(M)
T = np.nan * np.ones(x1.shape)
T[r[:,0]] = r[:,1]
I = np.where(np.isnan(T))[0]
This is more efficient, but still takes around 6 minutes for this to run. Is there anything I should do to accelerate this?