I have the following numpy array
import numpy as np
arr = np.array([1,1,1,2,2,2,3,3,2,2,2,1,1,1,2,2])
I split this array into parts, where each part is a run of consecutively repeated values, using this question
def consecutive(data, stepsize=1):
    return np.split(data, np.where(np.diff(data) != stepsize)[0] + 1)

consecutive(arr, stepsize=0)
which yields
[array([1, 1, 1]),
array([2, 2, 2]),
array([3, 3]),
array([2, 2, 2]),
array([1, 1, 1]),
array([2, 2])]
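For reference, the split above can be checked end to end (same helper as in the question, with the `+ 1` in place):

```python
import numpy as np

def consecutive(data, stepsize=1):
    # split wherever the gap between neighbours differs from `stepsize`
    return np.split(data, np.where(np.diff(data) != stepsize)[0] + 1)

arr = np.array([1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2])
parts = consecutive(arr, stepsize=0)
print([p.tolist() for p in parts])
# [[1, 1, 1], [2, 2, 2], [3, 3], [2, 2, 2], [1, 1, 1], [2, 2]]
```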
I would like, for every sub-part above, if its (unique) element has appeared before, to add to this sub-part 0.001 * times_of_appearances_before_that
I tried this:
arr_f = []
times_appeared_dict = dict(zip([str(l) for l in list(np.unique(arr))], [-1]*len(list(np.unique(arr))))) # dictionary which will count the times of appearances
for sub_arr in consecutive(arr, stepsize=0):
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[str(np.unique(sub_arr)[0])] = times_appeared_dict[str(np.unique(sub_arr)[0])] + 1
# then add the 0.0001 to the elements, starting from the end
arr_ff = []
for sub_arr in reversed(consecutive(arr, stepsize=0)):
    sub_arr_f = sub_arr + 0.0001*times_appeared_dict[str(np.unique(sub_arr)[0])]
    times_appeared_dict[str(np.unique(sub_arr)[0])] = times_appeared_dict[str(np.unique(sub_arr)[0])] - 1
    arr_ff.append(sub_arr_f)
arr_ff = np.concatenate(arr_ff).ravel()
# revert the order back to initial
arr_fff = []
for sub_arr in reversed(consecutive(arr_ff, stepsize=0)):
    arr_fff.append(sub_arr)
arr_fff = np.concatenate(arr_fff).ravel()
arr_fff
which yields
array([1. , 1. , 1. , 2. , 2. , 2. , 3. , 3. ,
2.0001, 2.0001, 2.0001, 1.0001, 1.0001, 1.0001, 2.0002, 2.0002])
which is the correct result. I was wondering if there is a smarter way to do that (avoiding all these loops etc)
CodePudding user response:
Analysis of your code
Your code is quite convoluted, so let's go through it and comment it step by step.
times_appeared_dict = dict(zip([str(l) for l in list(np.unique(arr))], [-1]*len(list(np.unique(arr))))) # dictionary which will count the times of appearances
# As Ahmed said, no need to use str(l) as key here. l is a better key, and we save a `str` conversion
times_appeared_dict = dict(zip([l for l in list(np.unique(arr))], [-1]*len(list(np.unique(arr)))))
# Also `np.unique(arr)` is iterable. No need to iterate over list(np.unique(arr))
times_appeared_dict = dict(zip([l for l in np.unique(arr)], [-1]*len(np.unique(arr))))
# Since you use np.unique(arr) 2 times, let's compute it only once
listUniq = np.unique(arr)
times_appeared_dict = dict(zip([l for l in listUniq], [-1]*len(listUniq)))
# [l for l in aList] is just aList
times_appeared_dict = dict(zip(listUniq, [-1]*len(listUniq)))
Now the next for loop
for sub_arr in consecutive(arr, stepsize=0):
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[str(np.unique(sub_arr)[0])] = times_appeared_dict[str(np.unique(sub_arr)[0])] + 1
# We have decided to use integers as keys, so let's remove str
for sub_arr in consecutive(arr, stepsize=0):
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[np.unique(sub_arr)[0]] = times_appeared_dict[np.unique(sub_arr)[0]] + 1
# As we did for listUniq, since we use consecutive(arr, stepsize=0) several times, let's compute it once
conseq = consecutive(arr, stepsize=0)
for sub_arr in conseq:
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[np.unique(sub_arr)[0]] = times_appeared_dict[np.unique(sub_arr)[0]] + 1
# np.unique(sub_arr)[0] can't be anything else than sub_arr[0] (and sub_arr can't be empty)
conseq = consecutive(arr, stepsize=0)
for sub_arr in conseq:
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[sub_arr[0]] = times_appeared_dict[sub_arr[0]] + 1
# Since you just increase times_appeared_dict[sub_arr[0]], it is better to use += 1. It avoids evaluating twice the place where times_appeared_dict[sub_arr[0]] is stored
conseq = consecutive(arr, stepsize=0)
for sub_arr in conseq:
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[sub_arr[0]] += 1
# The test np.unique(sub_arr) in arr_f_tmp is void:
# sub_arr has to be in arr_f_tmp, you put it there just a few lines sooner
conseq = consecutive(arr, stepsize=0)
for sub_arr in conseq:
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    times_appeared_dict[sub_arr[0]] += 1
# So we don't really need arr_f_tmp
conseq = consecutive(arr, stepsize=0)
for sub_arr in conseq:
    arr_f.append(sub_arr)
    times_appeared_dict[sub_arr[0]] += 1
Next loop
for sub_arr in reversed(consecutive(arr, stepsize=0)):
    sub_arr_f = sub_arr + 0.0001*times_appeared_dict[str(np.unique(sub_arr)[0])]
    times_appeared_dict[str(np.unique(sub_arr)[0])] = times_appeared_dict[str(np.unique(sub_arr)[0])] - 1
    arr_ff.append(sub_arr_f)
# Again, remove str, use conseq,
# replace np.unique(x)[0] by x[0], as before,
# and, as for += 1 before, use -= 1
for sub_arr in reversed(conseq):
    sub_arr_f = sub_arr + 0.0001*times_appeared_dict[sub_arr[0]]
    times_appeared_dict[sub_arr[0]] -= 1
    arr_ff.append(sub_arr_f)
And the last one. All you are doing here is concatenating the content of arr_ff in reverse order
arr_fff = []
for sub_arr in arr_ff[::-1]:
    arr_fff.extend(sub_arr)
# Since I've used extend and not append, no need for concatenate/ravel. Just turn the list into an ndarray
arr_fff = np.array(arr_fff)
(Note: I've checked that each of these steps improves the timing. Most of the time it is obvious, but for some it was not a priori certain.)
Optimized version of your code
So, all together, this is still your code, with the optimizations proposed above (no change at all in the algorithm, just some Python optimization)
def orgopt(data):
    arr_f = []
    listUniq = np.unique(data)
    conseq = consecutive(data, stepsize=0)
    times_appeared_dict = dict(zip(listUniq, [-1]*len(listUniq))) # dictionary which will count the times of appearances
    for sub_arr in conseq:
        arr_f.append(sub_arr)
        times_appeared_dict[sub_arr[0]] += 1
    # then add the 0.0001 to the elements, starting from the end
    arr_ff = []
    for sub_arr in reversed(conseq):
        sub_arr_f = sub_arr + 0.0001*times_appeared_dict[sub_arr[0]]
        times_appeared_dict[sub_arr[0]] -= 1
        arr_ff.append(sub_arr_f)
    # revert the order back to initial
    arr_fff = []
    for sub_arr in arr_ff[::-1]:
        arr_fff.extend(sub_arr)
    return np.array(arr_fff)
Optimisation of algorithm
Now, going further than just Python optimization, one could wonder why you would count the number of occurrences forward first, and then apply the additions backward. It forces you to work in 2 steps, and then to reverse the list. Hence the 3 loops.
You could just count the occurrences and apply the 0.0001 addition in the same loop, going forward.
def forward(data):
    r = []
    d = {}
    for l in consecutive(data, stepsize=0):
        k = d.get(l[0], 0)
        r.extend(l + 0.0001*k)
        d[l[0]] = k + 1
    return np.array(r)
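As a quick sanity check (with the `+` signs restored and the `data` argument actually used), `forward` reproduces the expected result from the question:

```python
import numpy as np

def consecutive(data, stepsize=1):
    return np.split(data, np.where(np.diff(data) != stepsize)[0] + 1)

def forward(data):
    r = []
    d = {}
    for l in consecutive(data, stepsize=0):
        k = d.get(l[0], 0)          # previous occurrences of this run's value
        r.extend(l + 0.0001 * k)    # shift the whole run at once
        d[l[0]] = k + 1
    return np.array(r)

arr = np.array([1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2])
expected = np.array([1, 1, 1, 2, 2, 2, 3, 3,
                     2.0001, 2.0001, 2.0001,
                     1.0001, 1.0001, 1.0001, 2.0002, 2.0002])
assert np.allclose(forward(arr), expected)
```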
That is still the same principle (using consecutive
to get the sublists, and then counting in a dictionary), but simpler: we add to each sublist 0.0001 × the number of previous occurrences.
Note that dic.get(key, 0)
is just a way to avoid filling the dictionary first, and so avoid computing np.unique
: it returns 0 if key is absent, and the value associated with key otherwise. So the dictionary is implicitly initialized to 0.
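An equivalent idiom (an aside, not used in the answer above) is collections.defaultdict(int), which also makes missing keys read as 0:

```python
from collections import defaultdict

d = defaultdict(int)   # missing keys read as 0
d[2] += 1
d[2] += 1
print(d[2], d[7])
# 2 0
```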
Numpy version
This last version was not a Python optimization, but an algorithm optimization. Yet it is still the same principle. See what we can do by rethinking the whole problem.
You have an array
arr=np.array([1,1,1,2,2,2,3,3,2,2,2,1,1,1,2,2])
Let's add a sentinel at the beginning
arrs=np.concatenate(([-1],arr))
So now arrs=[-1,1,1,1,2,2,2,3,3,2,2,2,1,1,1,2,2]
The reason I am doing so is that now, if I compute
np.diff(arrs)
I get an array of differences with the same shape as arr (not 1 less), starting with a nonzero value: [ 2, 0, 0, 1, 0, 0, 1, 0, -1, 0, 0, -1, 0, 0, 1, 0]
So np.diff(arrs)!=0
is a boolean array
[ True, False, False, True, False, False, True, False, True, False, False, True, False, False, True, False]
And 1*(np.diff(arrs)!=0)
is the same but as 0/1 :
[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
So, simply, a 1 everywhere we encounter a new value (different from the previous one, including the first), 0 otherwise
Now, multiply that array by, say, (arr==2)
, and we keep only the 1s
where that new value is 2.
1*(np.diff(arrs)!=0)*(arr==2)
# [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
Now, compute cumulative sum of that
np.cumsum(1*(np.diff(arrs)!=0)*(arr==2))
# Now we could lose the `1*` since cumsum of a boolean array gives the same result
# [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3]
So, what we get is the number of occurrences of runs of 2. We wanted previous occurrences, so let's subtract 1
np.cumsum((np.diff(arrs)!=0)*(arr==2))-1
# [-1, -1, -1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2]
If we ignore places where arr is not 2, we have the number of previous occurrences of 2. Let's filter that by multiplying again by arr==2
(np.cumsum((np.diff(arrs)!=0)*(arr==2))-1)*(arr==2)
# array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 2])
If we multiply that by 0.0001, we have what we need to add to the 2 values.
Just do that for all values, not just 2.
def wcumsum(data):
    arrs = np.concatenate(([-1], data))
    nn = (np.diff(arrs) != 0)*1
    return data + 0.0001*sum((np.cumsum((data==k)*nn)-1)*(data==k) for k in np.unique(data))
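A quick check that the vectorized version (with the `+` signs restored and its `data` argument used) also matches the expected output:

```python
import numpy as np

def wcumsum(data):
    arrs = np.concatenate(([-1], data))   # sentinel so the very first run counts too
    nn = (np.diff(arrs) != 0) * 1         # 1 at the start of each run
    return data + 0.0001 * sum(
        (np.cumsum((data == k) * nn) - 1) * (data == k) for k in np.unique(data))

arr = np.array([1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2])
expected = np.array([1, 1, 1, 2, 2, 2, 3, 3,
                     2.0001, 2.0001, 2.0001,
                     1.0001, 1.0001, 1.0001, 2.0002, 2.0002])
assert np.allclose(wcumsum(arr), expected)
```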
Timings
(On a list of 320 elements. Short lists are less favorable to the numpy version :-). On your 16-element list, it was just the same, or almost so.)
Method | Timing |
---|---|
Your code | 6359 μs |
Optimized version of your code | 437 μs |
1-pass algorithm | 318 μs |
cumsum | 53 μs |
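For reference, a minimal harness along these lines is how such per-call timings are typically taken (absolute numbers will vary by machine; the 320-element input is built by tiling the question's array):

```python
import timeit
import numpy as np

def consecutive(data, stepsize=1):
    return np.split(data, np.where(np.diff(data) != stepsize)[0] + 1)

def forward(data):
    r, d = [], {}
    for l in consecutive(data, stepsize=0):
        k = d.get(l[0], 0)
        r.extend(l + 0.0001 * k)
        d[l[0]] = k + 1
    return np.array(r)

# 16-element pattern tiled 20 times -> 320 elements, as in the timing note
arr = np.tile(np.array([1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2]), 20)
t = timeit.timeit(lambda: forward(arr), number=100) / 100
print(f"forward: {t * 1e6:.0f} µs per call")
```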
CodePudding user response:
you could create the (value, length) pairs from your split function and then just build your output, like this:
splits = consecutive(arr, stepsize=0)
l2 = np.array(list(map(lambda x: (x[0], len(x)), splits)), dtype=float)
a, b = l2[:,0], l2[:,1]
for val in np.unique(a):
    idx = np.where(a == val)[0]
    a[idx] = val + np.arange(len(idx)) * 0.001
out = np.array([val for val, l in zip(a, b) for i in range(int(l))])
output:
array([1. , 1. , 1. , 2. , 2. , 2. , 3. , 3. , 2.001,
2.001, 2.001, 1.001, 1.001, 1.001, 2.002, 2.002])