Basically, I have a problem in which I have two arrays of length L: a data array (let's call it D), representing my actual data, and a validity array (called here V), with boolean values, saying which of these values are valid.
For instance, imagine I have:
D = [10, 20, 40, 1000, 2000, -1000, 50, 20, 1000]
V = [1, 1, 1, 0, 0, 0, 1, 1, 0]
In this case, my V array indicates that the values at indexes 3, 4, 5 and 8 are invalid.
For each of these indexes, I want to replace the corresponding data value D[i] with the closest valid data. So, my index-finding function would give:
f(V) = [0, 1, 2, 2, 2, 6, 6, 7, 7]
(or f(V) = [0, 1, 2, 2, 6, 6, 6, 7, 7], it doesn't really matter which way ties are resolved)
In this case, I could correct my D array with:
D[i] = D[f(V)[i]] (or, vectorized, D = D[f(V)])
And get:
D = [10, 20, 40, 40, 40, 50, 50, 20, 20]
Is something like this implemented in Python? If not, how could I implement this easily?
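The replacement described above can be sketched in plain Python; this is only a reference implementation, and the helper name nearest_valid_indices is mine:

```python
# Plain-Python sketch of the index-finding function f described above:
# for each invalid position, take the index of the nearest valid entry
# (ties resolve to the left, matching the first f(V) variant).
def nearest_valid_indices(V):
    valid = [i for i, ok in enumerate(V) if ok]
    return [i if ok else min(valid, key=lambda j: abs(j - i))
            for i, ok in enumerate(V)]

D = [10, 20, 40, 1000, 2000, -1000, 50, 20, 1000]
V = [1, 1, 1, 0, 0, 0, 1, 1, 0]
f = nearest_valid_indices(V)        # [0, 1, 2, 2, 2, 6, 6, 7, 7]
D_fixed = [D[j] for j in f]         # [10, 20, 40, 40, 40, 50, 50, 20, 20]
```

This is O(n*m) in the worst case (n positions times m valid indices), so it is only meant to pin down the expected behavior before reaching for a vectorized version.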
CodePudding user response:
You can use pandas and interpolate:
import pandas as pd

df = pd.DataFrame({'D': D, 'V': V})
D2 = (df['D']
 .mask(df['V'].eq(0))             # replace invalid values with NaN
 .interpolate(method='nearest')   # fill from the nearest valid neighbor (requires scipy)
 .ffill(downcast='infer')         # 'nearest' doesn't extrapolate, so fill trailing NaNs
 .tolist()
)
output: [10, 20, 40, 40, 40, 50, 50, 20, 20]
CodePudding user response:
If you are willing to use numpy, that can be done pretty concisely (although this involves an O(m*k) implicit loop, with m and k the number of valid and invalid indices):
import numpy as np
d = np.array(D)
v = np.array(V, dtype=bool)
valid = np.nonzero(v)[0]                         # indices of valid entries
invalid = np.nonzero(~v)[0]                      # indices of invalid entries
diff = np.abs(invalid[:, None] - valid[None, :]) # pairwise index distances
out = np.arange(len(d))
out[~v] = valid[np.argmin(diff, axis=1)]
>>> out
array([0, 1, 2, 2, 2, 6, 6, 7, 7])
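If the O(m*k) distance matrix becomes a concern for large arrays, the same nearest-index fill can be done with np.searchsorted; this is a sketch under the same setup, and the names valid, invalid, nearest are mine:

```python
import numpy as np

D = [10, 20, 40, 1000, 2000, -1000, 50, 20, 1000]
V = [1, 1, 1, 0, 0, 0, 1, 1, 0]

d = np.array(D)
v = np.array(V, dtype=bool)
valid = np.nonzero(v)[0]        # sorted indices of valid entries
invalid = np.nonzero(~v)[0]     # indices to fill

# For each invalid index, find where it would slot into the sorted valid
# indices, then compare the neighbor on each side and keep the closer one
# (ties resolve to the left neighbor).
pos = np.searchsorted(valid, invalid)
left = np.clip(pos - 1, 0, len(valid) - 1)
right = np.clip(pos, 0, len(valid) - 1)
nearest = np.where(invalid - valid[left] <= valid[right] - invalid,
                   valid[left], valid[right])

out = np.arange(len(d))
out[~v] = nearest               # array([0, 1, 2, 2, 2, 6, 6, 7, 7])
```

searchsorted on the sorted valid indices replaces the full pairwise comparison, bringing the cost down to roughly O(n log m).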