Is there a way to vectorize the following in a DataFrame
:
d = a.diff()
out = d >=0 and a > b ? 0 : d < 0 and a > b ? 1 : d <= 0 and a < b ? 2 : d >= 0 and a < b ? 3 : 4
Here, both a
and b
are a simple timeseries in the form of a DataFrame
or Series
with unimportant data. In my test files I generate them as such:
range = pd.date_range('2019-01-01', freq='MS', periods=10)
a = pd.DataFrame(np.random.randint(20, size=(10, 1)),
columns=['value'],
index=range)
. What I thought about doing was using map()
to something that looks like this:
out = (
(d >= 0) & (a > b)
).map({
True: 0,
False: (
(d < 0) & (a > b)
).map({ ...
This is of course not correct, because of the False
case, but hopefully you get the idea.
How can I vectorize this complex statement into a DataFrame?
CodePudding user response:
Perhaps the equivalent lambda
and np.vectorize()
would be of use?
# out = d >=0 and a > b ? 0 : d < 0 and a > b ? 1 : d <= 0 and a < b ? 2 : d >= 0 and a < b ? 3 : 4
import numpy as np
import pandas as pd
out = lambda a, b, d: 0 if d >= 0 and a > b else 1 if d < 0 and a > b else 2 if d <= 0 and a < b else 3 if d >= 0 and a < b else 4
out = np.vectorize(out)
df = pd.DataFrame({'A': {0: 1, 1: 2, 2: 3}, 'B': {0: 5, 1: 4, 2: 3}, 'C': {0: 4, 1: 4, 2: 4}})
print(out(df['A'], df['B'], df['C']))
Results:
[3 3 4]
CodePudding user response:
There are a couple of issues with the tests themselves:
d = a.diff()
means the first element ofd
isNaN
. All the tests against a constant will returnFalse
(evend == d
isFalse
for the first element, which BTW is the canonical way to check for aNaN
value).- Your tests are not neatly separable into two independent tests (e.g.
a < b
on one hand and < 0
on the other). Is that intentional? It sure complicates things.
First, let's express your conditions as Python ternary equivalents:
out = [
0 if d >= 0 and a > b else
1 if d < 0 and a > b else
2 if d <= 0 and a < b else
3 if d >= 0 and a < b else
4 for a, b, d in zip(a.squeeze(), b.squeeze(), d.squeeze())
]
Obviously, that isn't vectorized.
To vectorize such an expression, let's look at the values table:
a < b |
a == b |
a > b |
|
---|---|---|---|
d < 0 |
2 | 4 | 1 |
d == 0 |
2 | 4 | 0 |
d > 0 |
3 | 4 | 0 |
d.isna() |
4 | 4 | 4 |
You can get a vectorized implementation of that table (note: I'm just using the numpy
arrays for concision):
a_lt_b = a <= b
d_lt_0 = d < 0
d_isna = np.isna(d)
a_eq_b = a == b
out2 = a_lt_b * 2 (d_lt_0 ^ a_lt_b)
out2[a_lt_b & (d == 0)] = 2
out2[d_isna | a_eq_b] = 4
Verification
a = np.tile([-1, 0, 1], 4)
b = np.zeros_like(a)
d = np.repeat([np.nan, -1, 0, 1], 3)
out = [
0 if d >= 0 and a > b else
1 if d < 0 and a > b else
2 if d <= 0 and a < b else
3 if d >= 0 and a < b else
4 for a, b, d in zip(a, b, d)
]
# visualization
def view(df, valname='out'):
df = df.assign(c1='a == b', c2='d == 0')
df.loc[df.a < df.b, 'c1'] = 'a < b'
df.loc[df.a > df.b, 'c1'] = 'a > b'
df.loc[df.d < 0, 'c2'] = 'd < 0'
df.loc[df.d > 0, 'c2'] = 'd > 0'
df.loc[df.d.isna(), 'c2'] = 'd.isna()'
return df.set_index(['c1', 'c2'])[valname].unstack('c1')
After the calculation of out2
above:
z = pd.DataFrame(dict(a=a, b=b, d=d, out=out, out2=out2))
>>> np.array_equal(z['out'], z['out2'])
>>> view(z, 'out2') # same as for 'out'
c1 a < b a == b a > b
c2
d < 0 2 4 1
d == 0 2 4 0
d > 0 3 4 0
d.isna() 4 4 4