Home > front end >  Long IF-ELSE vectorized DataFrame creation
Long IF-ELSE vectorized DataFrame creation

Time:04-19

Is there a way to vectorize the following in a DataFrame:

d = a.diff()
out = d >=0 and a > b ? 0 : d < 0 and a > b ? 1 : d <= 0 and a < b ? 2 : d >= 0 and a < b ? 3 : 4

Here, both a and b are a simple timeseries in the form of a DataFrame or Series with unimportant data. In my test files I generate them as such:

range = pd.date_range('2019-01-01', freq='MS', periods=10)
a = pd.DataFrame(np.random.randint(20, size=(10, 1)), 
                  columns=['value'], 
                  index=range)

. What I thought about doing was using map() to something that looks like this:

out = (
        (d >= 0) & (a > b)
).map({
    True: 0,
    False: (
            (d < 0) & (a > b)
    ).map({ ...

This is of course not correct, because of the False case, but hopefully you get the idea.

How can I vectorize this complex statement into a DataFrame?

CodePudding user response:

Perhaps the equivalent lambda and np.vectorize() would be of use?

# out = d >=0 and a > b ? 0 : d < 0 and a > b ? 1 : d <= 0 and a < b ? 2 : d >= 0 and a < b ? 3 : 4
import numpy as np
import pandas as pd

out = lambda a, b, d: 0 if d >= 0 and a > b else 1 if d < 0 and a > b else 2 if d <= 0 and a < b else 3 if d >= 0 and a < b else 4

out = np.vectorize(out)

df = pd.DataFrame({'A': {0: 1, 1: 2, 2: 3}, 'B': {0: 5, 1: 4, 2: 3}, 'C': {0: 4, 1: 4, 2: 4}})

print(out(df['A'], df['B'], df['C']))

Results:

[3 3 4]

CodePudding user response:

There are a couple of issues with the tests themselves:

  1. d = a.diff() means the first element of d is NaN. All the tests against a constant will return False (even d == d is False for the first element, which BTW is the canonical way to check for a NaN value).
  2. Your tests are not neatly separable into two independent tests (e.g. a < b on one hand an d < 0 on the other). Is that intentional? It sure complicates things.

First, let's express your conditions as Python ternary equivalents:

out = [
    0 if d >= 0 and a > b else
    1 if d < 0 and a > b else
    2 if d <= 0 and a < b else
    3 if d >= 0 and a < b else
    4 for a, b, d in zip(a.squeeze(), b.squeeze(), d.squeeze())
]

Obviously, that isn't vectorized.

To vectorize such an expression, let's look at the values table:

a < b a == b a > b
d < 0 2 4 1
d == 0 2 4 0
d > 0 3 4 0
d.isna() 4 4 4

You can get a vectorized implementation of that table (note: I'm just using the numpy arrays for concision):

a_lt_b = a <= b
d_lt_0 = d < 0
d_isna = np.isna(d)
a_eq_b = a == b

out2 = a_lt_b * 2   (d_lt_0 ^ a_lt_b)
out2[a_lt_b & (d == 0)] = 2
out2[d_isna | a_eq_b] = 4

Verification

a = np.tile([-1, 0, 1], 4)
b = np.zeros_like(a)
d = np.repeat([np.nan, -1, 0, 1], 3)
out = [
    0 if d >= 0 and a > b else
    1 if d < 0 and a > b else
    2 if d <= 0 and a < b else
    3 if d >= 0 and a < b else
    4 for a, b, d in zip(a, b, d)
]

# visualization
def view(df, valname='out'):
    df = df.assign(c1='a == b', c2='d == 0')
    df.loc[df.a < df.b, 'c1'] = 'a < b'
    df.loc[df.a > df.b, 'c1'] = 'a > b'
    df.loc[df.d < 0, 'c2'] = 'd < 0'
    df.loc[df.d > 0, 'c2'] = 'd > 0'
    df.loc[df.d.isna(), 'c2'] = 'd.isna()'
    return df.set_index(['c1', 'c2'])[valname].unstack('c1')

After the calculation of out2 above:

z = pd.DataFrame(dict(a=a, b=b, d=d, out=out, out2=out2))

>>> np.array_equal(z['out'], z['out2'])

>>> view(z, 'out2')  # same as for 'out'
c1        a < b  a == b  a > b
c2                            
d < 0         2       4      1
d == 0        2       4      0
d > 0         3       4      0
d.isna()      4       4      4
  • Related