How to create a new dataframe column with a set of nested IF rules (apply is very slow)-CodePudding

What I need

I need to create new columns in pandas dataframes, based on a set of nested if statements. E.g.

if city == 'London' and income > 10000:
    return 'group 1'
elif city == 'Manchester' or city == 'Leeds':
    return 'group 2'
elif borrower_age > 50:
    return 'group 3'
else:
    return 'group 4'

This is actually a simplifcation - in most cases I'd need to create something like 10 or more possible outputs, not 4 as above, but you hopefully get the gist.

The issue

My problem is that I have not found a way to make the code acceptably fast. I understand that, if the choice were binary, I could use something like numpy.where() , but I have not found a way to vectorise the code or anyway to make it fast enough. I suppose I could probably nest a number of np.where statements and that would be faster, but then the code would be harder to read and much more prone to errors.

What I have tried

I have tried the following:

 ──────────────────────────────────────────────── ────────────── 
| Method                                         | Time (secs)  |
 ──────────────────────────────────────────────── ────────────── 
| dataframe.apply                                | 29           |
| dataframe.apply on a numba-optimised function  | 31           |
| sqlite                                         | 16           |
 ──────────────────────────────────────────────── ──────────────

"sqlite" means: loading the dataframe into a sqlite in-memory database, creating the new field there, and then exporting back to a dataframe

Sqlite is faster but still unacceptably slow: the same thing on a SQL Server running on the same machine takes less than a second. I'd rather not rely on an external SQL server, though, because the code should be able to run even on machines with no access to a SQL server.

I also tried to create a numba function which loops through the rows one by one, but I understand that numba doesn't support strings (or at least I couldn't get that to work).

Toy example

import numpy as np
import pandas as pd
import sqlite3
import time
import numba
start = time.time()
running_time = pd.Series()

n = int(1e6)



df1 = pd.DataFrame()
df1['income']=np.random.lognormal(0.4,0.4, n) *20e3
df1['loan balance'] = np.maximum(0, np.minimum(30e3,  5e3 * np.random.randn(n)   20e3 ) )
df1['city'] = np.random.choice(['London','Leeds','Truro','Manchester','Liverpool'] , n )
df1['city'] = df1['city'].astype('|S80')
df1['borrower age'] = np.maximum(22, np.minimum(70,  30 * np.random.randn(n)   30 ) )
df1['# children']=np.random.choice( [0,1,2,3], n, p= [0.4,0.3,0.25,0.05] )
df1['rate'] = np.maximum(0.5e-2, np.minimum(10e-2,  1e-2 * np.random.randn(n)   4e-2 ) )

running_time['data frame creation'] = time.time() - start


conn = sqlite3.connect(":memory:", detect_types = sqlite3.PARSE_DECLTYPES)
cur = conn.cursor()

df1.to_sql("df1", conn, if_exists ='replace')

cur.execute("ALTER TABLE df1 ADD new_field nvarchar(80)")

cur.execute('''UPDATE df1
            SET new_field = case when city = 'London' AND income > 10000 then 'group 1'
            when city = 'Manchester' or city = 'Leeds' then 'group 2'
            when 'borrower age' > 50 then 'group 3'
            else 'group 4'
            end
            ''')
df_from_sql = pd.read_sql('select * from df1', conn)

running_time['sql lite'] = time.time() - start


def my_func(city, income, borrower_age):
    if city == 'London' and income > 10000:
        return 'group 1'
    elif city == 'Manchester' or city == 'Leeds':
        return 'group 2'
    elif borrower_age > 50:
        return 'group 3'
    else:
        return 'group 4'
    
df1['new_field'] = df1.apply( lambda x: my_func( x['city'], x['income'], x['borrower age'] )  , axis =1)

running_time['data frame apply'] = time.time() - start

@numba.jit(nopython = True)
def my_func_numba_apply(city, income, borrower_age):
    if city == 'London' and income > 10000:
        return 'group 1'
    elif city == 'Manchester' or city == 'Leeds':
        return 'group 2'
    elif borrower_age > 50:
        return 'group 3'
    else:
        return 'group 4'
    
df1['new_field numba_apply'] = df1.apply( lambda x: my_func_numba_apply( x['city'], x['income'], x['borrower age'] )  , axis =1)
running_time['data frame numba'] = time.time() - start

   
x = np.concatenate(([0], running_time))
execution_time = pd.Series(np.diff(x) , running_time.index)


print(execution_time)

What I need

The issue

What I have tried

Toy example

Other questions I have found