What I need
I need to create new columns in pandas dataframes, based on a set of nested if statements. E.g.
if city == 'London' and income > 10000:
    return 'group 1'
elif city == 'Manchester' or city == 'Leeds':
    return 'group 2'
elif borrower_age > 50:
    return 'group 3'
else:
    return 'group 4'
This is actually a simplification - in most cases I'd need to create something like 10 or more possible outputs, not 4 as above, but hopefully you get the gist.
The issue
My problem is that I have not found a way to make the code acceptably fast.
I understand that, if the choice were binary, I could use something like numpy.where(), but I have not found a way to vectorise the code or otherwise make it fast enough. I suppose I could nest a number of np.where statements and that would be faster, but then the code would be harder to read and much more prone to errors - see the sketch below.
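For illustration, this is roughly what the nested version would look like for the four-group example above (my sketch, using the column names from the toy example further down):

df1['new_field'] = np.where(
    (df1['city'] == 'London') & (df1['income'] > 10000), 'group 1',
    np.where(df1['city'].isin(['Manchester', 'Leeds']), 'group 2',
             np.where(df1['borrower age'] > 50, 'group 3', 'group 4')))

Manageable with four groups, but with 10 or more the nesting becomes hopeless.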
What I have tried
I have tried the following:
| Method                                        | Time (secs) |
|-----------------------------------------------|-------------|
| dataframe.apply                               | 29          |
| dataframe.apply on a numba-optimised function | 31          |
| sqlite                                        | 16          |
"sqlite" means: loading the dataframe into a sqlite in-memory database, creating the new field there, and then exporting back to a dataframe
Sqlite is faster but still unacceptably slow: the same thing on a SQL Server running on the same machine takes less than a second. I'd rather not rely on an external SQL server, though, because the code should be able to run even on machines with no access to a SQL server.
I also tried to create a numba function which loops through the rows one by one, but I understand that numba doesn't support strings (or at least I couldn't get that to work).
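For reference, this is the kind of loop I mean - a sketch (not the exact code I ran) that sidesteps the string limitation by factorising the city column to integer codes first:

# factorise the string column to integer codes so nopython mode only sees numerics
city_codes, city_labels = pd.factorize(df1['city'])
LONDON = list(city_labels).index('London')
MANCHESTER = list(city_labels).index('Manchester')
LEEDS = list(city_labels).index('Leeds')

@numba.jit(nopython=True)
def assign_groups(city, income, age):
    # returns an integer group index per row; string labels are applied afterwards
    out = np.empty(city.shape[0], dtype=np.int8)
    for i in range(city.shape[0]):
        if city[i] == LONDON and income[i] > 10000:
            out[i] = 0
        elif city[i] == MANCHESTER or city[i] == LEEDS:
            out[i] = 1
        elif age[i] > 50:
            out[i] = 2
        else:
            out[i] = 3
    return out

groups = assign_groups(city_codes, df1['income'].to_numpy(), df1['borrower age'].to_numpy())
df1['new_field numba_loop'] = np.array(['group 1', 'group 2', 'group 3', 'group 4'])[groups]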
Toy example
import numpy as np
import pandas as pd
import sqlite3
import time
import numba
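# build a 1e6-row toy dataframe with the columns used in the grouping logic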
start = time.time()
running_time = pd.Series(dtype=float)
n = int(1e6)
df1 = pd.DataFrame()
df1['income'] = np.random.lognormal(0.4, 0.4, n) * 20e3
df1['loan balance'] = np.maximum(0, np.minimum(30e3, 5e3 * np.random.randn(n) + 20e3))
df1['city'] = np.random.choice(['London', 'Leeds', 'Truro', 'Manchester', 'Liverpool'], n)
# note: casting city to fixed-width bytes with .astype('|S80') (an attempt to make
# the column numba-friendly) would break the string comparisons below, so the
# column is left as ordinary strings
df1['borrower age'] = np.maximum(22, np.minimum(70, 30 * np.random.randn(n) + 30))
df1['# children'] = np.random.choice([0, 1, 2, 3], n, p=[0.4, 0.3, 0.25, 0.05])
df1['rate'] = np.maximum(0.5e-2, np.minimum(10e-2, 1e-2 * np.random.randn(n) + 4e-2))
running_time['data frame creation'] = time.time() - start
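# sqlite approach: load the frame into an in-memory database, add the new
# column with a CASE expression, then read everything back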
conn = sqlite3.connect(":memory:", detect_types = sqlite3.PARSE_DECLTYPES)
cur = conn.cursor()
df1.to_sql("df1", conn, if_exists='replace')
cur.execute("ALTER TABLE df1 ADD new_field nvarchar(80)")
cur.execute('''UPDATE df1
               SET new_field = case when city = 'London' AND income > 10000 then 'group 1'
                                    when city = 'Manchester' or city = 'Leeds' then 'group 2'
                                    when "borrower age" > 50 then 'group 3'
                                    else 'group 4'
                               end
            ''')
df_from_sql = pd.read_sql('select * from df1', conn)
running_time['sql lite'] = time.time() - start
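# plain pandas apply, calling a row-wise Python function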
def my_func(city, income, borrower_age):
if city == 'London' and income > 10000:
return 'group 1'
elif city == 'Manchester' or city == 'Leeds':
return 'group 2'
elif borrower_age > 50:
return 'group 3'
else:
return 'group 4'
df1['new_field'] = df1.apply(lambda x: my_func(x['city'], x['income'], x['borrower age']), axis=1)
running_time['data frame apply'] = time.time() - start
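# same row-wise function, numba-jitted - it is still invoked once per row
# through apply, so the per-call overhead remains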
@numba.jit(nopython = True)
def my_func_numba_apply(city, income, borrower_age):
if city == 'London' and income > 10000:
return 'group 1'
elif city == 'Manchester' or city == 'Leeds':
return 'group 2'
elif borrower_age > 50:
return 'group 3'
else:
return 'group 4'
df1['new_field numba_apply'] = df1.apply(lambda x: my_func_numba_apply(x['city'], x['income'], x['borrower age']), axis=1)
running_time['data frame numba'] = time.time() - start
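# turn the cumulative timestamps into per-step durations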
x = np.concatenate(([0], running_time))
execution_time = pd.Series(np.diff(x) , running_time.index)
print(execution_time)
Other questions I have found
I have found a number of other questions, but none which directly addresses my point.
Most other questions were either easy to vectorise (e.g. just two choices, so np.where works well) or they recommended a numba-based solution, which in my case actually happens to be slower.
E.g.
Speed up applying function to a list of pandas dataframes
How to speed up apply method with lambda in pandas with datetime - about dates, so not really applicable
speed up pandas apply or using map - about joins, so not really applicable
CodePudding user response:
Try with numpy.select:
conditions = [df1["city"].eq("London") & df1["income"].gt(10000),
              df1["city"].isin(["Manchester", "Leeds"]),
              df1["borrower age"].gt(50)]
choices = ["Group 1", "Group 2", "Group 3"]
df1["New Field"] = np.select(conditions, choices, "Group 4")
Or keep the conditions in a dictionary and use its values and keys in the np.select call:
conditions = {"Group 1": df1["city"].eq("London") & df1["income"].gt(10000),
"Group 2": df1["city"].isin(["Manchester", "Leeds"]),
"Group 3": df1["borrower age"].gt(50)}
df["New Field"] = np.select(conditions.values(), conditions.keys(), "Group 4")