Say I have two pandas dataframes (df_a & df_b), where each row represents a toy and features about that toy. Some pretend features:
- Was_Sold (Y/N)
- Color
- Size_Group
- Shape
- Date_Made
Say df_a is relatively small (10s of thousands of rows) and df_b is relatively large (>1 million rows).
Then for every row in df_a, I want to:
- Find all the toys from df_b with the same type as the one from df_a (e.g. the same color group)
- The df_b toys must also be made before the given df_a toy
- Then find the ratio of those sold (So count sold / count all matched)
What is the most efficient means to make those per-row calculations above?
The best I've came up with so far is something like the below. (Note code might have an error or two as I'm rough typing from a different use case)
cols = ['Color', 'Size_Group', 'Shape']
# Run this calculation for multiple features
for col in cols:
print(col ' - Started')
# Empty list to build up the calculation in
ratio_list = []
# Start the iteration
for row in df_a.itertuples(index=False):
# Relevant values from df_a
relevant_val = getattr(row, col)
created_date = row.Date_Made
# df to keep the overall prior toy matches
prior_toys = df_b[(df_b.Date_Made < created_date) & (df_b[col] == relevant_val)]
prior_count = len(prior_toys)
# Now find the ones that were sold
prior_sold_count = len(prior_toys[prior_toys.Was_Sold == "Y"])
# Now make the calculation and append to the list
if prior_count == 0:
ratio = 0
else:
ratio = prior_sold_count / prior_count
ratio_list.append(ratio)
# Store the calculation in the original df_a
df_a[col '_Prior_Sold_Ratio'] = ratio_list
print(col ' - Finished')
Using .itertuples()
is useful, but this is still pretty slow. Is there a more efficient method or something I'm missing?
EDIT Added the below script which will emulated data for the above scenario:
import numpy as np
import pandas as pd
colors = ['red', 'green', 'yellow', 'blue']
sizes = ['small', 'medium', 'large']
shapes = ['round', 'square', 'triangle', 'rectangle']
sold = ['Y', 'N']
size_df_a = 200
size_df_b = 2000
date_start = pd.to_datetime('2015-01-01')
date_end = pd.to_datetime('2021-01-01')
def random_dates(start, end, n=10):
start_u = start.value//10**9
end_u = end.value//10**9
return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
df_a = pd.DataFrame(
{
'Color': np.random.choice(colors, size_df_a),
'Size_Group': np.random.choice(sizes, size_df_a),
'Shape': np.random.choice(shapes, size_df_a),
'Was_Sold': np.random.choice(sold, size_df_a),
'Date_Made': random_dates(date_start, date_end, n=size_df_a)
}
)
df_b = pd.DataFrame(
{
'Color': np.random.choice(colors, size_df_b),
'Size_Group': np.random.choice(sizes, size_df_b),
'Shape': np.random.choice(shapes, size_df_b),
'Was_Sold': np.random.choice(sold, size_df_b),
'Date_Made': random_dates(date_start, date_end, n=size_df_b)
}
)
CodePudding user response:
First of all, I think your computation would be much more efficient using relational database and SQL query. Indeed, the filters can be done by indexing columns, performing a database join, some advance filtering and count the result. An optimized relational database can generate an efficient algorithm based on a simple SQL query (hash-based row grouping, binary search, fast intersection of sets, etc.). Pandas is sadly not very good to perform efficiently advanced requests like this. It is also very slow to iterate over pandas dataframe although I am not sure this can be alleviated in this case using only pandas. Hopefully you can use some Numpy and Python tricks and (partially) implement what fast relational database engines would do.
Additionally, pure-Python object types are slow, especially (unicode) strings. Thus, **converting column types to efficient ones in a first place can save a lot of time (and memory). For example, there is no need for the Was_Sold
column to contains "Y"/"N" string objects: a boolean can just be used in that case. Thus let us convert that:
df_b.Was_Sold = df_b.Was_Sold == "Y"
Finally, the current algorithm has a bad complexity: O(Na * Nb)
where Na
is the number of rows in df_a
and Nb
is the number of rows in df_b
. This is not easy to improve though due to the non-trivial conditions. A first solution is to group df_b
by col
column ahead-of-time so to avoid an expensive complete iteration of df_b
(previously done with df_b[col] == relevant_val
). Then, the date of the precomputed groups can be sorted so to perform a fast binary search later. Then you can use Numpy to count boolean values efficiently (using np.sum
).
Note that doing prior_toys['Was_Sold']
is a bit faster than prior_toys.Was_Sold
.
Here is the resulting code:
cols = ['Color', 'Size_Group', 'Shape']
# Run this calculation for multiple features
for col in cols:
print(col ' - Started')
# Empty list to build up the calculation in
ratio_list = []
# Split df_b by col and sort each (indexed) group by date
colGroups = {grId: grDf.sort_values('Date_Made') for grId, grDf in df_b.groupby(col)}
# Start the iteration
for row in df_a.itertuples(index=False):
# Relevant values from df_a
relevant_val = getattr(row, col)
created_date = row.Date_Made
# df to keep the overall prior toy matches
curColGroup = colGroups[relevant_val]
prior_count = np.searchsorted(curColGroup['Date_Made'], created_date)
prior_toys = curColGroup[:prior_count]
# Now find the ones that were sold
prior_sold_count = prior_toys['Was_Sold'].values.sum()
# Now make the calculation and append to the list
if prior_count == 0:
ratio = 0
else:
ratio = prior_sold_count / prior_count
ratio_list.append(ratio)
# Store the calculation in the original df_a
df_a[col '_Prior_Sold_Ratio'] = ratio_list
print(col ' - Finished')
This is 5.5 times faster on my machine.
The iteration of the pandas dataframe is a major source of slowdown. Indeed, prior_toys['Was_Sold']
takes half the computation time because of the huge overhead of pandas internal function calls repeated Na
times... Using Numba may help to reduce the cost of the slow iteration. Note that the complexity can be increased by splitting colGroups
in subgroups ahead of time (O(Na log Nb)
). This should help to completely remove the overhead of prior_sold_count
. The resulting program should be about 10 time faster than the original one.