I have a dataframe with 3 million rows. I need to transform the values in a column. The column contains strings joined together with ";". The transformation involves breaking up the string into its components and then choosing one of the strings based on some priority rules.
Here is the sample dataset and the function:
import pandas as pd

data = {'Name': ['X1', 'X2', 'X3', 'X4', 'X5', 'X6'], 'category': ['CatA;CatB', 'CatB', None, 'CatB;CatC;CatA', 'CatA;CatB', 'CatB;CatD;CatB;CatC;CatA']}
sample_dataframe = pd.DataFrame(data)
def cat_name(x):
    if x:
        x = pd.Series(x.split(";"))
        y = x[(x != 'CatA') & x.notna()]
        custom_dict = {'CatC': 0, 'CatD': 1, 'CatB': 2, 'CatE': 3}
        if x.count() == 1:
            return x.iloc[0]
        elif y.count() > 1:
            y = y.sort_values(key=lambda x: x.map(custom_dict))
            if y.count() > 2:
                return '3 or more'
            else:
                return y.iloc[0] + ' '
        elif y.count() == 1:
            return y.iloc[0]
        else:
            return None
    else:
        return None
I am running the function on the column with the apply method:

test_data = sample_dataframe['category'].apply(cat_name)

For my dataset of 3 million rows, the function takes almost 10 minutes to run. How can I optimize the function to run faster?
Also, I have two sets of category rules, and the output category needs to be stored in two columns. Currently I am using the apply function twice. Kinda dumb and slow, I know, but it works.
Is there a way to run the function once with a different priority dictionary for each column and return two output values? I tried to use
test_data['CAT_NAME'], test_data['MAIN_CAT_NAME'] = zip(*sample_dataframe['category'].apply(joint_cat_name))
with the function
def joint_cat_name(x):
    cat_string = x
    if cat_string:
        string_series = pd.Series(cat_string.split(";"))
        y = string_series[(string_series != 'CatA') & string_series.notna()]
        custom_dict = {'CatB': 0, 'CatC': 1, 'CatD': 2, 'CatE': 3}
        if string_series.count() == 1:
            return string_series.iloc[0], string_series.iloc[0]
        elif y.count() > 1:
            y = y.sort_values(key=lambda x: x.map(custom_dict))
            if y.count() > 2:
                return '3 or more', y.iloc[0]
            elif y.count() == 1:
                return y.iloc[0] + ' ', y.iloc[0]
        elif y.count() == 1:
            return y.iloc[0], y.iloc[0]
        else:
            return None, None
    else:
        return None, None
But I got an error, TypeError: 'NoneType' object is not iterable, when the zip function encountered a tuple containing Nones, i.e. it threw an error when the output was (None, None).
Thanks a lot in advance.
CodePudding user response:
Your function does a lot of unnecessary work, for example building a pandas Series for every single row. Even if you just reorder some conditionals and stay with plain Python lists, it will run much faster.
custom_dict = {"CatC": 0, "CatD": 1, "CatB": 2, "CatE": 3}

def cat_name(x):
    if x is None:
        return x
    xs = x.split(";")
    if len(xs) == 1:
        return xs[0]
    ys = [x for x in xs if x != "CatA"]
    l = len(ys)
    if l == 0:
        return None
    if l == 1:
        return ys[0]
    if l == 2:
        return min(ys, key=lambda k: custom_dict[k]) + " "
    if l > 2:
        return "3 or more"
CodePudding user response:
Faster than running a Python function on each row might be to go through your dataframe multiple times and use an optimized, vectorized Pandas query each time. You'd have to rewrite your code something like this:
# select empty categories
no_cat = sample_dataframe['category'].isna()
# select category strings with only one category
single_cat = ~no_cat & (sample_dataframe['category'].str.count(";") == 0)
# get number of categories
num_cats = sample_dataframe['category'].str.count(";") + 1
three_or_more = num_cats > 2
# has a "CatA" category
has_cat_A = sample_dataframe['category'].str.contains("CatA", na=False)
# then also write these selected rows in a custom way
sample_dataframe["cat_name"] = ""
sample_dataframe.loc[no_cat, "cat_name"] = None
sample_dataframe.loc[single_cat, "cat_name"] = sample_dataframe.loc[single_cat, "category"]
sample_dataframe.loc[three_or_more, "cat_name"] = "3 or more"
# continue with however complex you want to get to cover more cases, e.g.
two_cats_no_cat_A = (num_cats == 2) & ~has_cat_A
# then handle only the remaining cases with the apply
not_handled = ~no_cat & ~single_cat & ~three_or_more
sample_dataframe.loc[not_handled, "cat_name"] = sample_dataframe.loc[not_handled, "category"].apply(cat_name)
Running these queries on 3 million rows should be plenty fast, even if you have to do a few of them and combine them. If it's still too slow, you can handle more special cases from the apply in the same vectorized fashion.
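For example, the two_cats_no_cat_A mask defined above can be resolved without any apply by splitting those rows into two columns and keeping the higher-priority category. A sketch, assuming the custom_dict priorities from the question are in scope (the trailing space mirrors what the original function appends in the two-category case):

if two_cats_no_cat_A.any():
    # split the two categories into separate columns, for the matching rows only
    two_cats = sample_dataframe.loc[two_cats_no_cat_A, 'category'].str.split(';', expand=True)
    prio0 = two_cats[0].map(custom_dict)
    prio1 = two_cats[1].map(custom_dict)
    # keep whichever category has the smaller (higher-priority) rank
    sample_dataframe.loc[two_cats_no_cat_A, 'cat_name'] = (
        two_cats[0].where(prio0 <= prio1, two_cats[1]) + ' '
    )

Once a case like this is covered vectorized, remember to exclude it from the not_handled mask so the final apply only sees the rows that are actually left over.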