I'm trying to build a machine learning algorithm for my job. The data I'm using for training and testing has 17k rows and 20 columns. I tried adding a new column computed from two existing columns, but the for loop I wrote is too slow (it takes 3 seconds to run):
for i in range(0, len(model_olculeri)):
    if (model_olculeri["Bel"][i] != 0) and (model_olculeri["Basen"][i] != 0):
        sum_column = (model_olculeri["Bel"][i]) / (model_olculeri["Basen"][i])
        model_olculeri["Waist to Hip Ratio"][i] = sum_column
I've read that vectorized pandas and numpy operations are much faster and more efficient than a for loop over a DataFrame. How can I vectorize my loop? Thanks a lot.
CodePudding user response:
Create a boolean mask and use it for filtering:
m = (model_olculeri["Bel"] != 0) & (model_olculeri["Basen"] != 0)
model_olculeri.loc[m,"Waist to Hip Ratio"] = model_olculeri.loc[m, "Bel"] / model_olculeri.loc[m,"Basen"]
Alternative (the right-hand side is aligned on the index, so only the rows selected by the mask are assigned):
model_olculeri.loc[m,"Waist to Hip Ratio"] = model_olculeri["Bel"] / model_olculeri["Basen"]
Or set the new values with numpy.where:
model_olculeri["Waist to Hip Ratio"] = np.where(m, model_olculeri["Bel"] / model_olculeri["Basen"], np.nan)
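For reference, here is a minimal self-contained sketch of the mask and numpy.where approaches above. The sample values are made up for illustration; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
model_olculeri = pd.DataFrame({
    "Bel": [70.0, 0.0, 80.0, 90.0],
    "Basen": [100.0, 95.0, 0.0, 120.0],
})

# True only where both measurements are non-zero.
m = (model_olculeri["Bel"] != 0) & (model_olculeri["Basen"] != 0)

# Mask approach: assign only the selected rows; the other rows are left as NaN.
model_olculeri.loc[m, "Waist to Hip Ratio"] = (
    model_olculeri.loc[m, "Bel"] / model_olculeri.loc[m, "Basen"]
)

# numpy.where approach: compute the ratio for every row,
# then keep it only where the mask is True and use NaN elsewhere.
ratio_where = np.where(m, model_olculeri["Bel"] / model_olculeri["Basen"], np.nan)

print(model_olculeri)
print(ratio_where)
```

Both variants fill the rows where either measurement is zero with NaN, so in this sample they produce the same column.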
CodePudding user response:
Potential duplicate: How to deal with "divide by zero" with pandas dataframes when manipulating columns?
model_olculeri['Waist to Hip Ratio'] = model_olculeri['Bel'].divide(model_olculeri['Basen'])
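Note that divide does not skip the zero rows the way the original loop does: a non-zero value divided by zero gives inf, 0 / 0 gives NaN, and a zero "Bel" with a non-zero "Basen" gives 0.0. A small sketch with made-up sample values (the `df`, `ratio` and `cleaned` names are only for illustration) that shows this and one possible way to turn the infinities into NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data covering the zero cases.
df = pd.DataFrame({
    "Bel": [70.0, 0.0, 80.0, 0.0],
    "Basen": [100.0, 95.0, 0.0, 0.0],
})

# Element-wise division; pandas returns inf or NaN instead of raising.
ratio = df["Bel"].divide(df["Basen"])

# Replace +/- infinity with NaN so that invalid rows are marked as missing.
cleaned = ratio.replace([np.inf, -np.inf], np.nan)

print(ratio)
print(cleaned)
```

If rows with a zero "Bel" should also be left empty, as in the loop, a boolean mask on both columns is still needed.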