I am trying to understand best practices for coding in Python. I have a pandas DataFrame and need to work on columns that contain strings or floats. I am doing basic data management, and I was wondering whether a single for loop can be faster than many list comprehensions.
In my case the target DataFrame has 4 million or more rows and I would have about 10 list comprehensions, so speed is important. I have to decide whether to put everything inside one for loop or write many list comprehensions. Do you have suggestions?
for i in range(dataframe.shape[0]):
    try:  # Price dummy
        if dataframe["Price"].iloc[i] == "0":
            dataframe["Price_Dummy"].iloc[i] = 0
        else:
            dataframe["Price_Dummy"].iloc[i] = 1
    except:
        pass
    try:  # Transform everything in MB (middle unit)
        unit_of_measure = dataframe["Size"].iloc[i].split(" ")[-1].lower()
        size = float(dataframe["Size"].iloc[i].split(" ")[0])
        if unit_of_measure == "kb":
            dataframe["Size"].iloc[i] = size / 1000
        elif unit_of_measure == "gb":
            dataframe["Size"].iloc[i] = size * 1000
        else:
            dataframe["Size"].iloc[i] = size
    except:
        pass
    # (other 10 operations)
vs
the same in list comprehension
I found this link: Single list iteration vs multiple list comprehensions.
However, it doesn't say whether list comprehensions are always faster regardless of the number of iterations.
CodePudding user response:
I would try it without a loop, using np.where clauses for the if-elif-else combinations. That's usually pretty fast.
import numpy as np
# dataframe is a DataFrame containing data
# Now this:
dataframe["Price_Dummy"] = np.where(dataframe["Price"] == "0", 0, 1)
# String operations work on whole string columns as well
unit_of_measure = dataframe["Size"].str.split(" ", expand=True)[1].str.lower()
size = dataframe["Size"].str.split(" ", expand=True)[0].astype("float")
kb_case = np.where(unit_of_measure =="kb", size/1000, size)
dataframe["Size"] = np.where(unit_of_measure =="gb", size*1000, kb_case)
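Notice that I replaced the [-1] in the unit_of_measure = line with [1]. With expand=True the split returns a DataFrame whose columns are labelled 0, 1, ..., so [-1] is looked up as a column label rather than as a position and fails. You therefore have to know at which position your unit ends up.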
Information on splitting strings in DataFrames can be found here.
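If the position of the unit is not known in advance, one option is to skip expand=True and index into the split lists with .str[-1], which does accept negative positions. The following is only a sketch: the sample values are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample data, not taken from the original question.
sizes = pd.Series(["512 kB", "1.5 GB", "20 MB"])

# Without expand=True, each element becomes a list of tokens.
tokens = sizes.str.split(" ")

# .str[...] indexes into each list, so -1 picks the last token.
unit_of_measure = tokens.str[-1].str.lower()
size = tokens.str[0].astype(float)

print(unit_of_measure.tolist())
print(size.tolist())
```

This splits the column once and reuses the result for both the number and the unit.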
In the last two lines, I reproduced the if-elif-else combination, which you have to build from the bottom up: your final result dataframe["Size"] equals size*1000 if the unit is gb. If not, it equals kb_case, which covers the case where the unit is kb as well as all other cases.
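With around ten branches, nesting np.where calls can get hard to read. A possible alternative, which is a swapped-in technique and not part of the answer above, is np.select: it takes the conditions and their results in top-down order, plus a default for the else case. The sample data below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the original question.
dataframe = pd.DataFrame({"Size": ["512 kB", "1.5 GB", "20 MB"]})

tokens = dataframe["Size"].str.split(" ")
unit_of_measure = tokens.str[-1].str.lower()
size = tokens.str[0].astype(float)

# Conditions are checked in order; the first match wins, like if/elif.
conditions = [unit_of_measure == "kb", unit_of_measure == "gb"]
choices = [size / 1000, size * 1000]

# default plays the role of the final else branch (value already in MB).
dataframe["Size"] = np.select(conditions, choices, default=size)

print(dataframe)
```

Note that, unlike the loop with try/except, this sketch raises an error if a value cannot be converted to float, so malformed rows would need separate handling.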