Multiple list comprehension vs single for loop


I am trying to understand best practices for coding in Python. I have a pandas DataFrame whose columns contain strings or floats, and I am doing basic data management. I was wondering: can a single for loop be faster than many list comprehensions?

In my case the target DataFrame has 4 million or more rows and I would need about 10 list comprehensions, so speed is important. I have to decide whether to write everything inside one for loop or as many separate list comprehensions. Do you have suggestions?

for i in range(dataframe.shape[0]):
    try:  # Price dummy
        if dataframe["Price"].iloc[i] == "0":
            dataframe["Price_Dummy"].iloc[i] = 0
        else:
            dataframe["Price_Dummy"].iloc[i] = 1
    except:
        pass
    try:  # Transform everything to MB (middle unit)
        unit_of_measure = dataframe["Size"].iloc[i].split(" ")[-1].lower()
        size = float(dataframe["Size"].iloc[i].split(" ")[0])
        if unit_of_measure == "kb":
            dataframe["Size"].iloc[i] = size/1000
        elif unit_of_measure == "gb":
            dataframe["Size"].iloc[i] = size*1000
        else:
            dataframe["Size"].iloc[i] = size
    except:
        pass

(other 10 operations)

vs

the same in list comprehension
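For illustration, one of the loop's operations rewritten as a list comprehension might look like this (a sketch over plain lists; the values mirror the question's "Price" column):

```python
# Hypothetical "Price" values as plain strings
prices = ["0", "12.5", "0", "3.99"]

# Price dummy: 0 where the price string is "0", else 1
price_dummy = [0 if p == "0" else 1 for p in prices]
print(price_dummy)  # [0, 1, 0, 1]
```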

I have found this link: Single list iteration vs multiple list comprehensions

yet it doesn't say whether list comprehensions are always faster, independently of the number of iterations considered.

CodePudding user response:

I would try it without a loop, using np.where for the if-elif-else combinations. That's usually quite fast.

import numpy as np

# dataframe is a DataFrame containing data
# Now this:

dataframe["Price_Dummy"] = np.where(dataframe["Price"] == "0", 0, 1)

# String operations work on whole string columns as well
unit_of_measure = dataframe["Size"].str.split(" ", expand=True)[1].str.lower()

size = dataframe["Size"].str.split(" ", expand=True)[0].astype("float")

kb_case = np.where(unit_of_measure =="kb", size/1000, size)
dataframe["Size"] = np.where(unit_of_measure =="gb", size*1000, kb_case)

Notice that I replaced the [-1] in the unit_of_measure line with [1], because column selection on the expand=True result does not support -1 indexing. So you have to know at which position your unit ends up.

Information on splitting strings in DataFrames can be found here.
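If you do need negative indexing, one option is to split without expand=True: each row then holds a list, and .str[-1] on a Series of lists picks the last element. A sketch, assuming "Size" strings like "500 KB":

```python
import pandas as pd

# Toy "Size" column; names and values are illustrative only
sizes = pd.Series(["500 KB", "1.2 GB", "12 MB"], name="Size")

parts = sizes.str.split(" ")                 # each row is a list, e.g. ["500", "KB"]
unit_of_measure = parts.str[-1].str.lower()  # last element: "kb", "gb", "mb"
size = parts.str[0].astype("float")          # first element as float: 500.0, 1.2, 12.0
```

This way the unit is always the last token, even if some rows split into a different number of parts.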

In the last two lines, I reproduced the if-elif-else combination, which you have to build from the bottom up: the final result dataframe["Size"] equals size*1000 if the unit is gb. If not, it equals kb_case, which covers both the case where the unit is kb and all remaining cases.
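Putting it all together on a toy DataFrame (column names follow the question; the data is made up):

```python
import numpy as np
import pandas as pd

# Toy data shaped like the question's columns
df = pd.DataFrame({
    "Price": ["0", "4.99", "0"],
    "Size": ["500 kb", "1.2 gb", "12 mb"],
})

# Price dummy: 0 where the price string is "0", else 1
df["Price_Dummy"] = np.where(df["Price"] == "0", 0, 1)

# Split "value unit" strings into two columns
unit = df["Size"].str.split(" ", expand=True)[1].str.lower()
size = df["Size"].str.split(" ", expand=True)[0].astype("float")

# Nested np.where, built from the bottom up
kb_case = np.where(unit == "kb", size / 1000, size)
df["Size"] = np.where(unit == "gb", size * 1000, kb_case)

print(df)
# Price_Dummy: [0, 1, 0]; Size in MB: [0.5, 1200.0, 12.0]
```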
