Faster alternative to DataFrame.iterrows()

import pandas as pd

data = {'valueA': [-4.213762047, -1.408432067, -0.557258073, -0.213845329, -6.037197312, -0.454789086, -4.816195544,
                   -1.488329919, -4.816195544, -2.301026382, -2.170823094, -5.634475979, -1.189035254, -6.04771587,
                   -5.154490843, -0.667055058, -0.777983333, -0.411221206, -4.213762047, -1.545829201, -4.885603295,
                   -4.816195544, -5.349995991, -0.180029018, -5.348722537, -1.192672691, -1.771607678, -2.261599892,
                   -1.101539885, -1.576736607],
        'valueB': [0.443326304, -5.804335179, -4.45399397, -3.856127364, -5.633162559, 0.08647866, -0.63803755,
                   -0.152266115, -0.211527698, -1.591288764, -0.363876102, -1.824133039, -0.363876102, -0.196084931,
                   -6.038875528, -0.815608686, -3.108092785, -4.213762047, 0.370978551, -4.20366352, -4.213762047,
                   -0.200344922, 0.359609993, -2.130887094, 0.391159162, -3.156335855, -5.446335778, -4.213762047,
                   -4.530705882, -5.227893129]}

dataMean = -0.16243941958211103
dataSTDev = 0.0013394450126162961
dataSet = pd.DataFrame(data)


dataSet['rowValueFlag'] = [
    row['valueA'] + row['valueB'] <= dataMean + 2 * dataSTDev
    for item, row in dataSet.iterrows()
]

I use the above loop to go over a DataFrame and check that the sum of two items ('valueA' and 'valueB') falls at or below 2 standard deviations above the mean of my data. This loop then assigns the True and False values to a new column in the DataFrame called 'rowValueFlag'.

My main problem is that my datasets have 500k rows, so this loop is prohibitively slow. To make sure everything was actually working, I cut my dataset down to only 10k rows, and the operation still took about 10 seconds to complete.

I know that using .iterrows() is inefficient, but I don't know how to apply any of the faster alternatives for what I am trying to do. Can anyone provide a more efficient alternative to the approach that I am using?

Thank you so much for the help!! The vectorized code runs much faster, but I was wondering: is it possible to do something like this using and/or statements? For example, could I change this iterrows() list comprehension...

dataSet['rowValueFlagB'] = [
    (row['valueA'] <= dataMean + 2 * dataSTDev
     or row['valueB'] <= dataMean + 2 * dataSTDev)
    for item, row in dataSet.iterrows()
]

into a vectorized operation like this?

# Not functional code. Results in a ValueError
dataSet['rowValueFlagB'] = (dataSet['valueA'] <= dataMean + 2 * dataSTDev) or (dataSet['valueB'] <= dataMean + 2 * dataSTDev)

Answer:

If you want speed, use vectorized code:

dataSet["rowValueFlag"] = dataSet["valueA"] + dataSet["valueB"] <= dataMean + 2 * dataSTDev
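
As for the and/or follow-up: Python's `or` and `and` ask each operand for a single truth value, and a pandas Series of booleans refuses to give one, which is where the ValueError comes from. The element-wise equivalents are `|` (or), `&` (and) and `~` (not). Here is a minimal sketch of how that might look. The three-row frame is made up for illustration and `threshold` is just a helper name I introduced; only the mean and standard deviation come from the question.

```python
import pandas as pd

# Threshold values taken from the question; `threshold` is a hypothetical helper name.
dataMean = -0.16243941958211103
dataSTDev = 0.0013394450126162961
threshold = dataMean + 2 * dataSTDev

# Small made-up frame for illustration only (not the question's full data).
dataSet = pd.DataFrame({
    "valueA": [-4.213762047, -0.10, 0.50],
    "valueB": [0.443326304, 0.25, -0.20],
})

# `|` is the element-wise OR for boolean Series. The parentheses are required
# because `|` binds more tightly than `<=`.
dataSet["rowValueFlagB"] = (dataSet["valueA"] <= threshold) | (dataSet["valueB"] <= threshold)

# Row-by-row version from the question, kept here only to check the result.
expected = [
    (row["valueA"] <= threshold or row["valueB"] <= threshold)
    for _, row in dataSet.iterrows()
]

print(dataSet["rowValueFlagB"].tolist() == expected)
```

For an "and" condition, swap `|` for `&` in the same expression. Leaving out the parentheses around each comparison is the usual way this goes wrong.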