import pandas as pd

data = {'valueA': [-4.213762047, -1.408432067, -0.557258073, -0.213845329, -6.037197312, -0.454789086, -4.816195544,
-1.488329919, -4.816195544, -2.301026382, -2.170823094, -5.634475979, -1.189035254, -6.04771587,
-5.154490843, -0.667055058, -0.777983333, -0.411221206, -4.213762047, -1.545829201, -4.885603295,
-4.816195544, -5.349995991, -0.180029018, -5.348722537, -1.192672691, -1.771607678, -2.261599892,
-1.101539885, -1.576736607],
'valueB': [0.443326304, -5.804335179, -4.45399397, -3.856127364, -5.633162559, 0.08647866, -0.63803755,
-0.152266115, -0.211527698, -1.591288764, -0.363876102, -1.824133039, -0.363876102, -0.196084931,
-6.038875528, -0.815608686, -3.108092785, -4.213762047, 0.370978551, -4.20366352, -4.213762047,
-0.200344922, 0.359609993, -2.130887094, 0.391159162, -3.156335855, -5.446335778, -4.213762047,
-4.530705882, -5.227893129]}
dataMean = -0.16243941958211103
dataSTDev = 0.0013394450126162961
dataSet = pd.DataFrame(data)
dataSet['rowValueFlag'] = [
    row['valueA'] + row['valueB'] <= dataMean + 2 * dataSTDev
    for item, row in dataSet.iterrows()
]
I use the above loop to go over a DataFrame and check that the sum of two items ('valueA' and 'valueB') falls at or below 2 standard deviations above the mean of my data. This loop then assigns the True and False values to a new column in the DataFrame called 'rowValueFlag'.
My main problem is that my datasets have about 500k rows, and this loop is prohibitively slow. To make sure everything was actually working, I cut my dataset down to only 10k rows, and the operation still took about 10 seconds to complete.
I know that using .iterrows() is inefficient, but I don't know how to apply any of the faster alternatives for what I am trying to do. Can anyone provide a more efficient alternative to the approach that I am using?
Thank you so much for the help! The vectorized code runs much faster, but I was wondering: is it possible to do something like this with and/or conditions? For example, could I change this iterrows() list comprehension...
dataSet['rowValueFlagB'] = [
    (row['valueA'] <= dataMean + 2 * dataSTDev
     or row['valueB'] <= dataMean + 2 * dataSTDev)
    for item, row in dataSet.iterrows()
]
into a vectorized operation like this?
# Not functional code. Results in a ValueError
dataSet['rowValueFlagB'] = (dataSet['valueA'] <= dataMean + 2 * dataSTDev) or (dataSet['valueB'] <= dataMean + 2 * dataSTDev)
CodePudding user response:
If you want speed, use vectorized code:
dataSet["rowValueFlag"] = dataSet["valueA"] + dataSet["valueB"] <= dataMean + 2 * dataSTDev
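For the follow-up about and/or conditions: Python's `or` and `and` need a single truth value, which is why applying them to whole Series raises a ValueError. The element-wise equivalents in pandas are `|` and `&`. Here is a minimal sketch, assuming the same column names and constants as in the question; the `threshold` variable, the `rowValueFlagC` column, and the last row of the sample frame are made up for illustration.

```python
import pandas as pd

# Three value pairs from the question's data plus one hypothetical row
# (0.1, 0.2), added only so that the result contains a False.
dataSet = pd.DataFrame({
    "valueA": [-4.213762047, -1.408432067, -0.454789086, 0.1],
    "valueB": [0.443326304, -5.804335179, 0.08647866, 0.2],
})
dataMean = -0.16243941958211103
dataSTDev = 0.0013394450126162961

# Compute the cutoff once instead of repeating the expression.
threshold = dataMean + 2 * dataSTDev

# Element-wise OR: "|" replaces "or". The parentheses are required because
# "|" binds more tightly than "<=".
dataSet["rowValueFlagB"] = (dataSet["valueA"] <= threshold) | (dataSet["valueB"] <= threshold)

# Element-wise AND works the same way with "&".
dataSet["rowValueFlagC"] = (dataSet["valueA"] <= threshold) & (dataSet["valueB"] <= threshold)

print(dataSet)
```

Both comparisons produce boolean Series, so the combined result can be assigned directly as a new column, just like the single-condition version.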