Alternative to iterating dataframe while keeping track of the current index


I am working with large DataFrames and have noticed that iterating through each one with df.iterrows() takes a long time. Currently, I iterate over the rows of a DataFrame, extract the values of certain columns, and multiply them by some predefined weights. I then compute a confidence level and, if it is greater than a certain threshold, append the row's index to the list indices. Here is a simple example of what I mean:

import pandas as pd

attributes = ['attr1', 'attr2', 'attr3']
d = {'attr1': [1, 2], 'attr2': [3, 4], 'attr3' : [5, 6], 'meta': ['foo', 'bar']}
df = pd.DataFrame(data=d)

indices = []
threshold = 0.5
weights = [0.3, 0.3, 0.4]
for index, row in df.iterrows():
  results = []
  for attr in attributes:
    if attr == 'attr1':
      results.append(row[attr] * 5)
    else:
      results.append(row[attr])
  confidence_level = sum(x * y for x, y in zip(results, weights)) / len(results)
  if confidence_level >= threshold:
    indices.append(index)

My question is whether there is a way to eliminate the outer loop while still keeping track of the indices in the DataFrame. The inner loop should, if possible, remain as it is, since it contains a condition.

CodePudding user response:

That's perfectly vectorizable:

weights = [0.3, 0.3, 0.4]
weighted_attrs = df[attributes] * weights / len(weights)
# honestly, it'd be more logical to adjust the weights instead
weighted_attrs['attr1'] *= 5
confidence_levels = weighted_attrs.sum(axis=1)
indices = df.index[confidence_levels >= threshold]
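Following the comment in the snippet above, here is a minimal sketch of the adjusted-weights variant: instead of multiplying the 'attr1' column by 5 after weighting, fold that factor into the weight itself. Using a Series for the weights lets pandas align them by column label rather than by position (the toy data and threshold are taken from the question):

```python
import pandas as pd

d = {'attr1': [1, 2], 'attr2': [3, 4], 'attr3': [5, 6], 'meta': ['foo', 'bar']}
df = pd.DataFrame(data=d)
attributes = ['attr1', 'attr2', 'attr3']
threshold = 0.5

# Fold the special-case factor for 'attr1' into its weight up front
weights = pd.Series({'attr1': 0.3 * 5, 'attr2': 0.3, 'attr3': 0.4})

# Multiplication aligns the Series index with the DataFrame columns
confidence_levels = (df[attributes] * weights).sum(axis=1) / len(weights)
indices = df.index[confidence_levels >= threshold]
```

This produces the same confidence levels as the two-step version, since the constant factor commutes with the weighting and the sum.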

CodePudding user response:

Iterating over pandas DataFrames is indeed slow and should be avoided. You could use df.apply() with axis=1 to apply a function to each row. If you use it to compute the confidence level per row and then select only the rows whose confidence level is above the threshold, you get what you want.
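A minimal sketch of that apply-based approach, reusing the toy data, weights, and threshold from the question. Note that apply with axis=1 still calls a Python function once per row, so it is mainly a readability improvement over iterrows(); the vectorized approach in the other answer avoids per-row Python calls entirely:

```python
import pandas as pd

d = {'attr1': [1, 2], 'attr2': [3, 4], 'attr3': [5, 6], 'meta': ['foo', 'bar']}
df = pd.DataFrame(data=d)
attributes = ['attr1', 'attr2', 'attr3']
weights = [0.3, 0.3, 0.4]
threshold = 0.5

def confidence(row):
    # Same per-row logic as the question's inner loop
    results = [row[a] * 5 if a == 'attr1' else row[a] for a in attributes]
    return sum(r * w for r, w in zip(results, weights)) / len(results)

confidence_levels = df.apply(confidence, axis=1)
indices = df.index[confidence_levels >= threshold]
```

The row index is preserved by apply, so no manual index bookkeeping is needed.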
