How can I speed up my Python for loop that goes through rows of lists?

This is my current code:

df['company_id'] = ''
length = 0
while length < len(df):
  for x in df:
    if df['associations.companies.results'][length] == 'nan':
      df.loc[df['associations.companies.results'] == 'nan', 'company_id'] = 0  
    else:
      df['company_id'][length] = df['associations.companies.results'][length][0]['id']
  length = length + 1

I tried lambda and np.where versions of this code, but they gave errors that I couldn't solve. The data set has close to 40 rows, and I am trying to get the company ID out of a dict nested in a list. Each row looks like this:

[{'id': 'XXXXXXXXXX', 'type': 'call_to_company'}]

Sometimes there is no company_id, and the row looks like this:

nan

The final result would be a separate column called "company_id" that contains the 'id' value.

Right now the code has been running for 30 minutes and is still going strong.

I hope someone can help. Thanks!

CodePudding user response：

There are various improvements that you could make, but I'm still not entirely sure what kind of output you are expecting.

First, you call len() on every iteration because it sits in the header of the while loop. It only needs to run once, so compute it before the loop or let range() handle it.

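Second, you have a nested loop (I think because you wanted to iterate over both the indexes and the elements). This is a big problem: the body runs again for every pass of the inner loop, and because the df.loc line rescans the whole column each time it runs, the cost can grow as O(n^2) instead of O(n). Note that for x in df iterates over column labels, not rows, so it does not give you the elements anyway. Iterating over the indexes alone is enough: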

df['company_id'] = ''

for i in range(len(df)):
    value = df.at[i, 'associations.companies.results']
    # Assumes missing entries are stored as the string 'nan', as in the question,
    # and that df has the default 0..n-1 index.
    if value == 'nan':
        df.at[i, 'company_id'] = 0
    else:
        # Take the 'id' of the first dict in the list.
        df.at[i, 'company_id'] = value[0]['id']

I'm sure this could be further improved with a list comprehension or DataFrame .apply(), but I still don't understand your goal, so this is the most I can do.
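A minimal sketch of the list-comprehension idea, assuming each cell is either a list of dicts with an 'id' key or a missing value (the string 'nan' or a float NaN). The helper name first_company_id and the sample values are hypothetical, made up for illustration:

```python
def first_company_id(cell, default=0):
    """Return the 'id' of the first association, or `default` if there is none.

    Hypothetical helper: anything that is not a non-empty list (for example
    the string 'nan' or a float NaN) is treated as "no company".
    """
    if isinstance(cell, list) and cell:
        return cell[0]['id']
    return default


# Made-up sample values in the shape described in the question.
cells = [
    [{'id': 'A1', 'type': 'call_to_company'}],
    float('nan'),
    'nan',
    [{'id': 'B2', 'type': 'call_to_company'}],
]

# One pass over the cells, with no counter variable and no nested loop.
company_ids = [first_company_id(cell) for cell in cells]
print(company_ids)  # ['A1', 0, 0, 'B2']
```

With the DataFrame from the question, the same comprehension could be assigned straight to the column, for example df['company_id'] = [first_company_id(c) for c in df['associations.companies.results']]. Passing the helper to .apply() on that column should give the same values.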

If you have never heard of Big-O notation, I recommend reading up on it.
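To make the Big-O point concrete, here is a small sketch that only counts how often a loop body runs. The function names are hypothetical, and the nested version assumes the inner loop also walks over all n items:

```python
def single_pass_steps(n):
    """Count body executions for one loop over n items: grows as O(n)."""
    steps = 0
    for _ in range(n):
        steps += 1
    return steps


def nested_pass_steps(n):
    """Count body executions for a loop inside a loop over n items: O(n^2)."""
    steps = 0
    for _ in range(n):
        for _ in range(n):
            steps += 1
    return steps


for n in (10, 100, 1000):
    print(n, single_pass_steps(n), nested_pass_steps(n))
```

Multiplying the input size by 10 multiplies the single-pass count by 10, but the nested count by 100.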

CodePudding user response：

Hope I understood your use case, so here is my idea:

Try using a for loop with enumerate()! With this, you can avoid having a counter variable entirely.

Like so:

df['company_id'] = ''
# Enumerate the column's values; enumerate(df) would yield column labels instead.
# Assumes df has the default 0..n-1 index, so i is also the row label.
for i, value in enumerate(df['associations.companies.results']):
    if value == 'nan':
        # No company for this row.
        df.at[i, 'company_id'] = 0
    else:
        df.at[i, 'company_id'] = value[0]['id']

Sadly, your code is not reproducible, so I hope I understood it correctly.