Is there a better way than a nested loop to iterate over 30,000 rows?


I would like to find the rows in the following table that contain repeated email addresses. In the code below I add an extra column to the dataframe with the value 'ja' whenever an email address is repeated. This works fine for a small number of rows (150), but for a large number of rows (30,000) the script hangs. Are there better ways to loop over the rows?

import pandas as pd

data = {'Name': ['Danny', 'Damny', 'Monny', 'Quony', 'Dimny', 'Danny'],
        'Email': ['[email protected]', '[email protected]', '[email protected]',
                  '[email protected]', '[email protected]', '[email protected]']}
df = pd.DataFrame(data)
df['email_repeated'] = None
col_email = df.columns.get_loc("Email")
row_count = len(df.index)
# compare every row against every other row: O(n^2) comparisons
for i in range(row_count):
    emailadres = df.iloc[i, col_email]  # look this up once per outer row
    for k in range(row_count):
        if k != i and emailadres == df.iloc[k, col_email]:
            # use .loc rather than chained indexing, which may assign to a copy
            df.loc[k, 'email_repeated'] = 'ja'

CodePudding user response:

df.duplicated('Email', keep=False)

computes exactly what you want, in boolean form.
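
A brief note on the keep argument (using the df from the question; the variable names here are just for illustration): keep=False flags every occurrence of a repeated address, which is what you want here.

mask_all = df.duplicated('Email', keep=False)  # True for *every* row whose address occurs more than once
mask_later = df.duplicated('Email')            # keep='first' is the default: the first occurrence stays False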

If you insist on having 'ja'/None, you can keep your initial column creation:

df['email_repeated'] = None
df.loc[df.duplicated('Email', keep=False), 'email_repeated'] = 'ja'

As for the literal question (is there a better way to iterate over pandas rows?), generally speaking, the answer is "not to". The better way to iterate is to avoid iteration at all costs. Of course, there is iteration somewhere: duplicated surely iterates over the rows. But it does so inside pandas code, in C, not inside your interpreted Python code. It is very rare that you really need loops over a dataframe, and it is a good attitude to think "if I am iterating over pandas rows, then I am doing something wrong". Even very convoluted "non-iterations" (I mean, a succession of operations to achieve the result, when the algorithm would seem straightforward using loops) are generally preferable to for loops.

In this case, it is not even convoluted (there is a function dedicated exactly to your task). But even answers consisting of merging the dataframe with itself to find duplicates, or things like that, would probably be far faster than anything with a for loop.
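
For illustration, a sketch of that self-merge idea, reusing the df from the question (not the recommended approach, just to show that even this beats a Python-level loop):

# pair every row with every row sharing the same address (a hash join done in C)
pairs = df.reset_index().merge(df.reset_index(), on='Email')
# keep only pairs of *different* rows, i.e. genuine repeats
dup_idx = pairs.loc[pairs['index_x'] != pairs['index_y'], 'index_x'].unique()
df['email_repeated'] = None
df.loc[dup_idx, 'email_repeated'] = 'ja'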

CodePudding user response:

You can use groupby with transform('count'):

df['cnt'] = df.groupby('Email').Email.transform('count')
df['email_repeated'] = df.cnt > 1
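
If you want the 'ja'/None labels from the question instead of booleans, one option (a sketch) is to map the mask afterwards:

df['email_repeated'] = (df.cnt > 1).map({True: 'ja', False: None})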

CodePudding user response:

As a one-liner with numpy.where:

import numpy as np

df['email_repeated'] = np.where(df.duplicated("Email", keep=False), "ja", None)

CodePudding user response:

Your method is O(n²). For the case of 30,000 rows, it needs to do approximately 900,000,000 comparisons.

A quicker method would be to sort the data by email address; then, to check whether an address is repeated, you only have to compare it with the addresses in the immediately preceding and following rows. If you use a fast sort method, this is O(n log n).
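
A minimal pandas sketch of that idea, reusing the df from the question (sort_values does the O(n log n) sort; the neighbour comparison is vectorised):

# sort the addresses while keeping the original row labels
s = df.sort_values('Email')['Email']
# an address is repeated iff it equals its neighbour before or after it in sorted order
repeated = s.eq(s.shift()) | s.eq(s.shift(-1))
# align back to the original row order and restore the 'ja'/None labels
df['email_repeated'] = repeated.sort_index().map({True: 'ja', False: None})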
