If condition in for loop messing up pandas dataframe

Time:11-05

I am trying to mutate a data frame in a for loop based on an if condition.

import pandas as pd
#read the read.csv file

def write_csv(email, info):
    df1 = pd.read_csv('/Users/thavas/Downloads/write.csv')
    #iterate over df1 and look for email
    for index, row in df1.iterrows():
        print(index)
        if row['Email 1'] == email:
            #change the value of column 8 to "hello"
            df1.loc[index,'Email 2'] = "hello"
        df1.to_csv('/Users/thavas/Downloads/out.csv')
            

                
def read_csv():
    #iterate over rows in dataframe
    df = pd.read_csv('/Users/thavas/Downloads/read.csv')
    for index,rows in df.iterrows():
        email = rows['Email']
        #if email available
        if email != 'nan':
            #get column 8,9,10,11 of rows
            info = rows[8:17]
            write_csv(email, info)
        
        else:
            print("Users", rows['Contact'], "has no email")
            
read_csv()

However, with the if statements in place, no data is added to the csv file.

While debugging, I put print statements inside the if statements and got many outputs, so getting into the if statements is not the problem.

Additionally, after taking out all the if statements, I do see output in my csv file. What could be going wrong?

UPDATE: I noticed that only the data from the last loop iteration is being written to my out file.

What does this mean?

CodePudding user response:

I've simplified the question to make my explanation easier. Here read.csv has only three columns (eid, name, and email) and write.csv has two (email1 and email2).

The goal is to create out.csv, which is a copy of write.csv with the email2 field filled in for every email that also exists in read.csv.

The following code matches most of your logic, and I'll use it to highlight some issues.

import pandas as pd

# printing just so you can see the structure explained above
rdf = pd.read_csv('./read.csv')
wdf = pd.read_csv('./write.csv')

print("read.csv")
print(rdf)

print("write.csv")
print(wdf)


def write_csv(email):
    df1 = pd.read_csv('./write.csv')
    #iterate over df1 and look for email
    for index, row in df1.iterrows():
        if row['email1'] == email:
            #change the value of column 8 to "hello"
            df1.loc[index,'email2'] = "hello"
        df1.to_csv('./out.csv')           # <---- Reference 2
        #              ^ Reference 3

def read_csv():
    #iterate over rows in dataframe
    df = pd.read_csv('./read.csv')
    for index,rows in df.iterrows():
        email = rows['email']
        #if email available
        if email != 'nan': # <-- Reference 1
            write_csv(email) # <-- Reference 4

read_csv()

And the output

(stackoverflow) /tmp $ python test.py

read.csv
   eid    name                   email
0    1     Tom   [email protected]
1    2   Alice
2    3     Bob   [email protected]
3    4     Tim

write.csv
                  email1           email2
0  [email protected]   email2-unknown
1  [email protected]   email2-unknown

(stackoverflow) /tmp $ cat out.csv
,email1,email2
0,[email protected], email2-unknown
1,[email protected], email2-unknown

Reference 1:

This really depends on your csv file and what you expect to find there. In my case the file has nothing ('') for missing emails, so pandas makes them NaN. With its default settings, pd.read_csv does the same for a field that contains the literal text nan. NaN is a float, not the string 'nan', so you can't compare it directly with 'nan': email != 'nan' is True for every row, including the rows that have no email.

Possible solutions

  1. Compare the string form. str(email) == 'nan' is True for a missing value because str(nan) is 'nan', so the check in the code becomes if str(email) != 'nan'.
  2. Test for NaN explicitly with isinstance(email, float) and np.isnan(email), where np.isnan is a function from numpy. It only works on floats, and giving it a string will cause an error, hence the isinstance guard.
  3. Convert the entire column to string at the beginning and then iterate over the rows: pd.read_csv('./read.csv').astype({'email':'str'}). You can then compare against 'nan' like before. This relies on astype turning NaN into the string 'nan', which may depend on your pandas version.
  4. Only iterate over non-NaN values, i.e. for index, rows in df[~df.email.isna()].iterrows(), where df.email.isna() checks if something is NaN and the ~ negates that. So you get all rows where email is not NaN.

With any of these, write_csv is called only for rows that actually have an email.
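To make the NaN behaviour concrete, here is a minimal sketch using an in-memory CSV. The sample addresses are made up for illustration and only stand in for read.csv. It shows that a missing email never equals the string 'nan', and that pd.isna / notna (not used in the code above, but part of pandas) gives a check that works for both strings and NaN.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for read.csv; Alice has no email.
csv_text = (
    "eid,name,email\n"
    "1,Tom,tom@example.com\n"
    "2,Alice,\n"
    "3,Bob,bob@example.com\n"
)
df = pd.read_csv(io.StringIO(csv_text))

missing = df.loc[1, "email"]

# The original check: NaN is never equal to the string 'nan',
# so this is True even though the email is missing.
always_true = missing != "nan"

# pd.isna works for any scalar, string or float.
is_missing = pd.isna(missing)

# Keep only the rows that actually have an email.
emails = df[df["email"].notna()]["email"].tolist()
print(always_true, is_missing, emails)
```

Filtering the frame once, as in the last step, avoids having to repeat the check inside the loop.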

Reference 2:

Writing out.csv on every iteration is costly. If write.csv has 1000 rows, you'll be writing out.csv 1000 times. Instead, write out.csv only once, after the for loop has finished.
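As a rough sketch of the single-write idea, the function below updates all matching rows and then writes the file once. The name mark_email, the temporary paths, and the sample addresses are assumptions made for this example. It also swaps the row loop for a boolean mask, which is a different technique from the iterrows loop above but produces the same update.

```python
import os
import tempfile

import pandas as pd


def mark_email(in_path, out_path, email):
    """Set email2 to 'hello' where email1 matches, then write the file once."""
    df1 = pd.read_csv(in_path)
    # The boolean mask selects every matching row in one step.
    df1.loc[df1["email1"] == email, "email2"] = "hello"
    # Single write, after all rows have been updated.
    df1.to_csv(out_path, index=False)
    return df1


# Hypothetical sample data standing in for write.csv.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "write.csv")
out_path = os.path.join(tmp_dir, "out.csv")
pd.DataFrame(
    {
        "email1": ["tom@example.com", "bob@example.com"],
        "email2": ["email2-unknown", "email2-unknown"],
    }
).to_csv(in_path, index=False)

mark_email(in_path, out_path, "tom@example.com")
result = pd.read_csv(out_path)
print(result)
```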

Reference 3 and 4:

When you find a non-empty email in read.csv, write_csv is called for that specific email. For example, let's take the first row, [email protected]. write_csv reads write.csv, modifies all rows where the email is tom..., and then writes the result to out.csv.

Except that when it moves on to the next non-empty email, bob..., write_csv reads write.csv again, and write.csv does not have the changes you made for tom. Writing that result overwrites out.csv, so the tom update is lost. This explains why you'll only see the changes for the last email.

Possible solutions:

  1. A quick fix is to first copy write.csv to out.csv (before read_csv is called) and modify write_csv to always read from out.csv and write back to out.csv.
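Another way to keep earlier changes, sketched here under the assumption that the data fits in memory, is to load the target frame once and apply every email to that same DataFrame before writing. The helper name mark_emails and the sample addresses are made up for illustration.

```python
import pandas as pd


def mark_emails(target, emails, value="hello"):
    """Return a copy of target with email2 set to value for each given email."""
    out = target.copy()
    for email in emails:
        # Every update lands on the same DataFrame, so earlier changes persist.
        out.loc[out["email1"] == email, "email2"] = value
    return out


# Hypothetical sample data standing in for write.csv.
wdf = pd.DataFrame(
    {
        "email1": ["tom@example.com", "bob@example.com", "carol@example.com"],
        "email2": ["email2-unknown"] * 3,
    }
)

updated = mark_emails(wdf, ["tom@example.com", "bob@example.com"])
print(updated)
# updated.to_csv("./out.csv", index=False)  # write once, at the very end
```

Because nothing is re-read from disk between emails, there is no stale copy that can overwrite an earlier update.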

Putting it all together:

import shutil

import pandas as pd

rdf = pd.read_csv('./read.csv')
wdf = pd.read_csv('./write.csv')

print("read.csv")
print(rdf)

print("write.csv")
print(wdf)


def write_csv(email):
    df1 = pd.read_csv('./out.csv') # read from out
    # iterate over df1 and look for email
    for index, row in df1.iterrows():
        if row['email1'] == email:
            # mark the matching row
            df1.loc[index,'email2'] = "hello"

    # outside for loop
    # write to out, so next call with different email also persists
    # index=False stops the row index from being saved as an extra
    # column each time out.csv is read and written again
    df1.to_csv('./out.csv', index=False)

def read_csv():
    # iterate over rows in dataframe
    df = pd.read_csv('./read.csv')
    # iterate over rows with non-nan emails
    for index,rows in df[~df.email.isna()].iterrows():
        # I've removed the info part with other columns but you get the idea...
        email = rows['email']
        write_csv(email)

shutil.copyfile('./write.csv', './out.csv') # first copy to out
read_csv()

The above will give you a working solution but is still quite inefficient since you are writing out.csv for every single email that exists.

Instead, think about creating out.csv by merging/joining the dataframes from read.csv and write.csv. See the pandas documentation for DataFrame.merge and DataFrame.join, and maybe also look at some examples from Stack Overflow questions on pandas joins and merges.
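A minimal sketch of the merge idea, using small in-memory frames in place of read.csv and write.csv (the addresses are made up for illustration). It assumes a left merge with indicator=True fits your data, and it drops duplicate emails first so the merge cannot multiply rows.

```python
import pandas as pd

# Hypothetical sample data standing in for read.csv and write.csv.
rdf = pd.DataFrame(
    {
        "eid": [1, 2, 3, 4],
        "name": ["Tom", "Alice", "Bob", "Tim"],
        "email": ["tom@example.com", None, "bob@example.com", None],
    }
)
wdf = pd.DataFrame(
    {
        "email1": ["tom@example.com", "carol@example.com", "bob@example.com"],
        "email2": ["email2-unknown"] * 3,
    }
)

# Unique, non-missing emails from the read frame.
known = rdf.dropna(subset=["email"])[["email"]].drop_duplicates()

# A left merge keeps every row of wdf; the _merge column says whether
# a matching email was found in known ('both') or not ('left_only').
merged = wdf.merge(
    known, how="left", left_on="email1", right_on="email", indicator=True
)
merged.loc[merged["_merge"] == "both", "email2"] = "hello"

out = merged[["email1", "email2"]]
print(out)
# out.to_csv("./out.csv", index=False)  # single write
```

This does the whole update without any Python-level loop and writes the output only once.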
