Python - Better way to compare 2 csv files and add 2 new columns if condition met?


Goal:

  1. Compare 2 CSV files (Pandas DataFrames)
  2. Where the user_id values match, copy the country and year_of_birth values from one DataFrame into the corresponding row/columns of the second DataFrame
  3. Create a new CSV file from the resulting "full" (updated) DataFrame

The code below works, but it takes a LONG time when the CSV files are large. I have to imagine there is a better way to do this; I just haven't been able to figure it out.

Current code:

#!/usr/bin/env python

# Imports
import pandas as pd

# Variables / Arrays
import_optin_file_name = "full_data.csv"
import_extravars_file_name = "id-country-dob.csv"
export_file_name = "new_list.csv"

# Create DataFrames from 2 imported CSV files
optin_infile = pd.read_csv(import_optin_file_name)
extravars_infile = pd.read_csv(import_extravars_file_name)

# Insert new columns (country, year_of_birth) into the optin_infile DataFrame, initialized to the string 'NULL'
optin_infile.insert(loc=7, column='country', value='NULL')
optin_infile.insert(loc=8, column='year_of_birth', value='NULL')

# Iterate through rows, compare user_id - if match, add value to country, year_of_birth columns
for ev_index, ev_row in extravars_infile.iterrows():
    for optin_index, optin_row in optin_infile.iterrows():
        if ev_row['user_id'] == optin_row['user_id']:
            optin_infile.loc[optin_index, 'country'] = ev_row['country']
            optin_infile.loc[optin_index, 'year_of_birth'] = ev_row['year_of_birth']

# export to new CSV file
optin_infile.to_csv(export_file_name, header=True, index=False)

Any suggestions/examples of how to get this to work better with large CSV files are greatly appreciated! Thanks!

UPDATE 02/28/2022

My comment response to one of the answers did not come out clearly, so I am posting the update here. I used Raymond Kwok's answer and got it to work much faster. I do have one minor data issue: during the merge, it seems that the year_of_birth column changes type? Meaning, if the value is originally 1987, after the merge it ends up as 1987.0 in the new DataFrame.
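
A minimal illustration of the symptom, reduced to two toy Series (not my actual data):

import pandas as pd
import numpy as np

# An all-integer Series keeps the int64 dtype...
print(pd.Series([1987, 1990]).dtype)    # int64

# ...but NaN is a float, so one missing value upcasts the whole
# column to float64, and 1987 prints as 1987.0
print(pd.Series([1987, np.nan]).dtype)  # float64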

My updated code:

#!/usr/bin/env python

# Imports
import pandas as pd

# Variables / Arrays
import_optin_file_name = "full_data.csv"
import_extravars_file_name = "id-country-dob.csv"
export_file_name = "new_list.csv"

# Create DataFrames from 2 imported CSV files
optin_infile = pd.read_csv(import_optin_file_name)
extravars_infile = pd.read_csv(import_extravars_file_name)

# Merge extravars into optin when user_id matches
outcome = optin_infile.merge(
    extravars_infile[['user_id', 'country', 'year_of_birth']], 
    on='user_id', how='left'
)

# Replace blank (missing) values with NULL
outcome.fillna('NULL', inplace=True)

# Export to new CSV file
outcome.to_csv(export_file_name, index=False)

To try to solve the year_of_birth data type issue, I added this line before the merge - extravars_infile['year_of_birth'] = extravars_infile['year_of_birth'].astype('object') - This works fine for small CSV files, but not for larger ones.

Any thoughts? Thanks for the help!

** UPDATE - data type issue fix ** Raymond Kwok to the rescue again. Added the following line to this section:

# Replace blank (missing) values with NULL
# (fill NaN with a -1 sentinel, cast back to int, then swap the sentinel for NULL)
outcome['year_of_birth'] = outcome['year_of_birth'].fillna(-1).astype(int).astype(str).replace('-1', 'NULL')
outcome.fillna('NULL', inplace=True)

This solved the 1987.0 issue: the left merge leaves NaN in year_of_birth for any user_id without a match, and because NaN is a float, pandas upcasts the whole column to float. Filling the gaps and casting back to int restores the original values. Thanks so much.
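
For reference, an alternative sketch that avoids the -1 sentinel entirely, assuming a recent pandas version with the nullable integer dtype:

# The nullable integer dtype ('Int64', capital I) can hold missing
# values without upcasting to float, so 1987 stays 1987
outcome['year_of_birth'] = outcome['year_of_birth'].astype('Int64')

# Skip the fillna step for this column and let to_csv write NULL
# for any missing values instead
outcome.to_csv(export_file_name, index=False, na_rep='NULL')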

Next issue: the resulting CSV file is ~2 million rows longer than the original file. The row count should match, which means the merge is duplicating rows. I am looking into this; I suspect it is not handling duplicate entries of the same user_id in both files the way I expect. This was one of the reasons I wanted to go row by row and check (original code), but alas, that doesn't scale well to large files. Any further help is appreciated, and I will post here if I confirm the suspected problem and figure out a workaround or a way to clean up the result.

Thanks!!

** UPDATE - duplicated lines in merge (kind-of) fix ** As suspected, the issue happened because both files had duplicate rows with the same user_id. That is expected in my data, but my original row-by-row comparison never added rows, so I didn't have to worry about it. merge, however, pairs every left row with every matching right row (a many-to-many join), which is what produced the duplicate rows. I solved this by "cleaning" one of the files (labeled import_extravars_file_name in the code) beforehand, essentially grouping and reducing it to one row per user_id.

The reason I say "kind of" a fix is that I don't know whether there is a better way to do this within the confines of the merge statement itself.
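
For the record, a sketch of the same cleanup kept right next to the merge (keep='first' is an assumption about which duplicate row to keep, and validate='m:1' makes merge fail loudly if duplicates slip through):

# Collapse the lookup file to one row per user_id before merging
extravars_dedup = extravars_infile.drop_duplicates(subset='user_id', keep='first')

# validate='m:1' raises a MergeError if user_id is still duplicated
# on the right side, instead of silently multiplying rows
outcome = optin_infile.merge(
    extravars_dedup[['user_id', 'country', 'year_of_birth']],
    on='user_id', how='left', validate='m:1'
)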

But ultimately, I do have a solution that works.

Thanks all for the help, and especially Raymond Kwok, who supplied the merge code for me to start down the path to success.

Thanks!!

CodePudding user response:

You need merge. Moreover, you need to make sure the 'user_id' columns of both DataFrames have the same data type for merge to be able to identify matches; for example, 123 as an integer won't match '123' as a string.

outcome = optin_infile.merge(
    extravars_infile[['user_id', 'country', 'year_of_birth']], 
    on='user_id', how='left'
)

outcome.to_csv(export_file_name, index=False)
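
If the key dtypes differ, a minimal sketch of aligning them before the merge (casting both sides to string is just one option; the two sides only need to agree):

# Make the merge key the same dtype on both sides
optin_infile['user_id'] = optin_infile['user_id'].astype(str)
extravars_infile['user_id'] = extravars_infile['user_id'].astype(str)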

If you are wondering why your own attempts at merge didn't work after making sure about the data types, then share the code of the failed merge, along with a minimal number of rows from each DataFrame, so that we can reproduce the failed attempt on our machines and investigate the reason.
