Goal:
- Compare 2 CSV files (Pandas DataFrames)
- If the user_id value matches in rows, add the values of the country and year_of_birth columns from one DataFrame into the corresponding row/columns in the second DataFrame
- Create a new CSV file from the resulting "full" (updated) DataFrame
The code below works, but it takes a LONG time when the CSV files are large. I have to imagine there is a better way to do this; I just haven't been able to figure it out.
Current code:
#!/usr/bin/env python
# Imports
import pandas as pd
# Variables / Arrays
import_optin_file_name = "full_data.csv"
import_extravars_file_name = "id-country-dob.csv"
export_file_name = "new_list.csv"
# Create DataFrames from 2 imported CSV files
optin_infile = pd.read_csv(import_optin_file_name)
extravars_infile = pd.read_csv(import_extravars_file_name)
# Create/Insert new columns to "optin_infile" dataframe (country,year_of_birth) with initial value of NULL
optin_infile.insert(loc=7, column='country', value='NULL')
optin_infile.insert(loc=8, column='year_of_birth', value='NULL')
# Iterate through rows, compare user_id - if match, add value to country, year_of_birth columns
for ev_index, ev_row in extravars_infile.iterrows():
    for optin_index, optin_row in optin_infile.iterrows():
        if ev_row['user_id'] == optin_row['user_id']:
            optin_infile.loc[optin_index, 'country'] = ev_row['country']
            optin_infile.loc[optin_index, 'year_of_birth'] = ev_row['year_of_birth']
# export to new CSV file
optin_infile.to_csv(export_file_name, header=True, index=False)
Any suggestions/examples of how to get this to work better with large csv files are greatly appreciated! Thanks!
UPDATE 02/28/2022
My comment response to one of the answers did not come out clearly, so I am posting the update here. I used Raymond Kwok's answer and got it to work much faster. I do have one minor data issue: during the merge, it seems that the year_of_birth column changes type? Meaning, if the value is originally 1987, after the merge, in the new DataFrame, it ends up as 1987.0.
My updated code:
#!/usr/bin/env python
# Imports
import pandas as pd
# Variables / Arrays
import_optin_file_name = "full_data.csv"
import_extravars_file_name = "id-country-dob.csv"
export_file_name = "new_list.csv"
# Create DataFrames from 2 imported CSV files
optin_infile = pd.read_csv(import_optin_file_name)
extravars_infile = pd.read_csv(import_extravars_file_name)
# Merge extravars into optin when user_id matches
outcome = optin_infile.merge(
    extravars_infile[['user_id', 'country', 'year_of_birth']],
    on='user_id', how='left'
)
# Replace blank (no) values with NULL
outcome.fillna('NULL', inplace=True)
# Export to new CSV file
outcome.to_csv(export_file_name, index=False)
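As a side note, to_csv can substitute a string for missing values on its own via its na_rep parameter, which avoids mutating the DataFrame with fillna; a minimal sketch replacing the fillna line above:
# Write NaN cells out as the literal string NULL at export time
outcome.to_csv(export_file_name, index=False, na_rep='NULL')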
To try to solve the year_of_birth data type issue, I added this line before the merge: extravars_infile['year_of_birth'] = extravars_infile['year_of_birth'].astype('object'). This works fine for small CSV files, but not for larger ones.
Any thoughts? Thanks for the help!
** UPDATE - data type issue fix **
Raymond Kwok to the rescue again. Added the following line to this section:
# Replace blank (no) values with NULL
outcome['year_of_birth'] = outcome['year_of_birth'].fillna(-1).astype(int).astype(str).replace('-1', 'NULL')
outcome.fillna('NULL', inplace=True)
This solved the 1987.0 issue. Thanks so much.
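For context, the float conversion happens because a left merge fills unmatched rows with NaN, and a plain int column cannot hold NaN, so pandas upcasts the whole column to float64 (1987 becomes 1987.0). An alternative sketch that sidesteps the -1 round trip, assuming pandas >= 1.0 (which introduced the nullable Int64 dtype):
# Cast to the nullable integer dtype before the merge so missing matches
# become <NA> instead of forcing the column to float
extravars_infile['year_of_birth'] = extravars_infile['year_of_birth'].astype('Int64')
# After the merge, render the column as strings; <NA> becomes 'NULL'
outcome['year_of_birth'] = outcome['year_of_birth'].astype('string').fillna('NULL')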
The next issue is that the resulting CSV file is ~2 million rows longer than the original file. The row count should match, which means the merge is duplicating rows. I am looking into this; I suspect it is not properly handling multiple entries of the same user_id in both files. This was one of the reasons I wanted to go row by row and check (original code), but alas, that doesn't scale well with large files. Any further help is appreciated, and I will post here if I end up confirming the suspected problem and figure out a way to work around or clean up the result.
Thanks!!
** UPDATE - duplicated lines in merge (kind-of) fix **
As suspected, the issue happened because both files had duplicate rows with the same user_id. This is expected, but with my original row-by-row comparison I didn't have to worry about it, because it would never add rows anyway. With merge, however, every matching pair of rows produces an output row, so duplicate keys multiply into duplicate rows. I solved this by "cleaning" one of the files (labeled import_extravars_file_name in the code) beforehand, essentially grouping and reducing it to one row per user_id value; a sketch follows below.
The reason I say "kind-of" a fix is that I don't know if there is a better way to do this within the confines of the merge statement itself.
But ultimately, I do have a solution that works.
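A minimal sketch of that pre-merge cleanup, assuming that keeping the first row per user_id is acceptable (any groupby aggregation rule could be substituted):
# Reduce the lookup file to one row per user_id before merging
extravars_infile = extravars_infile.drop_duplicates(subset='user_id', keep='first')
# merge can also guard against this: validate='many_to_one' raises a
# MergeError if the right-hand frame still contains duplicate user_id values
outcome = optin_infile.merge(
    extravars_infile[['user_id', 'country', 'year_of_birth']],
    on='user_id', how='left', validate='many_to_one'
)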
Thanks all for the help, and especially Raymond Kwok, who supplied the merge code for me to start down the path to success.
Thanks!!
CodePudding user response:
You needed merge. Moreover, you need to make sure the 'user_id' columns in both dataframes have the same data type for merge to be able to identify matches. For example, 123 as an integer won't match '123' as a string.
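A minimal sketch of aligning the key dtype on both sides before merging (converting both to string is one safe choice when the files may store the IDs differently):
# Make sure the join keys compare equal: cast both to string
optin_infile['user_id'] = optin_infile['user_id'].astype(str)
extravars_infile['user_id'] = extravars_infile['user_id'].astype(str)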
outcome = optin_infile.merge(
    extravars_infile[['user_id', 'country', 'year_of_birth']],
    on='user_id', how='left'
)
outcome.to_csv(export_file_name, index=False)
If you are wondering why your attempts at using merge didn't work even after checking the data types, then share the code of the failed merge, along with a minimal number of rows from each of your dataframes, so that we can reproduce the failed attempt on our own machines and investigate the reason.
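One quick way to share a small, reproducible sample (a sketch; trim or anonymize columns as needed):
# Print the first few rows of each frame as CSV text that others can paste
print(optin_infile.head(5).to_csv(index=False))
print(extravars_infile.head(5).to_csv(index=False))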