I have 2 files, file1.csv and a large csv called master_file.csv. They have several columns and have a common column name called EMP_Code.
file1 example:
EMP_name | EMP_Code | EMP_dept |
---|---|---|
a | s283 | abc |
b | f367 | xyz |
d | c264 | abc |
master_file.csv example:
|EMP_name | EMP_age | EMP_Service | EMP_Code | EMP_dept |
|:------- | :-----: | :---------: |:-------: | :--------|
|a | 30 | 6 | s283 | abc |
|b | 29 | 3 | f367 | xyz |
|r | 27 | 1 | g364 | lmn |
|d | 45 | 10 | c264 | abc |
|t | 50 | 25 | t453 | lmn |
I want to extract similar rows from master_file using all the EMP_Code values in file1. I tried the following code and received the error. I cannot read the complete master csv file as it is around 20gb, has lakhs of rows and running out of memory. I want to read the master_file in chunks and extract the complete rows for each of the EMP_Code present in file1 and save it into new file Employee_full_data.
import pandas as pd
import numpy as np
import os
from pandas import DataFrame
df = pd.read_csv(r"file1.csv")
df_items = pd.read_csv(r"master_file.csv", names=['EMP_Code'], iterator=True, chunksize=10000)
article_usage = df.loc[df['EMP_Code'] .isin(df_items['EMP_Code'])]
article_usage.to_csv(r"Employee_full_data.csv")
TypeError Traceback (most recent call last)
/tmp/ipykernel_117278/3252874549.py in <module>
6 df = pd.read_csv(r"file1.csv")
7 df_items = pd.read_csv(r"master_file.csv", names=['EMP_Code'], iterator=True, chunksize=10000)
----> 8 article_usage = df.loc[df['EMP_Code'] .isin(df_items['EMP_Code'])]
9 article_usage.to_csv(r"Employee_full_data.csv")
TypeError: 'TextFileReader' object is not subscriptable
CodePudding user response:
One solution is to use the built-in csv
library for better memory handling. Although it prevents the out-of-memory issue, it might take some time to complete due to using for loop to scan the rows one by one.
Assuming these two sample files:
file1.csv:
EMP_name,EMP_Code,EMP_dept
a,s283,abc
b,f367,xyz
d,c264,abc
master_file.csv:
EMP_name,EMP_age,EMP_Service,EMP_Code,EMP_dept
a,30,6,s283,abc
b,29,3,f367,xyz
r,27,1,g364,lmn
d,45,10,c264,abc
t,50,25,t453,lmn
You can use the following code snippet (use the commented line instead of selected_rows.append(row)
if you want to keep the records of file1 instead of those of master_file):
import csv
import pandas as pd
df = pd.read_csv(r"file1.csv")
list_codes = list(df.EMP_Code)
selected_rows = []
with open(r"master_file.csv") as csv_file:
reader = csv.DictReader(csv_file)
for row in reader:
if row['EMP_Code'] in list_codes:
selected_rows.append(row)
# selected_rows.extend(df.loc[df.EMP_Code == row['EMP_Code']].to_dict('records'))
article_usage = pd.DataFrame.from_records(selected_rows)
article_usage.to_csv(r"Employee_full_data.csv", index=False)
Employee_full_data.csv:
EMP_name,EMP_age,EMP_Service,EMP_Code,EMP_dept
a,30,6,s283,abc
b,29,3,f367,xyz
d,45,10,c264,abc