Extracting specific rows by reading the large csv file in chunks and using the same header present i-CodePudding

I have 2 files, file1.csv and a large csv called master_file.csv. They have several columns and have a common column name called EMP_Code.

file1 example:

EMP_name	EMP_Code	EMP_dept
a	s283	abc
b	f367	xyz
d	c264	abc

master_file.csv example:

|EMP_name | EMP_age | EMP_Service | EMP_Code | EMP_dept |
|:------- | :-----: | :---------: |:-------: | :--------|
|a        | 30      | 6           | s283     | abc      |
|b        | 29      | 3           | f367     | xyz      |
|r        | 27      | 1           | g364     | lmn      |
|d        | 45      | 10          | c264     | abc      |
|t        | 50      | 25          | t453     | lmn      |

I want to extract similar rows from master_file using all the EMP_Code values in file1. I tried the following code and received the error. I cannot read the complete master csv file as it is around 20gb, has lakhs of rows and running out of memory. I want to read the master_file in chunks and extract the complete rows for each of the EMP_Code present in file1 and save it into new file Employee_full_data.

import pandas as pd
import numpy as np
import os
from pandas import DataFrame

df = pd.read_csv(r"file1.csv")
df_items = pd.read_csv(r"master_file.csv", names=['EMP_Code'], iterator=True, chunksize=10000)
article_usage = df.loc[df['EMP_Code'] .isin(df_items['EMP_Code'])]
article_usage.to_csv(r"Employee_full_data.csv")

    TypeError Traceback (most recent call last)
    /tmp/ipykernel_117278/3252874549.py in <module>
          6 df = pd.read_csv(r"file1.csv")
          7 df_items = pd.read_csv(r"master_file.csv", names=['EMP_Code'], iterator=True, chunksize=10000)
    ----> 8 article_usage = df.loc[df['EMP_Code'] .isin(df_items['EMP_Code'])]
          9 article_usage.to_csv(r"Employee_full_data.csv")
  
    TypeError: 'TextFileReader' object is not subscriptable

CodePudding user response：

One solution is to use the built-in csv library for better memory handling. Although it prevents the out-of-memory issue, it might take some time to complete due to using for loop to scan the rows one by one.

Assuming these two sample files:

file1.csv:

EMP_name,EMP_Code,EMP_dept
a,s283,abc
b,f367,xyz
d,c264,abc

master_file.csv:

EMP_name,EMP_age,EMP_Service,EMP_Code,EMP_dept
a,30,6,s283,abc
b,29,3,f367,xyz
r,27,1,g364,lmn
d,45,10,c264,abc
t,50,25,t453,lmn

You can use the following code snippet (use the commented line instead of selected_rows.append(row) if you want to keep the records of file1 instead of those of master_file):

import csv
import pandas as pd


df = pd.read_csv(r"file1.csv")

list_codes = list(df.EMP_Code)
selected_rows = []
with open(r"master_file.csv") as csv_file:
    reader = csv.DictReader(csv_file)
    for row in reader:
        if row['EMP_Code'] in list_codes:
            selected_rows.append(row)
            # selected_rows.extend(df.loc[df.EMP_Code == row['EMP_Code']].to_dict('records'))

article_usage = pd.DataFrame.from_records(selected_rows)
article_usage.to_csv(r"Employee_full_data.csv", index=False)

Employee_full_data.csv:

EMP_name,EMP_age,EMP_Service,EMP_Code,EMP_dept
a,30,6,s283,abc
b,29,3,f367,xyz
d,45,10,c264,abc