Data loss while extracting rows from a large CSV file


This is a continuation of my previous question. I have two files, file1.csv and a large CSV called master_file.csv. Both have several columns and share a common column called EMP_Code.

File 1 example:

EMP_name EMP_Code EMP_dept
b        f367     abc
a        c264     xyz
c        d264     abc

master_file example:

EMP_name EMP_age EMP_Service EMP_Code EMP_dept
a         30       6           c264     xyz
b         29       3           f367     abc
r         27       1           g364     lmn
d         45       10          c264     abc
t         50       25          t453     lmn

I want to extract the matching rows from master_file using all the EMP_Code values in file1. I tried the following code and I am losing a lot of data. I cannot read the complete master CSV file because it is around 20 GB, has millions of rows, and I run out of memory. I want to read master_file in chunks, extract the complete rows for each EMP_Code present in file1, and save them into a new file, Employee_full_data.

import csv
import pandas as pd

df = pd.read_csv(r"master_file.csv")
li = ['c264', 'f367']
full_data = df[df.EMP_Code.isin(li)]
full_data.to_csv(r"Employee_full_data.csv", index=False)

I also tried the following code. I receive an empty file whenever I use the EMP_Code column, but it works fine when I use columns like EMP_name or EMP_dept. I want to extract the data using EMP_Code.

import csv
import pandas as pd

df = pd.read_csv(r"file1.csv")

list_codes = list(df.EMP_Code)
selected_rows = []
with open(r"master_file.csv") as csv_file:
    reader = csv.DictReader(csv_file)
    for row in reader:
        if row['EMP_Code'] in list_codes:
            selected_rows.append(row)

article_usage = pd.DataFrame.from_records(selected_rows)
article_usage.to_csv(r"Employee_full_data.csv", index=False)

Is there any other way I can extract the data without loss? I have heard about joins and reading data in chunks, but I am not sure how to use them here. Any help is appreciated.

CodePudding user response:

You just have to pass chunksize=<SOME INTEGER> to pandas' .read_csv function (see the pandas read_csv documentation).

If you pass chunksize=2, you will read the file into dataframes of 2 rows each. Or... more accurately, it will read 2 rows of the csv into a dataframe at a time. You can then apply your filter to that 2-row dataframe and "accumulate" the result into another dataframe. The next iteration will read the next two rows, which you can subsequently filter... Lather, rinse and repeat:

import pandas as pd
li = ['c264', 'f367']

result_df = pd.DataFrame()
with pd.read_csv("master_file.csv", chunksize=2) as reader:
    for chunk_df in reader:
        filtered_df = chunk_df[chunk_df.EMP_Code.isin(li)]
        result_df = pd.concat([result_df, filtered_df])
print(result_df)

# Outputs:
#   EMP_name  EMP_age  EMP_Service EMP_Code EMP_dept
# 0        a       30            6     c264      xyz
# 1        b       29            3     f367      abc
# 3        d       45           10     c264      abc
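
If the filtered result is itself too large to hold comfortably in memory (or you just want to avoid the repeated pd.concat), a variation of the same idea appends each filtered chunk straight to the output file. This is only a sketch using the column and file names from your example; in practice you would pick a much larger chunksize than 2 (say, 100,000 rows):

import pandas as pd

li = ['c264', 'f367']

# Stream the big file in chunks and append matching rows to the output file,
# so only one chunk is ever held in memory at a time.
with pd.read_csv("master_file.csv", chunksize=100_000) as reader:
    for i, chunk_df in enumerate(reader):
        filtered_df = chunk_df[chunk_df.EMP_Code.isin(li)]
        filtered_df.to_csv(
            "Employee_full_data.csv",
            mode="w" if i == 0 else "a",   # overwrite on the first chunk, append afterwards
            header=(i == 0),               # write the header only once
            index=False,
        )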

CodePudding user response:

One way to handle this type of file read/write task is to use a generator and read the data in chunks or portions that you can handle within your memory (or other) constraints.

def read_line():
    with open('master_file.csv', 'r') as fid:
        # readline() returns '' at end of file, which ends the loop
        while (line := fid.readline()):
            # the file is a CSV, so split on commas rather than whitespace
            yield line.rstrip('\n').split(',')

This simple generator gives one new line per call. Now you can simply iterate over it, do whatever filtering you are interested in, and build your new dataframe.

r_line = read_line()    
for l in r_line:
    print(l)

You could modify the generator to, for example, parse and return a list, return multiple lines at a time, etc.
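
For instance, here is a minimal sketch of the filtering step built on top of this generator, assuming comma-delimited fields and the file names from the question:

import csv
import pandas as pd

codes = set(pd.read_csv('file1.csv').EMP_Code)

rows = read_line()
header = next(rows)                  # the first yielded line is the header row
code_idx = header.index('EMP_Code')  # position of the EMP_Code column

with open('Employee_full_data.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    writer.writerow(header)
    for fields in rows:
        if fields[code_idx] in codes:   # keep only rows whose EMP_Code is in file1
            writer.writerow(fields)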

CodePudding user response:

I ran the code from your 2nd example (using csv.DictReader) on your small example and it worked. I'm guessing your problem might have to do with the real-life scale of master_file as you've alluded to.

The problem might be that despite using csv.DictReader to stream information in, you're still using a Pandas dataframe to aggregate everything before writing it out, and maybe the output is breaking your memory budget.

If that's true, then use csv.DictWriter to stream out. The only tricky bit is getting the writer set up, because it needs to know the fieldnames, which can't be known until we've read the first row, so we'll set up the writer in the first iteration of the read loop.

(I've removed the with open(... contexts because I think they add too much indentation)

df = pd.read_csv(r"file1.csv")
list_codes = list(df.EMP_Code)

f_in = open(r"master_file.csv", newline="")
reader = csv.DictReader(f_in)

f_out = open(r"output.csv", "w", newline="")

init_writer = True
for row in reader:
    if init_writer:
        writer = csv.DictWriter(f_out, fieldnames=row)
        writer.writeheader()
        init_writer = False

    if row["EMP_Code"] in list_codes:
        writer.writerow(row)

f_out.close()
f_in.close()

Output (contents of output.csv for the example data):

EMP_name EMP_age EMP_Service EMP_Code EMP_dept
a        30      6           c264     xyz
b        29      3           f367     abc
d        45      10          c264     abc

And if you'd like to get rid of Pandas altogether:

list_codes = set()
with open(r"file1.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        list_codes.add(row["EMP_Code"])
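
Combined with a streaming read/write loop like the one above, that gives a fully pandas-free version. A sketch under the same file-name assumptions:

import csv

# Collect the EMP_Code values from file1 without pandas.
list_codes = set()
with open(r"file1.csv", newline="") as f:
    for row in csv.DictReader(f):
        list_codes.add(row["EMP_Code"])

# Stream master_file row by row and write only the matching rows.
with open(r"master_file.csv", newline="") as f_in, \
     open(r"output.csv", "w", newline="") as f_out:
    reader = csv.DictReader(f_in)
    # DictReader exposes the header via .fieldnames, so the writer can be
    # set up before the loop instead of inside the first iteration.
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row["EMP_Code"] in list_codes:
            writer.writerow(row)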