I have a very big (4 GB) text file and a script which splits it into small files based on the characters before the first comma, e.g. a 16,... line goes to 16.csv and a 61,... line goes to 61.csv. Unfortunately this script runs for ages, I guess because of the write-out method. Is there any way to speed up the script?
import pandas as pd
import csv

with open(r"updates//merged_lst.csv", encoding="utf8", errors='ignore') as f:
    r = f.readlines()

for i in range(len(r)):
    row = r[i]
    letter = r[i].split(',')[0]
    filename = r"import//" + letter.upper() + ".csv"
    with open(filename, 'a', encoding="utf8", errors='ignore') as f:
        f.write(row)
CodePudding user response:
I'm not sure if this really makes a huge difference, or if the bottleneck is somewhere else, but instead of opening and closing the output file for each line in the input, I would open each output file once and reuse it.
In order to keep multiple files open at the same time and use the correct one, I would put them in a dictionary, using the letter as key:
files = {}
for i in range(len(r)):
    row = r[i]
    letter = r[i].split(',')[0]
    if letter not in files:
        # open each output file only once and cache the handle
        filename = r"import//" + letter.upper() + ".csv"
        files[letter] = open(filename, 'a', encoding="utf8", errors='ignore')
    f = files[letter]
    f.write(row)
(Instead of checking if letter not in files you could also use files.setdefault.)
This has the drawback that you can't use a with
block to automatically close the files. You have to close them at the end manually to ensure that the contents are written to disk.
for f in files.values():
    f.close()
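
If you want the output files closed (and flushed) automatically even when an error occurs, one option is contextlib.ExitStack; this is a minimal sketch, reusing the paths from the question and iterating the input lazily instead of readlines():

import contextlib

with open(r"updates//merged_lst.csv", encoding="utf8", errors='ignore') as src, \
        contextlib.ExitStack() as stack:
    files = {}
    for row in src:
        letter = row.split(',')[0]
        if letter not in files:
            filename = r"import//" + letter.upper() + ".csv"
            # enter_context registers the file so ExitStack closes it when the with block exits
            files[letter] = stack.enter_context(
                open(filename, 'a', encoding="utf8", errors='ignore'))
        files[letter].write(row)

The output files still stay open for the whole run; they are just guaranteed to be closed at the end.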
CodePudding user response:
Reading the file in sections and using to_csv will speed this script up. This example reads the big file 500,000 lines at a time.
import pandas as pd

# header=None keeps the first line as data; encoding_errors requires pandas >= 1.3
r = pd.read_csv(r"updates//merged_lst.csv", chunksize=500000, header=None,
                dtype=str, encoding="utf8", encoding_errors='ignore')
for chunk in r:
    for index, row in chunk.iterrows():
        # read_csv has already split each line, so column 0 is the text before the first comma
        letter = row[0]
        f = r"import//" + letter.upper() + ".csv"
        chunk.loc[index:index].to_csv(f, mode='a', header=False, index=False)
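
A further variation on the same idea (just a sketch, not benchmarked): group each chunk by its first column and write each group with a single to_csv call, so there is one append per output file per chunk instead of one per row.

import pandas as pd

reader = pd.read_csv(r"updates//merged_lst.csv", chunksize=500000, header=None,
                     dtype=str, encoding="utf8", encoding_errors='ignore')
for chunk in reader:
    # one to_csv call per output file per chunk instead of one per row
    for letter, group in chunk.groupby(chunk[0].str.upper()):
        group.to_csv(r"import//" + letter + ".csv",
                     mode='a', header=False, index=False)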