I have a very big (4 GB) text file and a script which splits it into small files based on the characters before the first comma, e.g. a 16,... line goes to 16.csv and a 61,... line goes to 61.csv. Unfortunately this script runs for ages, I guess because of the write-out method. Is there any way to speed up the script?
import pandas as pd
import csv

with open(r"updates//merged_lst.csv", encoding="utf8", errors='ignore') as f:
    r = f.readlines()

for i in range(len(r)):
    row = r[i]
    letter = r[i].split(',')[0]
    filename = r"import//" + letter.upper() + ".csv"
    with open(filename, 'a', encoding="utf8", errors='ignore') as f:
        f.write(row)
CodePudding user response:
I'm not sure if this really makes a huge difference, or if the bottleneck is somewhere else, but instead of opening and closing the output file for each line in the input, I would open each output file once and reuse it.
In order to keep multiple files open at the same time and use the correct one, I would put them in a dictionary, using the letter as key:
files = {}
for i in range(len(r)):
    row = r[i]
    letter = r[i].split(',')[0]
    if letter not in files:
        # open each output file only once and cache the handle
        filename = r"import//" + letter.upper() + ".csv"
        files[letter] = open(filename, 'a', encoding="utf8", errors='ignore')
    f = files[letter]
    f.write(row)
(Instead of checking if letter not in files you could also use files.setdefault.)
This has the drawback that you can't use a with
block to automatically close the files. You have to close them at the end manually to ensure that the contents are written to disk.
for f in files.values():
    f.close()
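
If you want the output files closed (and flushed) automatically even when an error occurs, one option is contextlib.ExitStack; this is a minimal sketch, reusing the paths from the question and iterating the input lazily instead of readlines():

import contextlib

with open(r"updates//merged_lst.csv", encoding="utf8", errors='ignore') as src, \
        contextlib.ExitStack() as stack:
    files = {}
    for row in src:
        letter = row.split(',')[0]
        if letter not in files:
            filename = r"import//" + letter.upper() + ".csv"
            # enter_context registers the file so ExitStack closes it when the with block exits
            files[letter] = stack.enter_context(
                open(filename, 'a', encoding="utf8", errors='ignore'))
        files[letter].write(row)

The output files still stay open for the whole run; they are just guaranteed to be closed at the end.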
CodePudding user response:
Reading the file in sections and using to_csv will speed this script up. This example reads the big file 500,000 lines at a time.
import pandas as pd

# header=None keeps the first line as data; encoding_errors requires pandas >= 1.3
r = pd.read_csv(r"updates//merged_lst.csv", chunksize=500000, header=None,
                dtype=str, encoding="utf8", encoding_errors='ignore')
for chunk in r:
    for index, row in chunk.iterrows():
        # read_csv has already split each line, so column 0 is the text before the first comma
        letter = row[0]
        f = r"import//" + letter.upper() + ".csv"
        chunk.loc[index:index].to_csv(f, mode='a', header=False, index=False)
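
A further variation on the same idea (just a sketch, not benchmarked): group each chunk by its first column and write each group with a single to_csv call, so there is one append per output file per chunk instead of one per row.

import pandas as pd

reader = pd.read_csv(r"updates//merged_lst.csv", chunksize=500000, header=None,
                     dtype=str, encoding="utf8", encoding_errors='ignore')
for chunk in reader:
    # one to_csv call per output file per chunk instead of one per row
    for letter, group in chunk.groupby(chunk[0].str.upper()):
        group.to_csv(r"import//" + letter + ".csv",
                     mode='a', header=False, index=False)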