More efficient way to copy file line by line in python?


I have a 10GB file with this pattern:

Header,
header2,
header3,4
content
aaa, HO222222222222, AD, CE 
bbb, HO222222222222, AS, AE 
ccc, HO222222222222, AD, CE 
ddd, HO222222222222, BD, CE 
eee, HO222222222222, AD, CE 
fff, HO222222222222, BD, CE 
ggg, HO222222222222, AD, AE 
hhh, HO222222222222, AD, CE 
aaa, HO333333333333, AG, CE 
bbb, HO333333333333, AT, AE 
ccc, HO333333333333, AD, CT 
ddd, HO333333333333, BD, CE 
eee, HO333333333333, AD, CE 
fff, HO333333333333, BD, CE 
ggg, HO333333333333, AU, AE 
hhh, HO333333333333, AD, CE 
....

Let's say that the second column is an ID. In the whole file I have 4000 people, and each has 50k records.

I can't use my prepared analysis scripts on that big file (10GB; the scripts use pandas and I have too little memory. I know I should refactor them, and I'm working on it), so I need to divide the file into 4 parts. But I can't split an ID between files. I mean, I can't have part of one person's records end up in separate files.

So I wrote a script. It divides the file into 4 parts based on ID.

Here is the code:

file1 = open('file.txt', 'r')
count = 0
list_of_ids= set()
while True:
    if len(list_of_ids) < 1050:
        a = "out1.csv"
    elif (len(list_of_ids)) >= 1049 and (len(list_of_ids)) < 2100:
        a = "out2.csv"
    elif (len(list_of_ids)) >= 2099 and (len(list_of_ids)) < 3200:
        a = "out3.csv"
    else:
        a = "out4.csv"
        
    line = file1.readline()
 
    if not line:
        break
    
    try:
        
        list_of_ids.add(line.split(',')[1])
        out = open(a, "a")
        out.write(line)
        
    except IndexError as e:
        print(e)
    count += 1

out.close()

But it's really slow, and I need to speed it up. There are many ifs, and I reopen the output file for every line, but I can't figure out how to get better performance. Maybe someone has some tips?

CodePudding user response:

I think you want something more like this:

# this number is arbitrary, of course
ids_per_file = 1000
# use with, so the file always closes when you're done or if something goes wrong
with open('20220317_EuroG_MD_v3_XT_POL_FinalReport.txt', 'r') as f:
    # an easier way to loop over all the lines:
    n = 0
    ids = set()
    try:
        for line in f:
            try:
                ids.add(line.split(',')[1])
            except IndexError:
                # you don't want to break, you just want to ignore the line and continue
                continue
            # when the number of ids reaches the limit (or at the start), start a new file
            if not n or len(ids) > ids_per_file:
                # close the previous one, unless it's the first
                if n > 0:
                    out_f.close()
                # on to the next
                n += 1
                out_f = open(f'out{n}.csv', 'w')
                # reset ids
                ids = {line.split(',')[1]}
            # write the line, if you get here, it's a record
            out_f.write(line)
    finally:
        # close the last file
        out_f.close()

Edit: the previous version actually had a bug; it would write the first record of a new identifier to the previous file. I think this version is better.
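
One thing to be aware of: both your script and this one assume that all records for a given ID sit next to each other in the file, as in your sample; the set-based counting only rotates files correctly under that assumption. If you want to double-check the result afterwards, a small follow-up script (a sketch, assuming the split files are named out1.csv, out2.csv, ... as above) can confirm that no ID ended up in more than one file:

import glob

# map each ID to the set of output files it appears in
seen = {}
for path in sorted(glob.glob('out*.csv')):
    with open(path) as f:
        for line in f:
            parts = line.split(',')
            if len(parts) > 1:
                seen.setdefault(parts[1], set()).add(path)

# any ID present in more than one file means the split went wrong
split_ids = {i: files for i, files in seen.items() if len(files) > 1}
print('IDs split across files:', split_ids or 'none')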

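Side note: since the end goal is pandas analysis and memory is the real constraint, it may help to know that pandas can stream a CSV in chunks instead of loading it whole, which might remove the need to split the file at all. A minimal sketch (the skiprows value and the column names here are assumptions based on the sample; adjust them to the real layout):

import pandas as pd

# read the file a million rows at a time instead of all 10GB at once
reader = pd.read_csv(
    'file.txt',
    skiprows=4,          # skip the header block before the records
    header=None,
    names=['probe', 'id', 'col3', 'col4'],  # hypothetical column names
    chunksize=1_000_000,
)
for chunk in reader:
    # only this slice is in memory at a time; process it here
    ...

One caveat: a chunk boundary can fall in the middle of one ID's records, so a per-ID analysis would still need to carry the tail of each chunk over to the next one.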