Split Large csv File into multiple files depending on row value python-CodePudding

Background

I have a large csv file in a specific format (NEM12) which is too big to work with. the file format is as below;

The file always starts with 100
Rows with a 200 represent the start of a new dataset
Rows with 300 or 400 represent data for the dataset
The File always ends with a 900

Example Below

100 NEM12               
200 NMI INFO    INFO        
300 20211001    0   0   0   0
400 20  20  F17     
300 20211002    0   0   0   0
300 20211003    0   0   0   0
200 NMI INFO    INFO        
300 20211001    0   0   0   0
300 20211002    0   0   0   0
300 20211003    0   0   0   0
300 20211004    0   0   0   0
300 20211005    0   0   0   0
…                   
200 NMI INFO    INFO        
300 20211001    0   0   0   0
300 20211002    0   0   0   0
400 20  20  F17     
300 20211003    0   0   0   0
300 20211004    0   0   0   0
900

What I'm Trying to do

I am trying to split the large file, into hundreds of smaller files. Each smaller file would contain a single 200 row with the corresponding 300 and 400 rows worth of values.

What I've tried

I've tried to read the file in via pandas, but due to its irregular shape, that hasn't worked.

I've successfully been able to iterate through lines via the below code, but it splits each value into its own column (i.e. instead of 200, it goes 2, 0, 0).

Any help be appreciated.

for line in open(test):
    if left(line, 3) == '200':
        try:
            with open(fname, 'a', newline='') as f_object:
                writer_object = writer(f_object)
                writer_object.writerow('900')
            f_object.close()
        except NameError:
            print('ignore')
        fname = str(line.replace(',', '').replace('\n', ''))   '.csv'
        with open(fname, 'w', newline='') as f_object:
            writer_object = writer(f_object)
            writer_object.writerow('100')
            writer_object.writerow(line)
    if left(line, 3) == '300' or left(line, 3) == '400':
        with open(fname, 'a', newline='') as f_object:
            writer_object = writer(f_object)
            writer_object.writerow(line)

CodePudding user response：

Here is one approach.

fn = 'NEM12#000000000000001#CNRGYMDP#NEMMCO.csv'

cnt = 0
outfn = f'out_{cnt}.csv'

with open(fn, 'r') as f:
    for line in f:
        if line.startswith('100,'):  # don't write
            continue
        elif line.startswith('900'):  # don't write
            continue
        elif line.startswith('200,'):  # write detect start
            cnt  = 1
            outfn = f'out_{cnt}.csv'  # new filename
            
        if line.startswith(('200,', '300,', '400,')):
            with open(outfn, 'a') as w:  # write
                w.write(f'{line}'):

The output will be out_1.csv, out_2.csv etc

CodePudding user response：

Thanks to @Ferdy for the assistance,

Using the code you provided, as well as my original code, I was able to solve the issue

from csv import writer

for line in open(test):
    if line.startswith('200'):
        try:
            with open(fname, 'a', newline='') as f:
                w = writer(f)
                w.writerow(['900'])
            f.close()
        except NameError:
            print('ignore')
        flist = [str(line).split(",")[x] for x in [1, 3, 6, 7, 8]]
        fname = '_'.join(flist)   '.csv'
        print(fname)
        with open(fname, 'w', newline='') as f:
            w = writer(f)
            w.writerow(['100', 'NEM12', 'DATECREATED', 'MDYMDP', 'NAME'])
            w.writerow(str(line).split(","))
    if line.startswith(('300,', '400,')):
        with open(fname, 'a', newline='') as f:
            w = writer(f)
            w.writerow(str(line).split(","))