Background
I have a large csv file in a specific format (NEM12) which is too big to work with. the file format is as below;
- The file always starts with 100
- Rows with a 200 represent the start of a new dataset
- Rows with 300 or 400 represent data for the dataset
- The File always ends with a 900
Example Below
100 NEM12
200 NMI INFO INFO
300 20211001 0 0 0 0
400 20 20 F17
300 20211002 0 0 0 0
300 20211003 0 0 0 0
200 NMI INFO INFO
300 20211001 0 0 0 0
300 20211002 0 0 0 0
300 20211003 0 0 0 0
300 20211004 0 0 0 0
300 20211005 0 0 0 0
…
200 NMI INFO INFO
300 20211001 0 0 0 0
300 20211002 0 0 0 0
400 20 20 F17
300 20211003 0 0 0 0
300 20211004 0 0 0 0
900
What I'm Trying to do
I am trying to split the large file, into hundreds of smaller files. Each smaller file would contain a single 200 row with the corresponding 300 and 400 rows worth of values.
What I've tried
I've tried to read the file in via pandas, but due to its irregular shape, that hasn't worked.
I've successfully been able to iterate through lines via the below code, but it splits each value into its own column (i.e. instead of 200, it goes 2, 0, 0).
Any help be appreciated.
for line in open(test):
if left(line, 3) == '200':
try:
with open(fname, 'a', newline='') as f_object:
writer_object = writer(f_object)
writer_object.writerow('900')
f_object.close()
except NameError:
print('ignore')
fname = str(line.replace(',', '').replace('\n', '')) '.csv'
with open(fname, 'w', newline='') as f_object:
writer_object = writer(f_object)
writer_object.writerow('100')
writer_object.writerow(line)
if left(line, 3) == '300' or left(line, 3) == '400':
with open(fname, 'a', newline='') as f_object:
writer_object = writer(f_object)
writer_object.writerow(line)
CodePudding user response:
Here is one approach.
fn = 'NEM12#000000000000001#CNRGYMDP#NEMMCO.csv'
cnt = 0
outfn = f'out_{cnt}.csv'
with open(fn, 'r') as f:
for line in f:
if line.startswith('100,'): # don't write
continue
elif line.startswith('900'): # don't write
continue
elif line.startswith('200,'): # write detect start
cnt = 1
outfn = f'out_{cnt}.csv' # new filename
if line.startswith(('200,', '300,', '400,')):
with open(outfn, 'a') as w: # write
w.write(f'{line}'):
The output will be out_1.csv, out_2.csv etc
CodePudding user response:
Thanks to @Ferdy for the assistance,
Using the code you provided, as well as my original code, I was able to solve the issue
from csv import writer
for line in open(test):
if line.startswith('200'):
try:
with open(fname, 'a', newline='') as f:
w = writer(f)
w.writerow(['900'])
f.close()
except NameError:
print('ignore')
flist = [str(line).split(",")[x] for x in [1, 3, 6, 7, 8]]
fname = '_'.join(flist) '.csv'
print(fname)
with open(fname, 'w', newline='') as f:
w = writer(f)
w.writerow(['100', 'NEM12', 'DATECREATED', 'MDYMDP', 'NAME'])
w.writerow(str(line).split(","))
if line.startswith(('300,', '400,')):
with open(fname, 'a', newline='') as f:
w = writer(f)
w.writerow(str(line).split(","))