I'm being supplied a CSV file that sometimes has the following format where the first 5 lines have more than one column:
File For EMS Team Downloaded By Bob Mortimer At 17:22:36 09/11/2021,,,,,,
line two content,,,,,,
line 3 content.,,,,,,
line 4 content.,,,,,,
,,,,,,
1,TEAM,Bob Jones,Sar a require transport,A,,18:34:04hrs on 17/10/21
2,TEAM,Peter Smith,Sar h,H,,20:43:49hrs on 17/10/21
3,TEAM,Neil Barnes,SAR H,H,,20:15:12hrs on 17/10/21
Other times the first 4 lines only have 1 column and the 5th none:
File For EMS Team Downloaded By Bob Mortimer At 17:22:36 09/11/2021
line two content
line 3 content.
line 4 content.
1,TEAM,Bob Jones,Sar a require transport,A,,18:34:04hrs on 17/10/21
2,TEAM,Peter Smith,Sar h,H,,20:43:49hrs on 17/10/21
3,TEAM,Neil Barnes,SAR H,H,,20:15:12hrs on 17/10/21
I need to be able to cater for both types of file. I want to regex data out of the first 4 lines and then go on to build a list from the "normal" data from line 6 onwards.
I'm using csv_reader
and tried testing if the line had more than one column but that only works on the second example. I then tried to test if the second column was empty with:
if row[1] == None:
but I get an IndexError: list index out of range which I understand
I can't test if row[0]
is a number as in the second example the 5th line doesn't have any columns. What I need to do is read in the first x number of rows and regex the name,time and date, skip the empty row then read the next set of rows as normal.
It's this reading of the first block in either format then the next block I'm struggling with.
Currently I've got:
with open('file.csv', 'r') as csvfile:
csv_reader = csv.reader(csvfile)
for row in csv_reader:
if len(row) > 0 and row[0] != "":
print(row)
which is missing out the empty row in both examples but I'm struggling with list index out of range errors when I try and test on row[1]
.
I'm sure there's a simple way of doing this but none of my googling has come up with anything yet.
CodePudding user response:
In both cases, I'd generalize your CSV as follows:
- lines 1-4: special "lines" of text
- line 5: garbage (discard)
- lines 6-...: meaningful "rows"
Here's that general approach in code. The parse_special_csv
function takes a filename as input and return two lists:
- the first is a list of "lines" (1-4); they're technically rows, but it's more about how you treat them/what you do with them
- the second is a list of rows, (lines 6-...)
My thinking being, that once you have the data split out, and file is completely parsed, you'll know what to do with lines
and what to do with rows
:
import csv
def parse_special_csv(fname):
lines = []
rows = []
with open(fname, 'r', newline='') as f:
reader = csv.reader(f)
# Treat lines 1-4 as just "lines"
for i in range(4):
row = next(reader) # manually advance the reader
lines.append(row[0]) # safe to index for first column, because *you know* these lines have column-like data
# Discard line 5
next(reader)
# Treat the remaining lines as CSV rows
for row in reader:
rows.append(row)
return lines, rows
lines, rows = parse_special_csv('sample1.csv')
print('sample1')
print('lines:')
print(lines)
print('rows:')
print(rows)
print()
lines, rows = parse_special_csv('sample2.csv')
print('sample2')
print('lines:')
print(lines)
print('rows:')
print(rows)
print()
And I get, based on your samples:
sample1
lines:
[
'File For EMS Team Downloaded By Bob Mortimer At 17:22:36 09/11/2021',
'line two content',
'line 3 content.',
'line 4 content.'
]
rows:
[
['1', 'TEAM', 'Bob Jones', 'Sar a require transport', 'A', '', '18:34:04hrs on 17/10/21'],
['2', 'TEAM', 'Peter Smith', 'Sar h', 'H', '', '20:43:49hrs on 17/10/21'],
['3', 'TEAM', 'Neil Barnes', 'SAR H', 'H', '', '20:15:12hrs on 17/10/21']
]
sample2
lines:
[
'File For EMS Team Downloaded By Bob Mortimer At 17:22:36 09/11/2021',
'line two content',
'line 3 content.',
'line 4 content.'
]
rows:
[
['1', 'TEAM', 'Bob Jones', 'Sar a require transport', 'A', '', '18:34:04hrs on 17/10/21'],
['2', 'TEAM', 'Peter Smith', 'Sar h', 'H', '', '20:43:49hrs on 17/10/21'],
['3', 'TEAM', 'Neil Barnes', 'SAR H', 'H', '', '20:15:12hrs on 17/10/21']
]
Also, next(reader)
may look a little foreign, but it's the correct way to manually advance the CSV reader^1 (and any iterator in Python, in general^2).