Home > Enterprise >  How do I read different sections of a CSV file when the first 5 lines sometimes has more than 1 colu
How do I read different sections of a CSV file when the first 5 lines sometimes has more than 1 colu

Time:11-15

I'm being supplied a CSV file that sometimes has the following format where the first 5 lines have more than one column:

File For EMS Team Downloaded By Bob Mortimer At 17:22:36 09/11/2021,,,,,,
line two content,,,,,,
line 3 content.,,,,,,
line 4 content.,,,,,,
,,,,,,
1,TEAM,Bob Jones,Sar a  require transport,A,,18:34:04hrs on 17/10/21
2,TEAM,Peter Smith,Sar h,H,,20:43:49hrs on 17/10/21
3,TEAM,Neil Barnes,SAR H,H,,20:15:12hrs on 17/10/21

Other times the first 4 lines only have 1 column and the 5th none:

File For EMS Team Downloaded By Bob Mortimer At 17:22:36 09/11/2021
line two content
line 3 content.
line 4 content.

1,TEAM,Bob Jones,Sar a  require transport,A,,18:34:04hrs on 17/10/21
2,TEAM,Peter Smith,Sar h,H,,20:43:49hrs on 17/10/21
3,TEAM,Neil Barnes,SAR H,H,,20:15:12hrs on 17/10/21

I need to be able to cater for both types of file. I want to regex data out of the first 4 lines and then go on to build a list from the "normal" data from line 6 onwards.

I'm using csv_reader and tried testing if the line had more than one column but that only works on the second example. I then tried to test if the second column was empty with:

if row[1] == None:

but I get an IndexError: list index out of range which I understand

I can't test if row[0] is a number as in the second example the 5th line doesn't have any columns. What I need to do is read in the first x number of rows and regex the name,time and date, skip the empty row then read the next set of rows as normal.

It's this reading of the first block in either format then the next block I'm struggling with.

Currently I've got:

with open('file.csv', 'r') as csvfile:
    csv_reader = csv.reader(csvfile)
    for row in csv_reader:
    if len(row) > 0 and row[0] != "":
        print(row)

which is missing out the empty row in both examples but I'm struggling with list index out of range errors when I try and test on row[1].

I'm sure there's a simple way of doing this but none of my googling has come up with anything yet.

CodePudding user response:

In both cases, I'd generalize your CSV as follows:

  • lines 1-4: special "lines" of text
  • line 5: garbage (discard)
  • lines 6-...: meaningful "rows"

Here's that general approach in code. The parse_special_csv function takes a filename as input and return two lists:

  • the first is a list of "lines" (1-4); they're technically rows, but it's more about how you treat them/what you do with them
  • the second is a list of rows, (lines 6-...)

My thinking being, that once you have the data split out, and file is completely parsed, you'll know what to do with lines and what to do with rows:

import csv

def parse_special_csv(fname):
    lines = []
    rows = []
    with open(fname, 'r', newline='') as f:
        reader = csv.reader(f)

        # Treat lines 1-4 as just "lines"
        for i in range(4):
            row = next(reader)    # manually advance the reader
            lines.append(row[0])  # safe to index for first column, because *you know* these lines have column-like data
        
        # Discard line 5
        next(reader)

        # Treat the remaining lines as CSV rows
        for row in reader:
            rows.append(row)

    return lines, rows

lines, rows = parse_special_csv('sample1.csv')
print('sample1')
print('lines:')
print(lines)
print('rows:')
print(rows)
print()

lines, rows = parse_special_csv('sample2.csv')
print('sample2')
print('lines:')
print(lines)
print('rows:')
print(rows)
print()

And I get, based on your samples:

sample1
lines:
[
 'File For EMS Team Downloaded By Bob Mortimer At 17:22:36 09/11/2021',
 'line two content',
 'line 3 content.',
 'line 4 content.'
]
rows:
[
 ['1', 'TEAM', 'Bob Jones', 'Sar a  require transport', 'A', '', '18:34:04hrs on 17/10/21'],
 ['2', 'TEAM', 'Peter Smith', 'Sar h', 'H', '', '20:43:49hrs on 17/10/21'],
 ['3', 'TEAM', 'Neil Barnes', 'SAR H', 'H', '', '20:15:12hrs on 17/10/21']
]

sample2
lines:
[
 'File For EMS Team Downloaded By Bob Mortimer At 17:22:36 09/11/2021',
 'line two content',
 'line 3 content.',
 'line 4 content.'
]
rows:
[
 ['1', 'TEAM', 'Bob Jones', 'Sar a  require transport', 'A', '', '18:34:04hrs on 17/10/21'],
 ['2', 'TEAM', 'Peter Smith', 'Sar h', 'H', '', '20:43:49hrs on 17/10/21'],
 ['3', 'TEAM', 'Neil Barnes', 'SAR H', 'H', '', '20:15:12hrs on 17/10/21']
]

Also, next(reader) may look a little foreign, but it's the correct way to manually advance the CSV reader^1 (and any iterator in Python, in general^2).

  • Related