Read csv file with empty lines

Time:04-27

The analysis software I'm using writes many groups of results to a single CSV file and separates the groups with two empty lines. I would like to split the results into groups so that I can then analyse them separately.

I'm sure there is a built-in function in Python (or one of its libraries) that does this. I tried this piece of code that I found somewhere, but it doesn't seem to work.

import csv
results = open('03_12_velocity_y.csv').read().split("\n\n")
# Feed first csv.reader
first_csv = csv.reader(results[0], delimiter=',')
# Feed second csv.reader
second_csv = csv.reader(results[1], delimiter=',')

CodePudding user response:

If your row counts are inconsistent across groups, you'll need a little state machine to check when you're between groups and do something with the last group.

#!/usr/bin/env python3

import csv


def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as out_f:
        csv.writer(out_f).writerows(group)


with open("input.csv", newline="") as f:
    reader = csv.reader(f)

    group_i = 1
    group = []
    last_row = []
    for row in reader:
        if row == [] and last_row == [] and group != []:
            write_group(group, group_i)
            group = []
            group_i += 1
            continue

        if row == []:
            last_row = row
            continue

        group.append(row)
        last_row = row

# flush remaining group
if group != []:
    write_group(group, group_i)

I mocked up this sample CSV:

g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3


g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3


g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3

And when I run the program above I get three CSV files:

group_1.csv

g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3

group_2.csv

g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3

group_3.csv

g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
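
If the goal is to analyse the groups in memory rather than write them to files, the same split can be sketched with itertools.groupby. This is only a sketch under some assumptions: split_groups is a hypothetical helper name, and it treats any run of one or more empty lines as a separator, which is looser than the two-empty-line rule used above. Note that csv.reader expects an iterable of lines, so the sample text is passed through splitlines() rather than handed over as one string.

```python
import csv
from itertools import groupby


def split_groups(lines):
    """Return a list of groups; each group is a list of parsed CSV rows.

    `lines` is any iterable of text lines (an open file works too).
    Hypothetical helper, not part of the csv module.
    """
    reader = csv.reader(lines)
    # csv.reader yields [] for an empty line, so bool(row) is False for
    # separator lines and True for data rows. groupby batches consecutive
    # rows that share the same truthiness; only the data batches are kept.
    return [list(rows) for has_data, rows in groupby(reader, key=bool) if has_data]


# Tiny inline sample (made up for illustration): two groups, two empty lines apart.
sample = "g1r1c1,g1r1c2\ng1r2c1,g1r2c2\n\n\ng2r1c1,g2r1c2\n"
groups = split_groups(sample.splitlines())

for i, group in enumerate(groups, start=1):
    print(f"group {i}: {len(group)} rows")
```

With a real file, the file object opened with newline="" can be passed directly in place of the list of lines. A line that contains only spaces is not empty to csv.reader, so it would count as data here.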

CodePudding user response:

If your row counts are consistent, you can do this with fairly vanilla Python or using the Pandas library.

Vanilla Python

  • Define your group size and the size of the break (in "rows") between groups.
  • Loop over all the rows adding each row to a group accumulator.
  • When the group accumulator reaches the pre-defined group size, do something with it, reset the accumulator, and then skip break-size rows.

Here, I'm writing each group to its own numbered file:

import csv

group_sz = 5
break_sz = 2


def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as f_out:
        csv.writer(f_out).writerows(group)


with open("input.csv", newline="") as f_in:
    reader = csv.reader(f_in)

    group_i = 1
    group = []
    for row in reader:
        group.append(row)

        if len(group) == group_sz:
            write_group(group, group_i)

            group_i += 1
            group = []

            for _ in range(break_sz):
                try:
                    next(reader)
                except StopIteration:  # gracefully ignore an expected StopIteration (at the end of the file)
                    break

group_1.csv

g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
g1r4c1,g1r4c2,g1r4c3
g1r5c1,g1r5c2,g1r5c3
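
The same fixed-size idea can also be sketched as a generator that yields each group instead of writing it to a file, which may be handier when the groups are analysed straight away. The name fixed_size_groups and the small inline sample are assumptions made for illustration. Like the code above, it relies on every group having exactly group_sz rows; unlike the code above, it also yields a shorter final group if the input ends early.

```python
import csv
from itertools import islice


def fixed_size_groups(rows, group_sz, break_sz):
    """Yield lists of `group_sz` rows, skipping `break_sz` rows after each.

    Hypothetical helper; `rows` can be a csv.reader or any iterable of rows.
    """
    rows = iter(rows)
    while True:
        # Take up to group_sz rows; an empty result means the input is exhausted.
        group = list(islice(rows, group_sz))
        if not group:
            return
        yield group
        # Consume the separator rows (fewer at end of input is fine).
        for _ in islice(rows, break_sz):
            pass


# Made-up sample: groups of 2 rows separated by 2 empty lines.
lines = ["a,1", "b,2", "", "", "c,3", "d,4"]
for i, group in enumerate(fixed_size_groups(csv.reader(lines), 2, 2), start=1):
    print(i, group)
```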

With Pandas

I'm new to Pandas, and learning this as I go, but it looks like Pandas automatically skips blank lines when it reads a CSV (read_csv defaults to skip_blank_lines=True).

With that in mind, all you need to do is specify the size of your group, and tell Pandas to read your CSV file in "iterator mode", where you can ask for a chunk (your group size) of records at a time:

import pandas as pd

group_sz = 5

with pd.read_csv("input.csv", header=None, iterator=True) as reader:
    i = 1
    while True:
        try:
            df = reader.get_chunk(group_sz)
        except StopIteration:
            break

        df.to_csv(f"group_{i}.csv")
        i += 1

Pandas adds an index column and a default header row when it writes out the CSV (pass index=False and header=False to to_csv to leave them out):

group_1.csv

,0,1,2
0,g1r1c1,g1r1c2,g1r1c3
1,g1r2c1,g1r2c2,g1r2c3
2,g1r3c1,g1r3c2,g1r3c3
3,g1r4c1,g1r4c2,g1r4c3
4,g1r5c1,g1r5c2,g1r5c3

CodePudding user response:

Try this out with your file:

import pandas as pd

# csv file name to be read in
in_csv = 'input.csv'

# get the number of lines of the csv file to be read
with open(in_csv) as f:
    number_lines = sum(1 for _ in f)

# number of rows to write to each output file;
# you can change the row size according to your need
rowsize = 500

# loop through the data, writing each set of rows to a new file
# (starting at 1 skips the first line, e.g. a header row; use 0 if there is none)
for i in range(1, number_lines, rowsize):
    df = pd.read_csv(in_csv,
                     header=None,
                     nrows=rowsize,  # number of rows to read at each loop
                     skiprows=i)     # skip rows that have been read

    # output file with an indexed name: input1.csv, input501.csv, etc.
    out_csv = 'input' + str(i) + '.csv'

    df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a',  # append data to csv file
              )