Reverse part of a .csv file using Python-CodePudding

I had a csv file looking like this:

index label observation groundTruth
0     1     10.00       0
1     3      5.50       0
2     1     18.90       1
---------------------------
3     1     12.00       1
4     3     23.68       0
5     1     21.45       0
6     3      6.57       1
7     1     10.00       1

These data represents time series observations where each chain has a set lenght of 5. Since not all observation chains are 5 long by default, some padding was added to artifically increase length using this code to get this file:

index label observation groundTruth
0     1     10.00       0
1     3      5.50       0
2     1     18.90       1
3     0        0        0
4     0        0        0
--------------------------
5     1     12.00       1
6     3     23.68       0
7     1     21.45       0
8     3      6.57       1
9     1     10.00       1

This is the code:

line = [0,0,0]

with open(input_file, 'r') as inp, open(output_file, 'a') as out:
    writer = csv.writer(out)
    reader = csv.reader(inp)
    counter = 0
    
    for row in reader:
        counter  = 1
        if(row[0]=='s' and counter<6):
            while(counter<6):
                writer.writerow(line)
                counter =1
            counter=0
        else:
            writer.writerow(row)

My problem is, this padding needs to be at the beginning of each sequence, not the end.

What I need is the file to look like this:

index label observation groundTruth
0     0        0        0
1     0        0        0
2     1     10.00       0
3     3      5.50       0
4     1     18.90       1
--------------------------
5     1     12.00       1
6     3     23.68       0
7     1     21.45       0
8     3      6.57       1
9     1     10.00       1

I tried simply reversing the output csv file like this:

with open('data/test.csv', 'r') as inp, open('data/test_reverse.csv', 'a') as out:
    writer = csv.writer(out)
    reader = csv.reader(inp)
    
    for row in reversed(list(reader)):
        writer.writerow(row)

but this reverses the full time series which again produces non-sensible data that I do not want:

index label observation groundTruth
0     0        0        0
1     0        0        0
2     1     18.90       1
3     3      5.50       0
4     1     10.00       0
--------------------------
5     1     10.00       1
6     3      6.57       1
7     1     21.45       0
8     3     23.68       0
9     1     12.00       1

Any idea how to achive this?

Note: The --- is not part of my .csv, it is just there as help to make the problem clearer.

Note 2: Padding lines can be reliably detected, as label 0 is not a natural occurence within the data. (In case that helps solve the problem).

CodePudding user response：

If all observations are of length=5 then you can use next example how to move all rows with label="0" to front:

import csv
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


with open("data.csv", "r") as f_in, open("out.csv", "w") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)

    # write headers to output
    writer.writerow(next(reader))

    for rows in grouper(reader, 5):
        # save index column
        index_column, *_ = zip(*rows)

        # move rows with label=="0" to front:
        rows = sorted(rows, key=lambda k: k[1] != "0")

        # correct index column
        for i, r in zip(index_column, rows):
            r[0] = i

        # write to csv file
        writer.writerows(rows)

Writes out.csv:

index,label,observation,groundTruth
0,0,0,0
1,0,0,0
2,1,10.00,0
3,3,5.50,0
4,1,18.90,1
5,1,12.00,1
6,3,23.68,0
7,1,21.45,0
8,3,6.57,1
9,1,10.00,1

CodePudding user response：

How about modifying the original program to write the padding correctly?

(I'm using Python 3.10)

import csv
from typing import Any

Rows = list[list[Any]]


def pad_rows(rows: Rows) -> Rows:
    max_rows = 6

    n_rows = len(rows)
    if n_rows >= max_rows:
        return rows

    pad_n = max_rows - n_rows
    pad = [[0, 0, 0]] * pad_n
    return rows   pad


with (
    open("input.csv", newline="") as f_in,  # the csv module docs recommend newline=""
    open("output.csv", "w", newline="") as f_out,  # I changed "a" to "w" for my dev/testing
):
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)

    writer.writerow(next(reader))  # header

    series: Rows = []
    for row in reader:
        if row[0] == "s" and series != []:
            writer.writerows(pad_rows(series))
            series = []
            continue

        series.append(row)

    # Write final series if "s" (break) wasn't the last non-empty row
    if series != []:
        writer.writerows(pad_rows(series))

Actually, that produces the original, unwanted output:

| label | observation | groundTruth |
|-------|-------------|-------------|
| 1     | 10.00       | 0           |
| 3     | 5.50        | 0           |
| 1     | 18.90       | 1           |
| 0     | 0           | 0           |
| 0     | 0           | 0           |
| 1     | 12.00       | 1           |
| 3     | 23.68       | 0           |
| 1     | 21.45       | 0           |
| 3     | 6.57        | 1           |
| 1     | 10.00       | 1           |

I believe you can spot the one line to change to make it work the way you want. (hint: it's in the pad_rows function)