Home > database >  Reverse part of a .csv file using Python
Reverse part of a .csv file using Python

Time:07-17

I had a csv file looking like this:

index label observation groundTruth
0     1     10.00       0
1     3      5.50       0
2     1     18.90       1
---------------------------
3     1     12.00       1
4     3     23.68       0
5     1     21.45       0
6     3      6.57       1
7     1     10.00       1

These data represents time series observations where each chain has a set lenght of 5. Since not all observation chains are 5 long by default, some padding was added to artifically increase length using this code to get this file:

index label observation groundTruth
0     1     10.00       0
1     3      5.50       0
2     1     18.90       1
3     0        0        0
4     0        0        0
--------------------------
5     1     12.00       1
6     3     23.68       0
7     1     21.45       0
8     3      6.57       1
9     1     10.00       1

This is the code:

line = [0,0,0]

with open(input_file, 'r') as inp, open(output_file, 'a') as out:
    writer = csv.writer(out)
    reader = csv.reader(inp)
    counter = 0
    
    for row in reader:
        counter  = 1
        if(row[0]=='s' and counter<6):
            while(counter<6):
                writer.writerow(line)
                counter =1
            counter=0
        else:
            writer.writerow(row)

My problem is, this padding needs to be at the beginning of each sequence, not the end.

What I need is the file to look like this:

index label observation groundTruth
0     0        0        0
1     0        0        0
2     1     10.00       0
3     3      5.50       0
4     1     18.90       1
--------------------------
5     1     12.00       1
6     3     23.68       0
7     1     21.45       0
8     3      6.57       1
9     1     10.00       1

I tried simply reversing the output csv file like this:

with open('data/test.csv', 'r') as inp, open('data/test_reverse.csv', 'a') as out:
    writer = csv.writer(out)
    reader = csv.reader(inp)
    
    for row in reversed(list(reader)):
        writer.writerow(row)

but this reverses the full time series which again produces non-sensible data that I do not want:

index label observation groundTruth
0     0        0        0
1     0        0        0
2     1     18.90       1
3     3      5.50       0
4     1     10.00       0
--------------------------
5     1     10.00       1
6     3      6.57       1
7     1     21.45       0
8     3     23.68       0
9     1     12.00       1

Any idea how to achive this?

Note: The --- is not part of my .csv, it is just there as help to make the problem clearer.

Note 2: Padding lines can be reliably detected, as label 0 is not a natural occurence within the data. (In case that helps solve the problem).

CodePudding user response:

If all observations are of length=5 then you can use next example how to move all rows with label="0" to front:

import csv
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


with open("data.csv", "r") as f_in, open("out.csv", "w") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)

    # write headers to output
    writer.writerow(next(reader))

    for rows in grouper(reader, 5):
        # save index column
        index_column, *_ = zip(*rows)

        # move rows with label=="0" to front:
        rows = sorted(rows, key=lambda k: k[1] != "0")

        # correct index column
        for i, r in zip(index_column, rows):
            r[0] = i

        # write to csv file
        writer.writerows(rows)

Writes out.csv:

index,label,observation,groundTruth
0,0,0,0
1,0,0,0
2,1,10.00,0
3,3,5.50,0
4,1,18.90,1
5,1,12.00,1
6,3,23.68,0
7,1,21.45,0
8,3,6.57,1
9,1,10.00,1

CodePudding user response:

How about modifying the original program to write the padding correctly?

(I'm using Python 3.10)

import csv
from typing import Any

Rows = list[list[Any]]


def pad_rows(rows: Rows) -> Rows:
    max_rows = 6

    n_rows = len(rows)
    if n_rows >= max_rows:
        return rows

    pad_n = max_rows - n_rows
    pad = [[0, 0, 0]] * pad_n
    return rows   pad


with (
    open("input.csv", newline="") as f_in,  # the csv module docs recommend newline=""
    open("output.csv", "w", newline="") as f_out,  # I changed "a" to "w" for my dev/testing
):
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)

    writer.writerow(next(reader))  # header

    series: Rows = []
    for row in reader:
        if row[0] == "s" and series != []:
            writer.writerows(pad_rows(series))
            series = []
            continue

        series.append(row)

    # Write final series if "s" (break) wasn't the last non-empty row
    if series != []:
        writer.writerows(pad_rows(series))

Actually, that produces the original, unwanted output:

| label | observation | groundTruth |
|-------|-------------|-------------|
| 1     | 10.00       | 0           |
| 3     | 5.50        | 0           |
| 1     | 18.90       | 1           |
| 0     | 0           | 0           |
| 0     | 0           | 0           |
| 1     | 12.00       | 1           |
| 3     | 23.68       | 0           |
| 1     | 21.45       | 0           |
| 3     | 6.57        | 1           |
| 1     | 10.00       | 1           |

I believe you can spot the one line to change to make it work the way you want. (hint: it's in the pad_rows function)

  • Related