I had a csv file looking like this:
index label observation groundTruth
0 1 10.00 0
1 3 5.50 0
2 1 18.90 1
---------------------------
3 1 12.00 1
4 3 23.68 0
5 1 21.45 0
6 3 6.57 1
7 1 10.00 1
These data represents time series observations where each chain has a set lenght of 5. Since not all observation chains are 5 long by default, some padding was added to artifically increase length using this code to get this file:
index label observation groundTruth
0 1 10.00 0
1 3 5.50 0
2 1 18.90 1
3 0 0 0
4 0 0 0
--------------------------
5 1 12.00 1
6 3 23.68 0
7 1 21.45 0
8 3 6.57 1
9 1 10.00 1
This is the code:
line = [0,0,0]
with open(input_file, 'r') as inp, open(output_file, 'a') as out:
writer = csv.writer(out)
reader = csv.reader(inp)
counter = 0
for row in reader:
counter = 1
if(row[0]=='s' and counter<6):
while(counter<6):
writer.writerow(line)
counter =1
counter=0
else:
writer.writerow(row)
My problem is, this padding needs to be at the beginning of each sequence, not the end.
What I need is the file to look like this:
index label observation groundTruth
0 0 0 0
1 0 0 0
2 1 10.00 0
3 3 5.50 0
4 1 18.90 1
--------------------------
5 1 12.00 1
6 3 23.68 0
7 1 21.45 0
8 3 6.57 1
9 1 10.00 1
I tried simply reversing the output csv file like this:
with open('data/test.csv', 'r') as inp, open('data/test_reverse.csv', 'a') as out:
writer = csv.writer(out)
reader = csv.reader(inp)
for row in reversed(list(reader)):
writer.writerow(row)
but this reverses the full time series which again produces non-sensible data that I do not want:
index label observation groundTruth
0 0 0 0
1 0 0 0
2 1 18.90 1
3 3 5.50 0
4 1 10.00 0
--------------------------
5 1 10.00 1
6 3 6.57 1
7 1 21.45 0
8 3 23.68 0
9 1 12.00 1
Any idea how to achive this?
Note: The ---
is not part of my .csv, it is just there as help to make the problem clearer.
Note 2: Padding lines can be reliably detected, as label 0
is not a natural occurence within the data. (In case that helps solve the problem).
CodePudding user response:
If all observations are of length=5 then you can use next example how to move all rows with label="0" to front:
import csv
from itertools import zip_longest
def grouper(iterable, n, fillvalue=None):
"Collect data into fixed-length chunks or blocks"
# grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx"
args = [iter(iterable)] * n
return zip_longest(*args, fillvalue=fillvalue)
with open("data.csv", "r") as f_in, open("out.csv", "w") as f_out:
reader = csv.reader(f_in)
writer = csv.writer(f_out)
# write headers to output
writer.writerow(next(reader))
for rows in grouper(reader, 5):
# save index column
index_column, *_ = zip(*rows)
# move rows with label=="0" to front:
rows = sorted(rows, key=lambda k: k[1] != "0")
# correct index column
for i, r in zip(index_column, rows):
r[0] = i
# write to csv file
writer.writerows(rows)
Writes out.csv
:
index,label,observation,groundTruth
0,0,0,0
1,0,0,0
2,1,10.00,0
3,3,5.50,0
4,1,18.90,1
5,1,12.00,1
6,3,23.68,0
7,1,21.45,0
8,3,6.57,1
9,1,10.00,1
CodePudding user response:
How about modifying the original program to write the padding correctly?
(I'm using Python 3.10)
import csv
from typing import Any
Rows = list[list[Any]]
def pad_rows(rows: Rows) -> Rows:
max_rows = 6
n_rows = len(rows)
if n_rows >= max_rows:
return rows
pad_n = max_rows - n_rows
pad = [[0, 0, 0]] * pad_n
return rows pad
with (
open("input.csv", newline="") as f_in, # the csv module docs recommend newline=""
open("output.csv", "w", newline="") as f_out, # I changed "a" to "w" for my dev/testing
):
reader = csv.reader(f_in)
writer = csv.writer(f_out)
writer.writerow(next(reader)) # header
series: Rows = []
for row in reader:
if row[0] == "s" and series != []:
writer.writerows(pad_rows(series))
series = []
continue
series.append(row)
# Write final series if "s" (break) wasn't the last non-empty row
if series != []:
writer.writerows(pad_rows(series))
Actually, that produces the original, unwanted output:
| label | observation | groundTruth |
|-------|-------------|-------------|
| 1 | 10.00 | 0 |
| 3 | 5.50 | 0 |
| 1 | 18.90 | 1 |
| 0 | 0 | 0 |
| 0 | 0 | 0 |
| 1 | 12.00 | 1 |
| 3 | 23.68 | 0 |
| 1 | 21.45 | 0 |
| 3 | 6.57 | 1 |
| 1 | 10.00 | 1 |
I believe you can spot the one line to change to make it work the way you want. (hint: it's in the pad_rows function)