Python read from csv with condition for TimeSeriesGenerator

Time:07-17

I have a .csv file with many entries that looks like this:

observation1, observation2, tag
observation1, observation2, tag
...
b r e a k
observation1, observation2, tag
...
b r e a k

where the observations are numbers and the tag is the ground truth (true/false).

The b r e a k line comes with the data and marks the end of a file, and with it the end of an observation chain. Datapoints between two b r e a k entries belong together. (All the datapoints were merged from multiple files into one huge csv.)

With this data I am supposed to do some machine learning using TensorFlow's TimeseriesGenerator (TSG).

I found out, however, that TSG uses a fixed time series chain length, which means I have to do some cutting/filtering of the data I was given.

Condition one is that if a true appears in a chain, it has to be the last value. Condition two is that all chains consist of the same number of entries.

This means that if my chain length were 3, the following chains would be allowed:

b r e a k
observation1, observation2, false
observation1, observation2, false
observation1, observation2, true
b r e a k
b r e a k
observation1, observation2, false
observation1, observation2, false
observation1, observation2, false
b r e a k

but not

b r e a k
observation1, observation2, false
observation1, observation2, true
observation1, observation2, false
b r e a k

A chain like this would also be allowed:

observation1, observation2, false
observation1, observation2, false
observation1, observation2, false
observation1, observation2, true

as I could simply throw the first line away to get a length of 3.

But not a chain like this:

observation1, observation2, false
b r e a k
observation1, observation2, false
observation1, observation2, true
b r e a k

This means I need some way (my guess would be pandas) to filter the .csv file and find all occurrences where, between two b r e a k lines, there are at least x false datapoints followed by a true or another false.

What would be a good way of achieving this filtering?

CodePudding user response:

I found a solution myself after tinkering around some more. I will post it in case anyone else ever stumbles upon this:

What I did was leave the chain length x out for now and simply filter for chains that are all false, or chains of falses up to and including the first true. I also did not change the base file as initially intended, but wrote to a new file.

For my case, I then padded all chains but the longest with 0 observations to ensure a uniform chain length across all input chains. (Not included in this code example; there are many tutorials online.)
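The padding step could look roughly like the sketch below. It is only an illustration under several assumptions: each chain is already a list of rows whose last column is the tag, padding goes at the front of a chain (so that a true stays the last value), and padding rows get the tag 'false'. The helper name pad_chains and its parameters are hypothetical.

```python
def pad_chains(chains, pad_value='0', pad_tag='false'):
    """Front-pad every chain with zero rows up to the length of the longest chain.

    Assumptions (not from the original code): rows are lists of strings,
    the last column is the tag, and padding rows are tagged 'false'.
    """
    if not chains:
        return []
    target = max(len(chain) for chain in chains)
    padded = []
    for chain in chains:
        if not chain:
            continue  # empty chain: no row to infer the column count from
        n_obs = len(chain[0]) - 1  # every column except the trailing tag
        pad_row = [pad_value] * n_obs + [pad_tag]
        missing = target - len(chain)
        # padding rows go first so the real observations keep their order
        padded.append([list(pad_row) for _ in range(missing)] + chain)
    return padded


chains = [
    [['1', '2', 'false'], ['3', '4', 'true']],
    [['5', '6', 'false'], ['7', '8', 'false'], ['9', '10', 'false']],
]
for chain in pad_chains(chains):
    print(chain)
```

Padding at the end instead would push a true away from the last position, which is why the sketch pads at the front.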

One could also cut all chains but the shortest down to its length, I guess, but I did not really try that.
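For completeness, cutting could be sketched as follows. This is untested against the real data and assumes the chains have already been filtered so that a true, if present, is the last row. Keeping the last rows of a chain then keeps the true at the end, which matches the "throw the first line away" example from the question. The function name cut_chain and the value of length are hypothetical.

```python
def cut_chain(chain, length):
    """Return the last `length` rows of a chain, or None if the chain is too short."""
    if length <= 0:
        raise ValueError('length must be positive')
    if len(chain) < length:
        return None  # too short to be cut to the requested length
    return chain[-length:]


chains = [
    [['1', '2', 'false'], ['3', '4', 'false'], ['5', '6', 'false'], ['7', '8', 'true']],
    [['9', '10', 'false'], ['11', '12', 'true']],
]
length = 3  # hypothetical chain length
kept = [c for c in (cut_chain(chain, length) for chain in chains) if c is not None]
print(kept)
```

Chains shorter than the requested length are dropped here; padding them instead would be the other option.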

Anyway, this code filters chains to the standard described above:

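import csv

input_file = 'pathtofile'
output_file = 'outfile'

# newline='' is what the csv module expects for files it reads or writes
with open(input_file, 'r', newline='') as inp, open(output_file, 'a', newline='') as out:
    writer = csv.writer(out)
    reader = csv.reader(inp)
    chain_done = False  # True once the current chain has reached its first true

    for row in reader:
        if not row:
            continue  # skip blank lines
        if row[0] == 'b r e a k':
            # end of a chain: reset the flag and write a separator row
            chain_done = False
            writer.writerow(['s'])
        elif not chain_done:
            writer.writerow(row)
            # the tag is the last column of the row; strip spaces after the comma
            if row[-1].strip() != 'false':
                chain_done = True  # drop everything after the first true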

The separator row ('s') written at each b r e a k ensures that separate chains are kept separate in the output file.
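To use the filtered file afterwards, the chains have to be split at the separator rows again. A minimal sketch of that, assuming the separator is a row containing only 's' as written above; the function name read_chains is hypothetical, and the in-memory sample stands in for the real output file:

```python
import csv
import io


def read_chains(lines, separator='s'):
    """Group CSV rows into chains, closing a chain at each separator row."""
    chains, current = [], []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if row == [separator]:
            if current:
                chains.append(current)  # ignore empty chains between separators
            current = []
        else:
            current.append(row)
    if current:
        chains.append(current)  # last chain without a trailing separator
    return chains


# small in-memory sample instead of the real output file
sample = io.StringIO("1,2,false\n3,4,true\ns\n5,6,false\ns\n")
print(read_chains(sample))
```

With a real file, an open file handle (opened with newline='') can be passed in place of the StringIO object.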
