I have a .csv file with many entries that looks like this:
observation1, observation2, tag
observation1, observation2, tag
...
b r e a k
observation1, observation2, tag
...
b r e a k
Here the observations are numbers and the tag is the ground truth (true/false). The b r e a k line comes with the data and marks the end of a file, and therefore the end of an observation chain. Datapoints between two b r e a k entries belong together. (All these datapoints were merged from multiple files into one huge csv.)
With this data I am supposed to do some machine learning using the TensorFlow TimeseriesGenerator.
However, I found out that TSG uses a fixed time series chain length, which means I have to do some cutting/filtering of my data.
Condition one is that if a true appears in the chain, it has to be the last value. Condition two is that all chains consist of the same number of entries.
This means that if my chain length were 3, the following chains would be allowed:
b r e a k
observation1, observation2, false
observation1, observation2, false
observation1, observation2, true
b r e a k
b r e a k
observation1, observation2, false
observation1, observation2, false
observation1, observation2, false
b r e a k
but not
b r e a k
observation1, observation2, false
observation1, observation2, true
observation1, observation2, false
b r e a k
A chain like this would also be allowed:
observation1, observation2, false
observation1, observation2, false
observation1, observation2, false
observation1, observation2, true
as I could simply throw the first line away to get a length of 3.
But not a chain like this:
observation1, observation2, false
b r e a k
observation1, observation2, false
observation1, observation2, true
b r e a k
This means I need some way (my guess would be pandas) to filter the .csv file and find all occurrences where, between two b r e a k lines, there are at least x false datapoints followed by a true or another false.
What would be a good way of achieving this filtering?
CodePudding user response:
I found a solution myself after tinkering around some more. I will post it in case anyone else ever stumbles upon this:
What I did was leave the chain length x out for now and simply filter for chains that are all false, or chains of false values up to the first true. I also did not change the base file as initially intended, but wrote to a new file.
For my case, I then padded all chains but the longest with 0 observations to ensure a unified chain length across all input chains. (Not included in this code example; there are many tutorials online.)
One could also cut all chains but the shortest down to a common length, I guess, but I did not really try that.
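For illustration, the padding or cutting step could look roughly like the sketch below. It is not part of my original code: the name fix_length is made up, and padding at the front with a false tag is an assumption (chosen so that a trailing true stays the last entry). Chains are assumed to be lists of rows as produced by csv.reader.

```python
def fix_length(chain, length, pad_row=("0", "0", "false")):
    """Return a copy of `chain` with exactly `length` rows.

    Hypothetical helper. Longer chains keep only their last `length`
    rows, so a trailing 'true' remains the final entry. Shorter chains
    are padded at the front with `pad_row` (an assumed filler row).
    """
    if len(chain) >= length:
        # Cut: drop rows from the start of the chain.
        return [list(row) for row in chain[-length:]]
    # Pad: prepend as many filler rows as are missing.
    missing = length - len(chain)
    padding = [list(pad_row) for _ in range(missing)]
    return padding + [list(row) for row in chain]
```

Calling fix_length(chain, 3) on a chain of four rows drops the first row, which matches the "throw the first line away" case from the question.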
Anyway, this code filters chains according to the standard described above:
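import csv

input_file = 'pathtofile'
output_file = 'outfile'

# newline='' lets the csv module handle line endings itself.
# Mode 'a' appends to the output file; use 'w' to start fresh on each run.
with open(input_file, 'r', newline='') as inp, open(output_file, 'a', newline='') as out:
    writer = csv.writer(out)
    reader = csv.reader(inp)
    helper_value = 0  # set to 1 once the current chain has reached its first 'true'
    for row in reader:
        if not row:
            continue  # skip empty lines
        if row[0].strip() == 'b r e a k':
            # End of a chain: reset the flag and write a separator row.
            helper_value = 0
            writer.writerow('s')
        elif helper_value == 0:
            # The tag is the last column (index 2 in the sample above);
            # strip() removes the space that follows the comma.
            gt = row[-1].strip()
            writer.writerow(row)
            if gt != 'false':
                # First 'true' of this chain: keep it, skip the rest of the chain.
                helper_value = 1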
The writer.writerow('s') call writes a separator row to ensure separate chains are kept separate.
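To work with the filtered output afterwards, it has to be split at the s rows again. A minimal sketch of that, assuming the separator is a row whose only field is s; the function name read_chains is made up for this example and takes any iterable of text lines (an open file works):

```python
import csv


def read_chains(lines, separator="s"):
    """Group csv rows into chains, splitting at separator rows.

    Hypothetical helper. `lines` is any iterable of text lines, for
    example a file opened with newline=''. Empty chains (two separators
    in a row) are skipped.
    """
    chains = []
    current = []
    for row in csv.reader(lines):
        if not row:
            continue  # ignore blank lines
        if row[0].strip() == separator:
            # Separator row: close the current chain if it has any rows.
            if current:
                chains.append(current)
            current = []
        else:
            # Strip the spaces that follow the commas in the data.
            current.append([cell.strip() for cell in row])
    if current:
        # The last chain may not be followed by a separator.
        chains.append(current)
    return chains
```

With a file this would be used as: with open(output_file, newline='') as f: chains = read_chains(f). Each chain is then a list of rows that can be padded or cut to a common length.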