I have a dataset composed of many CSV files. Each file contains a series of dates and a number, and each file is an independent series, unrelated to the other files. The goal is to predict, for each individual CSV, the next date and the number associated with it. I would like to use an LSTM to solve this problem, but I don't know how to feed the data to it.
here is a sample of the data:
year | month | day | amount |
---|---|---|---|
2020 | 09 | 06 | 12.50 |
2020 | 09 | 10 | 12.50 |
2020 | 09 | 19 | 124.00 |
2020 | 10 | 02 | 13.06 |
2020 | 10 | 06 | 12.50 |
For the moment I have written some code that separates the data into training and test sets (by file name, with a 75%/25% ratio). Here is the code:
import os

INPUT_DATA_DIR = "dir/"
TRAIN_DATA_COEFFICIENT = 0.75

# Collect the file names in the top level of the data directory
# (the break stops os.walk from descending into subdirectories).
files = []
for (dirpath, dirnames, filenames) in os.walk(INPUT_DATA_DIR):
    files.extend(filenames)
    break

# Split the file list 75% / 25% into training and validation sets.
train_files_finish = int(len(files) * TRAIN_DATA_COEFFICIENT)
train_files = files[:train_files_finish]
validation_files = files[train_files_finish:]
CodePudding user response:
If you don't know where to start, take a look at https://www.tensorflow.org/tutorials/structured_data/time_series, which covers the basics of windowing time-series data for models like LSTMs.
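For a concrete starting point, here is a minimal sketch of the windowing idea from that tutorial, applied to a single file of the kind shown in the question. The file name, window size, and layer sizes are placeholders, and encoding the date as the gap in days since the previous row is just one possible choice for making the "next date" predictable:

import numpy as np
import pandas as pd
import tensorflow as tf

# Hypothetical file name; any one of your CSVs would do.
df = pd.read_csv("dir/series_001.csv")

# Derive two numeric features per row: the gap in days since the
# previous entry, and the amount. (In practice you would also
# normalize these features.)
dates = pd.to_datetime(df[["year", "month", "day"]])
gaps = dates.diff().dt.days.fillna(0).to_numpy(dtype="float32")
amounts = df["amount"].to_numpy(dtype="float32")
series = np.stack([gaps, amounts], axis=1)  # shape: (n_rows, 2)

# Slide a fixed-length window over the series: each sample is
# WINDOW consecutive steps, and the target is the step that follows.
WINDOW = 5  # arbitrary choice
X = np.stack([series[i:i + WINDOW] for i in range(len(series) - WINDOW)])
y = series[WINDOW:]

# A small LSTM that maps a window to the next (gap, amount) pair.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, 2)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(2),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10)

At prediction time, feed the last WINDOW rows through model.predict and add the predicted day gap to the last known date to recover the next date.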
The number of CSV files is irrelevant; you can always concatenate your data to prepare it for modeling.
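If you do take the concatenation route, it is worth tagging each row with its source file so the combined frame still distinguishes the independent series. A sketch, assuming the directory layout from the question:

import os
import pandas as pd

INPUT_DATA_DIR = "dir/"

# Read every CSV in the directory and tag each row with its source
# file, so the combined frame records which series a row belongs to.
frames = []
for name in sorted(os.listdir(INPUT_DATA_DIR)):
    if not name.endswith(".csv"):
        continue
    df = pd.read_csv(os.path.join(INPUT_DATA_DIR, name))
    df["series_id"] = name
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)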
CodePudding user response:
You should not concatenate datasets of independent time series. The best solution depends on many factors, including how large each dataset is, how important and/or relevant each dataset is, how the data was obtained for each dataset, and so on.
If you have at least one sufficiently large and informative dataset, training your model on it can be a good first step.
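One simple way to identify such a dataset is to rank the files by length. A sketch, assuming row count is a reasonable proxy for "sufficiently large":

import os
import pandas as pd

INPUT_DATA_DIR = "dir/"

# Rank the files by row count and pick the largest as the initial
# training set; the rest can be held out for fine-tuning or evaluation.
sizes = {
    name: len(pd.read_csv(os.path.join(INPUT_DATA_DIR, name)))
    for name in os.listdir(INPUT_DATA_DIR)
    if name.endswith(".csv")
}
largest = max(sizes, key=sizes.get)
print(f"Largest series: {largest} ({sizes[largest]} rows)")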