Training a model with several large CSV files


I have a dataset composed of several large CSV files. Their total size exceeds the RAM of the machine on which the training runs.

I need to train an ML model with Scikit-Learn, TensorFlow, or PyTorch (think SVR, not deep learning). I need to use the whole dataset, which is impossible to load into memory at once. Any recommendation on how to overcome this, please?

CodePudding user response:

I have been in this situation before, and my suggestion would be to take a step back and look at the problem again.

Does your model absolutely need all of the data at once, or can it be trained in batches? It's also possible that the model you are using can be trained in batches, but the library you are using does not support it. In that situation, either find a library that does support batch training or, if no such library exists (unlikely), "reinvent the wheel" yourself, i.e., implement the model from scratch with batch support. However, as your question mentions, you need to use a model from Scikit-Learn, TensorFlow, or PyTorch. If you want to stick with those libraries, there are techniques such as those Alexey Larionov and I'mahdi mentioned in the comments on your question in relation to PyTorch and TensorFlow; a sketch of the batch-wise idea follows below.
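As a minimal sketch of the batch-wise ("out-of-core") approach in Scikit-Learn: the file pattern `data/*.csv`, the target column name `target`, and the chunk size are illustrative assumptions, and `SGDRegressor` with the epsilon-insensitive loss is used as a linear SVR-style stand-in because `sklearn.svm.SVR` itself has no `partial_fit`. In TensorFlow or PyTorch the equivalent is a `tf.data` pipeline or a custom `Dataset`/`DataLoader` that reads the CSVs lazily.

```python
# Sketch: incremental training on CSV chunks that never loads the full dataset.
# Assumptions (not from the original post): files in data/*.csv, a "target"
# column, and a linear SVR-like model being acceptable.
import glob

import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

model = SGDRegressor(loss="epsilon_insensitive")  # linear SVR-style loss
scaler = StandardScaler()

# First pass: fit the scaler incrementally so features share a common scale.
for path in glob.glob("data/*.csv"):
    for chunk in pd.read_csv(path, chunksize=100_000):
        scaler.partial_fit(chunk.drop(columns=["target"]))

# Second pass: train one chunk at a time; memory use is bounded by the
# chunk size rather than by the total dataset size.
for path in glob.glob("data/*.csv"):
    for chunk in pd.read_csv(path, chunksize=100_000):
        X = scaler.transform(chunk.drop(columns=["target"]))
        y = chunk["target"]
        model.partial_fit(X, y)
```

Two passes over the files keep the code simple; if even two passes are too slow, the scaler statistics and the model updates can be combined into a single loop at the cost of slightly less accurate scaling early on.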

Is all of your data actually relevant? Once I found that a whole subset of my data was useless for the problem I was trying to solve; another time I found that it was only marginally helpful. Dimensionality reduction, numerosity reduction, and statistical modeling may be your friends here. Here is a link to the Wikipedia page about data reduction:

https://en.wikipedia.org/wiki/Data_reduction

Not only will data reduction reduce the amount of memory you need, it will also improve your model. Bad data in means bad data out.
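Data reduction can also be done without loading everything at once. As a hedged sketch, Scikit-Learn's `IncrementalPCA` accepts data in chunks; the file pattern, column name, and the choice of 20 components below are illustrative assumptions.

```python
# Sketch: chunk-wise dimensionality reduction with IncrementalPCA.
import glob

import pandas as pd
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=20)  # assumed target dimensionality

# Fit the projection chunk by chunk; later, call ipca.transform(chunk)
# inside the training loop to feed a smaller feature matrix to the model.
for path in glob.glob("data/*.csv"):
    for chunk in pd.read_csv(path, chunksize=100_000):
        ipca.partial_fit(chunk.drop(columns=["target"]))
```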
