KEY POINT: the dataset is so large that I can barely even store it on disk (petabytes).
Say I have trillions and trillions of rows in a dataset. This dataset is far too large to fit in memory. I want to train a machine learning model, say logistic regression, on this dataset. How do I go about this?
Now, I know Amazon/Google do machine learning on huge amounts of data. How do they go about it? For example, a click dataset where the inputs of every smart device worldwide are stored in a single dataset.
Desperately looking for new ideas and open to corrections.
My train of thought:
- load a part of the data into memory
- perform gradient descent on that part
This way the optimization is mini-batch gradient descent (a rough sketch follows below).
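A minimal sketch of that loop, assuming the data sits in a single CSV file (the path `data.csv`, the feature count, and the label-in-last-column layout are all assumptions on my part) and writing the logistic-regression gradient step by hand in NumPy:

```python
import numpy as np
import pandas as pd

PATH = "data.csv"         # hypothetical path to the huge CSV
N_FEATURES = 100          # assumed feature count; set to the real one
LR = 0.01                 # learning rate
CHUNK = 100_000           # rows held in memory at a time

w = np.zeros(N_FEATURES)  # logistic regression weights
b = 0.0                   # bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One pass over the file in chunks = one epoch of mini-batch gradient descent.
for chunk in pd.read_csv(PATH, chunksize=CHUNK):
    X = chunk.iloc[:, :-1].to_numpy(dtype=np.float64)  # features: all but last column
    y = chunk.iloc[:, -1].to_numpy(dtype=np.float64)   # label (0/1): last column

    p = sigmoid(X @ w + b)            # predicted probabilities on this mini-batch
    grad_w = X.T @ (p - y) / len(y)   # gradient of the average logistic loss
    grad_b = np.mean(p - y)

    w -= LR * grad_w
    b -= LR * grad_b
```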
Now the problem is that the optimization, be it SGD or mini-batch, in the worst case only stops after it has gone through ALL the data. Traversing the whole dataset is not possible.
So I had the idea of early stopping. Early stopping reserves a validation set and stops the optimization when the error on that validation set stops going down/converges. But again, this might not be feasible due to the size of the dataset.
Now I am thinking of simply randomly sampling a training set and a test set of workable sizes and training the model on those (sketched below).
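One way to draw that subsample without ever loading the full file, again just a sketch assuming a single CSV at a hypothetical path `data.csv`, is to stream the file and keep a small random fraction of each chunk:

```python
import numpy as np
import pandas as pd

PATH = "data.csv"   # hypothetical path
FRACTION = 1e-6     # keep roughly one row in a million; tune to get a workable size
rng = np.random.default_rng(42)
pieces = []

# Stream the file and keep a random fraction of every chunk;
# the concatenation is (approximately) a uniform random sample of all rows.
for chunk in pd.read_csv(PATH, chunksize=1_000_000):
    pieces.append(chunk.sample(frac=FRACTION, random_state=rng))

sample = pd.concat(pieces, ignore_index=True)

# Split the now-manageable subsample into train and test sets.
train = sample.sample(frac=0.8, random_state=0)
test = sample.drop(train.index)
```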
CodePudding user response:
Pandas' read functions load the entire dataset into RAM, which can be an issue. To solve this, process the data in chunks.
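For example, a sketch combining pandas' chunked reading with scikit-learn's `SGDClassifier`, whose `partial_fit` method trains a logistic-regression-style model one chunk at a time (the path `data.csv` and the `label` column name are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# loss="log_loss" makes SGDClassifier a logistic regression trained by SGD
# (older scikit-learn versions call it loss="log"); partial_fit updates the
# model incrementally, so only one chunk is in RAM at a time.
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # all classes must be declared on the first partial_fit call

for chunk in pd.read_csv("data.csv", chunksize=100_000):  # "data.csv" is a placeholder
    X = chunk.drop(columns=["label"]).to_numpy()           # "label" is a placeholder column
    y = chunk["label"].to_numpy()
    model.partial_fit(X, y, classes=classes)
```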
CodePudding user response:
With a huge amount of data you can train on the dataset in batches. Also consider more complex models, such as neural networks or xgboost, instead of logistic regression.
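As a rough illustration of batch-wise training with xgboost (my own sketch of what this could look like, not necessarily what the answer intends): `xgb.train` accepts an `xgb_model` argument to continue boosting from an existing booster, so each chunk adds extra trees instead of requiring the whole dataset in memory. The path and column name below are placeholders.

```python
import pandas as pd
import xgboost as xgb

params = {"objective": "binary:logistic", "max_depth": 6, "eta": 0.1}
booster = None

# Train on one chunk at a time; xgb_model= resumes from the previous booster,
# so each chunk contributes additional trees to the ensemble.
for chunk in pd.read_csv("data.csv", chunksize=1_000_000):                      # placeholder path
    dtrain = xgb.DMatrix(chunk.drop(columns=["label"]), label=chunk["label"])   # placeholder column
    booster = xgb.train(params, dtrain, num_boost_round=10, xgb_model=booster)
```

Note that boosting chunk by chunk is not identical to fitting on the full dataset at once, so treat this as an approximation rather than an exact recipe.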
CodePudding user response:
Check out this website for more information on how to handle big data.