Number of files: 894, total file size: 22.2 GB. I have to do machine learning by reading many CSV files. There is not enough memory to read them all at once.
CodePudding user response:
Specifically, to load a large number of files that do not fit in memory, one can use dask:
import dask.dataframe as dd
df = dd.read_csv('file-*.csv')
This will create a lazy version of the data, meaning the data will be loaded only when requested, e.g. df.head() will load only the first 5 rows. Where possible, pandas syntax will apply to dask dataframes.
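For example, pandas-style aggregations stay lazy until .compute() is called. A minimal sketch, assuming the CSVs share a schema; the column names 'label' and 'value' are hypothetical:

import dask.dataframe as dd

df = dd.read_csv('file-*.csv')                        # lazy: only metadata is inspected
print(df.head())                                      # reads just the first few rows from the first file
mean_per_group = df.groupby('label')['value'].mean()  # still lazy, builds a task graph
print(mean_per_group.compute())                       # triggers the out-of-core computation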
For machine learning you can use dask-ml, which has tight integration with sklearn; see the docs.
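A rough sketch of what that can look like, assuming dask-ml is installed and the CSVs contain numeric feature columns 'x1', 'x2' and a binary 'target' column (hypothetical names):

import dask.dataframe as dd
from dask_ml.linear_model import LogisticRegression

df = dd.read_csv('file-*.csv')
X = df[['x1', 'x2']].to_dask_array(lengths=True)   # dask array, still out-of-core
y = df['target'].to_dask_array(lengths=True)

model = LogisticRegression()
model.fit(X, y)   # trained without materialising all 22 GB in memory at once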
CodePudding user response:
You can read your files in chunks, but that does not help during the training phase unless you select an algorithm that can be trained incrementally on your data. However, having such big files for model training usually means you should do some data preparation first, which will reduce the size of the files significantly.
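If you do choose an incremental algorithm, the chunked approach can be done with plain pandas and a scikit-learn estimator that supports partial_fit. This is only a sketch; the file pattern, column names, chunk size, and choice of SGDClassifier are illustrative assumptions:

import glob
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = [0, 1]   # all class labels must be known up front for partial_fit

for path in glob.glob('file-*.csv'):
    # read each CSV in fixed-size chunks instead of loading it whole
    for chunk in pd.read_csv(path, chunksize=100_000):
        X = chunk[['x1', 'x2']].to_numpy()   # hypothetical feature columns
        y = chunk['target'].to_numpy()       # hypothetical label column
        model.partial_fit(X, y, classes=classes)   # incremental update, bounded memory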