How to store a set of arrays for deep learning not consuming too much memory (Python)?


I'm trying to do research in which the observations of my dataset are represented by matrices (arrays of numbers, similar to how images are represented for deep learning, but mine are not images) of different shapes.

What I've already tried is writing those arrays as lists in one column of a pandas dataframe and then saving it as a csv/excel file. After that I planned to simply load the file, convert those lists back to arrays of the appropriate shapes, and then convert the set of arrays to a tensor that I would finally use for training the deep model in Keras.
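For concreteness, a minimal sketch of that approach (the column names and file name are just placeholders, not from my actual code):

```python
import ast

import numpy as np
import pandas as pd

# Matrices of different shapes, standing in for my real observations
arrays = [np.random.rand(3, 4), np.random.rand(5, 2)]

# Flatten each array into a list and remember its shape so it can be rebuilt later
df = pd.DataFrame({
    "data": [a.flatten().tolist() for a in arrays],
    "shape": [a.shape for a in arrays],
})
df.to_csv("dataset.csv", index=False)

# Loading back: the lists come in as strings, so they have to be parsed again
loaded = pd.read_csv("dataset.csv")
restored = [
    np.array(ast.literal_eval(d)).reshape(ast.literal_eval(s))
    for d, s in zip(loaded["data"], loaded["shape"])
]
```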

But this method seems extremely inefficient, because only 1/6 of my dataset already occupies about 6 GB (pandas saved as csv), which is huge, and I won't be able to load it into RAM (I'm using Google Colab to run my experiments).

So my question is: is there any other way of storing a set of arrays of different shapes that won't occupy so much memory? Maybe I can store tensors directly somehow? Or maybe there is a way to save a pandas dataframe in some compressed file format that is not so heavy?

CodePudding user response:

Are you storing purely (or mostly) continuous variables? If so, maybe you could reduce the precision of these variables (e.g., from float64 to float32) if you don't need such an accurate value per datapoint.
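A minimal sketch of that idea with NumPy (the array here is just a stand-in for your data):

```python
import numpy as np

arr = np.random.rand(1000, 1000)      # float64 by default
print(arr.nbytes / 1e6)               # ~8.0 MB

arr32 = arr.astype(np.float32)        # half the memory, usually enough for deep learning
print(arr32.nbytes / 1e6)             # ~4.0 MB

# If you build the arrays yourself, create them as float32 from the start
arr32 = np.random.rand(1000, 1000).astype(np.float32)
```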

There are plenty of ways to reduce the size of the data you keep in memory, and what's written above is just one of them. Maybe you could also break the process you've mentioned into smaller chunks (i.e., storage of data, extraction of data) and work on each chunk/stage individually, which hopefully will reduce the amount of data you have to hold at any one time (see the sketch below).
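One way to sketch the chunking idea, assuming you can split the dataset into pieces and keep each piece in a compressed .npz file (the file names and chunk size are made up):

```python
import numpy as np

# Stand-in dataset: arrays of different shapes
all_arrays = [np.random.rand(np.random.randint(2, 6), 4).astype(np.float32) for _ in range(100)]

chunk_size = 25
for i in range(0, len(all_arrays), chunk_size):
    chunk = all_arrays[i:i + chunk_size]
    # savez_compressed handles arrays of different shapes and compresses them on disk
    np.savez_compressed(f"chunk_{i // chunk_size}.npz", *chunk)

# Later, load and process one chunk at a time (e.g., inside a Keras data generator)
with np.load("chunk_0.npz") as data:
    first_chunk = [data[key] for key in data.files]
```

Loading one chunk at a time keeps peak memory bounded by the chunk size rather than by the whole dataset.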

Otherwise, you could perhaps take advantage of a database management system (SQL or NoSQL, whichever fits best), which might work better, though querying that amount of data might become yet another issue.
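A rough sketch of the database idea using Python's built-in sqlite3 module, storing each array as raw bytes together with its shape and dtype so it can be reconstructed (the table and column names are made up):

```python
import sqlite3

import numpy as np

conn = sqlite3.connect("arrays.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS arrays (id INTEGER PRIMARY KEY, shape TEXT, dtype TEXT, data BLOB)"
)

arr = np.random.rand(3, 4).astype(np.float32)
conn.execute(
    "INSERT INTO arrays (shape, dtype, data) VALUES (?, ?, ?)",
    (",".join(map(str, arr.shape)), str(arr.dtype), arr.tobytes()),
)
conn.commit()

# Query one row at a time instead of loading the whole dataset into memory
shape, dtype, blob = conn.execute("SELECT shape, dtype, data FROM arrays WHERE id = 1").fetchone()
restored = np.frombuffer(blob, dtype=dtype).reshape([int(s) for s in shape.split(",")])
conn.close()
```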

I'm by no means an expert in this; I'm just describing how I've dealt with excessively large datasets (similar to what you're currently experiencing) in the past, and I'm pretty sure someone here will give you a more definitive answer than my 'a little of everything' one. All the best!

CodePudding user response:

Yes, avoid using csv/excel for big datasets. There are tons of data formats out there; for this case I would recommend using a compressed format such as pd.DataFrame.to_hdf, pd.DataFrame.to_parquet, or pd.DataFrame.to_pickle.

There are even more formats to choose from, and compression options within these functions (for example, to_hdf takes a complevel argument that you can set to 9).
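For example, something along these lines (the file names are just placeholders, and the optional package requirements are noted in the comments):

```python
import numpy as np
import pandas as pd

# Stand-in for the dataframe described in the question
df = pd.DataFrame({"data": [np.random.rand(12).tolist() for _ in range(1000)]})

# HDF5 with maximum compression (requires the `tables` package)
df.to_hdf("dataset.h5", key="df", complevel=9, complib="zlib")

# Parquet (requires `pyarrow` or `fastparquet`), usually much smaller than csv
df.to_parquet("dataset.parquet")

# Pickle with on-the-fly gzip compression
df.to_pickle("dataset.pkl.gz", compression="gzip")

# Reading back
df_h5 = pd.read_hdf("dataset.h5", key="df")
df_parquet = pd.read_parquet("dataset.parquet")
df_pickle = pd.read_pickle("dataset.pkl.gz")
```

These formats also store the numbers in binary rather than as text, which is usually the biggest saving compared to csv.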
