I am currently using Parquet files due to their outstanding read-in time. However, I am now looking to change the functionality of my program slightly. The files will become too large for memory, and instead I wish to read in only specific rows of each file.
The files hold around 15 GB of data each (and I will be using multiple files), with several hundred columns and millions of rows. If I wanted to read in, e.g., only row x, operate on it, and then read in a new row (millions of times over), what would be the most efficient file type for doing this?
I am not too concerned about compression, as RAM is my limiting factor rather than storage.
Thanks in advance for your expertise!
CodePudding user response:
Most likely you will not get everything right on the first pass over your data. If the raw data is stored as CSV, save yourself some debugging time and convert the CSV to Parquet first, e.g.:
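A minimal sketch of such a conversion with pandas and pyarrow, streaming the CSV in chunks so the whole file never has to fit in RAM (the file names and chunk size here are placeholders):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_path = "raw_data.csv"          # placeholder input path
parquet_path = "raw_data.parquet"  # placeholder output path

writer = None
# Stream the CSV in chunks so a 15 GB file never sits in memory at once.
for chunk in pd.read_csv(csv_path, chunksize=1_000_000):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        # Create the Parquet writer from the schema of the first chunk.
        writer = pq.ParquetWriter(parquet_path, table.schema)
    writer.write_table(table)

if writer is not None:
    writer.close()
```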
CodePudding user response:
Depending on your exact requirements, I'd look at:

- RocksDB (an embedded key-value store)
- SQLite (an embedded SQL database)
Note that RocksDB will produce multiple files in a single directory rather than an individual file. Last I looked, RocksDB did not support secondary indexes, so you are stuck with whatever choice you make for the key unless you want to rewrite the data. The RocksDB project does not ship official Python bindings, but there are a few floating around on GitHub.
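As a rough illustration, assuming one of those third-party bindings (python-rocksdb here) and using the row number as the key, per-row access might look like this; the paths, key format, and serialization are all placeholder choices:

```python
import pickle
import rocksdb  # third-party binding, e.g. python-rocksdb

# Opening the database creates a directory of files, not a single file.
db = rocksdb.DB("rows.db", rocksdb.Options(create_if_missing=True))

def store_row(row_number, row_values):
    # Zero-pad the key so lexicographic key order matches numeric row order.
    db.put(f"{row_number:012d}".encode(), pickle.dumps(row_values))

def load_row(row_number):
    raw = db.get(f"{row_number:012d}".encode())
    return None if raw is None else pickle.loads(raw)
```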
SQLite, at least for the initial load, might be pretty slow; I would recommend loading the data first and then creating an index on the row number after the initial load. But it allows you to create secondary indices and to find multiple rows at a time by those indices reasonably efficiently.
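A minimal sketch using Python's built-in sqlite3 module together with pandas, assuming the data carries a row_number column (the file, table, and column names are made up):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("rows.db")  # placeholder database file

# Bulk-load the CSV in chunks; pandas creates the table on the first chunk.
for chunk in pd.read_csv("raw_data.csv", chunksize=1_000_000):
    chunk.to_sql("data", conn, if_exists="append", index=False)

# Create the index after the bulk load, as suggested above.
conn.execute("CREATE INDEX IF NOT EXISTS idx_row ON data(row_number)")
conn.commit()

# Fetch a single row by its row number without scanning the whole table.
row = conn.execute("SELECT * FROM data WHERE row_number = ?", (12345,)).fetchone()
```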