Big data dataframe from an on-disk mem-mapped binary struct format with Python, pandas, Dask, NumPy


I have timeseries data in sequential (packed C-struct) format in very large files. Each structure contains K fields of different types in some order. The file is essentially a row-wise array of these structures. I would like to be able to mmap the file and map each field to a numpy array (or another form) where the array recognizes a stride (the size of the struct), aliased as columns in a dataframe.

An example struct might be:

struct {
   int32_t a;
   double b;
   int16_t c;
}

Such a file of records could be generated with python as:

from struct import pack, unpack

# Write 999 records; each is a packed 14-byte struct: int32, float64, int16
db = open("binarydb", "wb")
for i in range(1, 1000):
    packed = pack('<idh', i, i * 3.14, i * 2)
    db.write(packed)
db.close()

The question is then how to view such a file efficiently as a dataframe. If we assume the file is hundreds of millions of rows long, we would need a mem-map solution.

Using memmap, how can I map a numpy array (or alternative array structure) to the sequence of integers for column a? It seems to me that I would need to be able to indicate a stride (14 bytes, the packed struct size) and an offset (0 in this case) for the int32 series "a", an offset of 4 for the float64 series "b", and an offset of 12 for the int16 series "c".

I have seen that one can easily create a numpy array against an mmap'ed file if the file contains a single dtype. Is there a way to pull the different series in this file by indicating a type, offset, and stride? With this approach I could present mmapped columns to pandas or another dataframe implementation.
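As an illustration of the kind of mapping being asked for, a structured dtype with explicit field offsets and an itemsize equal to the record stride would look roughly like this (a sketch assuming the 14-byte packed layout and the "binarydb" file above):

import numpy as np

# Structured dtype with explicit per-field offsets and a 14-byte itemsize,
# matching the packed struct layout (int32 at 0, float64 at 4, int16 at 12)
record = np.dtype({'names':    ['a', 'b', 'c'],
                   'formats':  [np.int32, np.float64, np.int16],
                   'offsets':  [0, 4, 12],
                   'itemsize': 14})

raw = np.memmap("binarydb", dtype=record, mode='r')
a = raw['a']   # int32 view onto the file with a 14-byte stride, no copy
b = raw['b']   # float64 view at offset 4
c = raw['c']   # int16 view at offset 12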

Even better, is there a simple way to integrate a custom mem-mapped format into Dask, so that we get the benefits of lazy paging into the file?

CodePudding user response:

You can use numpy.memmap to do that. Since your records are not a single native type, you need to use a structured NumPy dtype. Note that you need the size of the array ahead of time, since NumPy does not support unbounded streams, only fixed-size arrays.

import numpy as np

size = 999

datatype = np.dtype([('a', np.int32), ('b', np.float64), ('c', np.int16)])

# The final memory-mapped array ('w+' creates the file for reading and writing)
data = np.memmap("binarydb", dtype=datatype, mode='w+', shape=size)

for i in range(1, size + 1):
    data[i - 1]['a'] = i
    data[i - 1]['b'] = i * 3.14
    data[i - 1]['c'] = i * 2

Note that vectorized operations are generally much faster than direct indexing in NumPy. Numba can also be used to speed up direct indexing if the operation cannot be vectorized.
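For example, the fill loop above can be written as vectorized column assignments (a sketch reusing data and size from the snippet above):

# Assign whole columns at once instead of one record at a time
i = np.arange(1, size + 1)
data['a'] = i
data['b'] = i * 3.14
data['c'] = i * 2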

Note that a memory-mapped array can be flushed in NumPy, but there is no explicit close method; the mapping is only released when the array object is garbage-collected.
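In practice that looks like flushing and then dropping the reference (assuming data from above):

data.flush()   # push pending writes to disk
del data       # dropping the last reference releases the mapping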

CodePudding user response:

Extrapolating from @Jérôme Richard's answer above, here is code to read from a binary sequence of records:

import numpy as np

size = 999
datatype = np.dtype([('a', np.int32), ('b', np.float64), ('c', np.int16)])

# The final memory-mapped array, opened read-only
data = np.memmap("binarydb", dtype=datatype, mode='r', shape=size)

You can then pull each series as:

data['a']
data['b']
data['c']
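Since the memmap behaves like a NumPy record array with named fields, both pandas and Dask can consume it. A rough sketch (assuming data from the snippet above; pandas copies whatever slice it is given, while Dask stays lazy and reads in chunks):

import pandas as pd
import dask.dataframe as dd

# pandas copies into its own memory, so take a bounded slice
df_slice = pd.DataFrame(data[:100_000])

# Dask keeps the memmap lazy and chunks it; the dtype field names become columns
ddf = dd.from_array(data, chunksize=1_000_000)
print(ddf['b'].mean().compute())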