I'm looking for guidance on how to efficiently filter out unneeded parts of my data before converting to a numpy array and/or pandas dataframe. Data is delivered to my program as string buffers (each record separately), and I'm currently using np.frombuffer
to construct an array once all records are retrieved.
The problem is that individual records can be quite long (thousands of fields), and sometimes I only want a small subset of them. Filtering out the unneeded fields adds extra steps, though, and significantly slows down the import.
Without any filtering, my current process is:
import numpy as np
import pandas as pd

# assume some function here that retrieves one record at a time and appends it to 'data'
# each record is 24 bytes: an 8-byte double, an 8-byte space-padded string, an 8-byte double
data = [b'\x00\x00\x00\x00\x00\x00\xf0?one     \x00\x00\x00\x00\x00\x00Y@',
        b'\x00\x00\x00\x00\x00\x00\x00@two     \x00\x00\x00\x00\x00\x00i@',
        b'\x00\x00\x00\x00\x00\x00\x08@three   \x00\x00\x00\x00\x00\xc0r@',
        b'\x00\x00\x00\x00\x00\x00\x10@four    \x00\x00\x00\x00\x00\x00y@']
struct_dtypes = np.dtype([('n1', 'd'), ('ch', 'S8'), ('n2', 'd')])
final_data = b''.join(data)
arr = np.frombuffer(final_data, dtype=struct_dtypes)
df = pd.DataFrame(arr)
# dataframe
     n1     ch     n2
0   1.0    one  100.0
1   2.0    two  200.0
2   3.0  three  300.0
3   4.0   four  400.0
My current solution for filtering is essentially:
final_data = b''.join(b''.join(buffer[offset:offset + 8] for offset in [0, 16]) for buffer in data)
struct_dtypes = np.dtype([('n1', 'd'), ('n2', 'd')])
arr = np.frombuffer(final_data, dtype=struct_dtypes)
df = pd.DataFrame(arr)
    n1     n2
0  1.0  100.0
1  2.0  200.0
2  3.0  300.0
3  4.0  400.0
That middle step to slice and rejoin each record makes filtering slower than just reading everything. If I construct the full array first and then return only the specified columns, isn't that just a waste of memory? What's an appropriate way to read only the portions of the string buffers I want?
CodePudding user response:
You can specify an offset for each field during dtype construction:
struct_dtypes = np.dtype({'names': ['n1', 'n2'], 'formats': ['d', 'd'], 'offsets': [0, 16]})
or
struct_dtypes = np.dtype({'n1': ('d', 0), 'n2': ('d', 16)})
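With the offsets spelled out, NumPy steps over the bytes you don't name, so the buffers can be joined once and passed straight to np.frombuffer with no per-record slicing. A minimal sketch using the sample data from the question (the explicit 'itemsize' of 24 is an assumption based on the record layout above; here NumPy would infer the same stride because the last wanted field ends at byte 24, but spelling it out keeps the dtype covering the full record even when the skipped fields are at the end):

# dtype that only names the wanted fields; 'itemsize' keeps the stride
# equal to the full 24-byte record so the middle string field is skipped
struct_dtypes = np.dtype({'names': ['n1', 'n2'],
                          'formats': ['d', 'd'],
                          'offsets': [0, 16],
                          'itemsize': 24})
final_data = b''.join(data)
arr = np.frombuffer(final_data, dtype=struct_dtypes)
df = pd.DataFrame(arr)
# same n1/n2 columns as the manual slicing version, without the rejoin step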