I'm looking for guidance on how to efficiently filter out unneeded parts of my data before converting to a numpy array and/or pandas dataframe. Data is delivered to my program as string buffers (each record separately), and I'm currently using np.frombuffer
to construct an array once all records are retrieved.
The problem is that individual records can be quite long (thousands of fields), and sometimes I only want a small subset of them. Filtering out the unneeded fields adds extra steps, though, and significantly slows down the import.
Without any filtering, my current process is:
import numpy as np
import pandas as pd

# assume some function here that retrieves one record at a time and appends it to 'data'
# each record is 24 bytes: an 8-byte double, an 8-byte space-padded string, an 8-byte double
data = [b'\x00\x00\x00\x00\x00\x00\xf0?one     \x00\x00\x00\x00\x00\x00Y@',
        b'\x00\x00\x00\x00\x00\x00\x00@two     \x00\x00\x00\x00\x00\x00i@',
        b'\x00\x00\x00\x00\x00\x00\x08@three   \x00\x00\x00\x00\x00\xc0r@',
        b'\x00\x00\x00\x00\x00\x00\x10@four    \x00\x00\x00\x00\x00\x00y@']
struct_dtypes = np.dtype([('n1', 'd'), ('ch', 'S8'), ('n2', 'd')])
final_data = b''.join(data)
arr = np.frombuffer(final_data, dtype=struct_dtypes)
df = pd.DataFrame(arr)
# dataframe
     n1     ch     n2
0   1.0    one  100.0
1   2.0    two  200.0
2   3.0  three  300.0
3   4.0   four  400.0
My current solution for filtering is essentially:
final_data = b''.join(b''.join(buffer[offset:offset + 8] for offset in [0, 16]) for buffer in data)
struct_dtypes = np.dtype([('n1', 'd'), ('n2', 'd')])
arr = np.frombuffer(final_data, dtype=struct_dtypes)
df = pd.DataFrame(arr)
    n1     n2
0  1.0  100.0
1  2.0  200.0
2  3.0  300.0
3  4.0  400.0
That middle step to slice and rejoin each record makes filtering slower than just reading everything. If I construct the full array first and then return only the specified columns, isn't that just a waste of memory? What's an appropriate way to read only the portions of the string buffers I want?
CodePudding user response:
You can specify an offset for each field during dtype construction:
struct_dtypes = np.dtype({'names': ['n1', 'n2'], 'formats': ['d', 'd'], 'offsets': [0, 16]})
or
struct_dtypes = np.dtype({'n1': ('d', 0), 'n2': ('d', 16)})
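With the offsets spelled out, NumPy steps over the bytes you don't name, so the buffers can be joined once and passed straight to np.frombuffer with no per-record slicing. A minimal sketch using the sample data from the question (the explicit 'itemsize' of 24 is an assumption based on the record layout above; here NumPy would infer the same stride because the last wanted field ends at byte 24, but spelling it out keeps the dtype covering the full record even when the skipped fields are at the end):

# dtype that only names the wanted fields; 'itemsize' keeps the stride
# equal to the full 24-byte record so the middle string field is skipped
struct_dtypes = np.dtype({'names': ['n1', 'n2'],
                          'formats': ['d', 'd'],
                          'offsets': [0, 16],
                          'itemsize': 24})
final_data = b''.join(data)
arr = np.frombuffer(final_data, dtype=struct_dtypes)
df = pd.DataFrame(arr)
# same n1/n2 columns as the manual slicing version, without the rejoin step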