Which file format uses less memory in Python?

I wrote code for point generation that produces a new dataframe every second and keeps running indefinitely. Each dataframe has 1000 rows and 7 columns. It is implemented with a while loop, so every iteration generates one dataframe that must be appended to a file. Which file format should I use to keep this memory efficient? Which file format takes less memory? Can anyone give me a suggestion? Is it okay to use CSV? If so, what datatype should I prefer? My dataframe currently holds int16 values. Should I append them as they are, or should I convert them to a binary or byte format?

CodePudding user response:

NumPy arrays can be stored in binary format. Since you have a single int16 data type, you can create a NumPy array and write that. You would use 2 bytes per int16 value, which is fairly good for size. The trick is that you need to know the dimensions of the stored data when you read it back later. In this example they are hard-coded. This is a bit fragile: if you change your mind and start using different dimensions later, the old data would have to be converted.
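To put rough numbers on the size claim, here is a minimal sketch comparing the raw int16 bytes of one 1000x7 frame with the same frame written as CSV text. The random sample data and the seed are made up for illustration, so the CSV figure will differ for your real values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame: 1000 rows x 7 columns of random int16 values.
rng = np.random.default_rng(0)
arr = rng.integers(-32768, 32767, size=(1000, 7), dtype=np.int16)
df = pd.DataFrame(arr)

# Raw binary: exactly 2 bytes per int16 value.
raw_bytes = arr.nbytes

# CSV: every digit, sign, comma and newline costs one byte of text.
csv_bytes = len(df.to_csv(index=False, header=False).encode("utf-8"))

print(f"raw int16: {raw_bytes} bytes per frame")
print(f"CSV text:  {csv_bytes} bytes per frame")
```

The raw size is fixed at rows x columns x 2 bytes, while the CSV size grows with the number of digits in each value.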

Assuming you want to read a bunch of 1000x7 dataframes later, you could do something like the example below. The writer keeps appending 1000x7 blocks of int16 values and the reader chunks them back into dataframes. If you don't use anything specific to pandas itself, you would be better off sticking with NumPy for all of your operations and skipping the conversions demonstrated here.

import pandas as pd
import numpy as np

def write_df(filename, df):
    with open(filename, "ab") as fp:
        np.array(df, dtype="int16").tofile(fp)

def read_dfs(filename, dim=(1000, 7)):
    """Sequentially read dataframes from a file of raw int16 values,
    each with shape `dim` (1000x7 by default)."""
    size = dim[0] * dim[1]
    with open(filename, "rb") as fp:
        while True:
            arr = np.fromfile(fp, dtype="int16", count=size)
            if arr.size < size:
                # end of file, or a truncated trailing chunk: stop reading
                break
            yield pd.DataFrame(arr.reshape(dim))

import os

# prepare for the test: remove any file left over from a previous run
test_filename = "test123"
if os.path.exists(test_filename):
    os.remove(test_filename)
    
df = pd.DataFrame({"a":[1,2,3], "b":[4,5,6]})

# write test file
for _ in range(5):
    write_df(test_filename, df)
    
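# read and verify test file
return_data = list(read_dfs(test_filename, dim=(3, 2)))
assert len(return_data) == 5
# every chunk read back should hold the same values that were written
assert all((d.to_numpy() == df.to_numpy()).all() for d in return_data)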
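If pandas is not otherwise needed, the same idea can be sketched with NumPy alone, as suggested above. This is only an illustration under stated assumptions: the names `append_chunk`, `read_all`, `ROWS` and `COLS`, and the temporary file `points.bin`, are hypothetical, and every chunk is assumed to have the same fixed number of columns.

```python
import os
import tempfile
import numpy as np

ROWS, COLS = 1000, 7  # assumed fixed shape of every chunk

def append_chunk(filename, arr):
    """Append one chunk to the file as raw int16 values."""
    with open(filename, "ab") as fp:
        np.asarray(arr, dtype=np.int16).tofile(fp)

def read_all(filename, cols=COLS):
    """Read the whole file back as one 2-D array with `cols` columns."""
    return np.fromfile(filename, dtype=np.int16).reshape(-1, cols)

# Usage: write three chunks to a temporary file, then read them back.
path = os.path.join(tempfile.mkdtemp(), "points.bin")
chunk = np.arange(ROWS * COLS, dtype=np.int16).reshape(ROWS, COLS)
for _ in range(3):
    append_chunk(path, chunk)

data = read_all(path)
print(data.shape, os.path.getsize(path))
```

Reading everything at once is only reasonable while the file fits in memory; for a file that keeps growing, reading fixed-size chunks as in `read_dfs` above is the safer pattern.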