Can we force Python/Pandas to flush to disk immediately?

Time:10-06

I have a setup where a Python script (let's call it test1.py) spawns a subprocess that executes test2.py. In test2.py, some pandas operations ultimately build a dataframe, test. The final step in test2.py saves the dataframe to CSV (test.to_csv('my_path')). Once test2.py completes, test1.py continues execution, and its next step is to load the CSV file that was just created (i.e., test = pd.read_csv('my_path')).

Now, the issue is that Python does not seem to be flushing the buffer to disk, so when test1.py goes to read the CSV file, I get a FileNotFoundError. Of course, if I stop the script, the file is saved to disk. Is there a way to force pandas to flush to disk immediately? I've read about using file.flush() and os.fsync(fd), but these don't seem to apply to my case since I'm not dealing with any file descriptors.

EDIT: Added a (significantly) simplified example

test1.py looks something like:

import subprocess

import pandas as pd  # needed for pd.read_csv below


def main():
    cmd = ['python3', 'test2.py']
    output_bytes = subprocess.check_output(cmd, stderr=subprocess.STDOUT, timeout=900)
    output = output_bytes.decode('utf-8')
    # test2.py finished, so I want to read the csv
    df = pd.read_csv('my_path')


if __name__ == '__main__':
    main()

test2.py looks something like:

import pandas as pd
import numpy as np

def main():
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
    df.to_csv('my_path')

if __name__ == '__main__':
    main()

CodePudding user response:

but this don't seem to apply to my case since I'm not dealing with any file descriptors.

You do not have to pass a filename as the first argument to .to_csv. As the pandas.DataFrame.to_csv docs say, you may also pass a

file-like object implementing a write() function.

Therefore, you can do something like this:

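import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
# newline="" disables newline translation for the text-mode file
f = open("file.csv", "w", newline="")
df.to_csv(f)
f.flush()  # push Python's buffer to the operating system
f.close()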

Note that if you open the file in text (non-binary) mode, you need to pass newline="" to disable universal newlines.
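
If you also want the os.fsync(fd) step mentioned in the question, the file object gives you a descriptor through f.fileno(). Here is a minimal sketch of how the two calls might be combined. The helper name write_csv_durably, the temporary output path, and the index=False argument are my own choices for illustration, not anything from the question:

```python
import os
import tempfile

import pandas as pd


def write_csv_durably(df, path):
    """Write df to path, flush Python's buffer, then ask the OS to sync.

    Hypothetical helper for illustration only.
    """
    # newline="" disables newline translation, as the to_csv docs advise
    with open(path, "w", newline="") as f:
        df.to_csv(f, index=False)
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS to commit its cache to disk


if __name__ == "__main__":
    df = pd.DataFrame({"x": [1, 2, 3]})
    out_path = os.path.join(tempfile.mkdtemp(), "file.csv")
    write_csv_durably(df, out_path)
    # Read the file back to confirm the round trip
    print(pd.read_csv(out_path)["x"].tolist())
```

The with block closes the file for you, so no explicit f.close() is needed. One caveat: on a local filesystem, a file written by a process that has already exited is normally visible to the parent without any of this, so if the FileNotFoundError persists it may be worth checking that both scripts resolve 'my_path' against the same working directory.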
