I have a setup where a python script (let's call it test1.py) is spawning a subprocess which executes test2.py. In test2.py, I have some pandas operations which ultimately build a dataframe test. The final step in test2.py is saving the dataframe to csv (test.to_csv('my_path')). On completion of test2.py, test1.py continues execution, and the next step required is to load the same csv file just created (i.e., test = pd.read_csv('my_path')).
Now, the issue is that Python is not flushing the buffer to disk, and therefore, when test1.py goes to read the csv file, I get a FileNotFoundError. Of course, if I stop the script, the file is saved to disk. Is there a way to force pandas to flush to disk immediately? I've read about using file.flush() and os.fsync(fd), but these don't seem to apply to my case since I'm not dealing with any file descriptors.
EDIT: Added a (significantly) simplified example
test1.py looks something like:
import subprocess

import pandas as pd

def main():
    cmd = ['python3', 'test2.py']
    output_bytes = subprocess.check_output(cmd, stderr=subprocess.STDOUT, timeout=900)
    output = output_bytes.decode('utf-8')
    # test2.py finished, so I want to read the csv
    df = pd.read_csv('my_path')

if __name__ == '__main__':
    main()
test2.py looks something like:
import pandas as pd
import numpy as np

def main():
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
    df.to_csv('my_path')

if __name__ == '__main__':
    main()
CodePudding user response:
but these don't seem to apply to my case since I'm not dealing with any file descriptors.
You do not have to use a filename as the first argument for .to_csv; as the pandas.DataFrame.to_csv docs say, you may use a "file-like object implementing a write() function". Therefore you can do something like this:
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
f = open("file.csv", "w", newline="")
df.to_csv(f)
f.flush()  # push Python's internal buffer out to the OS
f.close()
Observe that if you open the file in text (non-binary) mode, you need to pass newline="" to disable universal newline translation; otherwise you can end up with an extra blank line between rows on Windows.
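If you want an even stronger guarantee that the bytes actually reach the disk (the os.fsync(fd) you mentioned in the question), you can combine flush() with os.fsync() on the file object's descriptor. Here is a sketch of the same idea, using a context manager so the file is also closed reliably; the filename "file.csv" is just for illustration:

```python
import os

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# Writing through an explicit file object gives us access to the
# underlying file descriptor via f.fileno().
with open("file.csv", "w", newline="") as f:
    df.to_csv(f)
    f.flush()             # move Python's internal buffer to the OS
    os.fsync(f.fileno())  # ask the OS to commit the data to disk
```

Note that flush() alone only hands the data to the operating system; os.fsync() is what forces it through the OS cache to the physical device, which matters if another process must see the file immediately or if you need durability across a crash.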