I'm loading a pandas dataframe from a very large and not very well-formed CSV file.
Toy dataset to demonstrate (lines 3 and 4 have more than 2 columns):
$ cat data.csv
n|car
7|u
1|x||
6|y|
9|z
2|t
$
Once loaded, I get most of the lines in the dataframe, plus, on stderr, the list of lines with a mismatched number of columns (saved as err.log).
$ cat load.py
import pandas as pd
df = pd.read_csv('data.csv', dtype=str, sep='|', on_bad_lines='warn')
print(df.head(5))
$
$ python load.py 2> err.log
n car
0 7 u
1 9 z
2 2 t
$
$ cat err.log
b'Skipping line 3: expected 2 fields, saw 4\nSkipping line 4: expected 2 fields, saw 3\n'
$
I then use the line numbers provided by err.log to get the raw lines with errors for further human analysis.
Is it possible to access these erroneous lines from inside the Python code, without needing 2> err.log in the calling command? (I do not have direct access to the execution, so I cannot add the 2> err.log redirection myself.)
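For context, the manual step described above (turning the line numbers reported in err.log into the raw lines of the file) might look roughly like the sketch below. The helper names bad_line_numbers and raw_lines are made up for illustration, and the regular expression assumes the "Skipping line N: ..." message format shown in err.log.

```python
import re

def bad_line_numbers(log_text):
    """Extract the line numbers from 'Skipping line N: ...' messages."""
    return [int(n) for n in re.findall(r"Skipping line (\d+):", log_text)]

def raw_lines(path, numbers):
    """Return {line_number: raw_text} for the requested 1-based line numbers."""
    wanted = set(numbers)
    with open(path, encoding="utf-8") as fh:
        return {
            i: line.rstrip("\n")
            for i, line in enumerate(fh, start=1)
            if i in wanted
        }

# Sample text copied from the err.log shown above.
sample_log = r"b'Skipping line 3: expected 2 fields, saw 4\nSkipping line 4: expected 2 fields, saw 3\n'"
numbers = bad_line_numbers(sample_log)
print(numbers)  # [3, 4]
```

Calling raw_lines('data.csv', numbers) would then return the text of lines 3 and 4, assuming the reported numbers are 1-based and count the header line, as they appear to in the toy example.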
CodePudding user response:
You can redirect stderr from inside Python to an in-memory buffer, as follows. Note that while stderr is redirected you won't see anything on your screen if an error happens, so if you want to see errors you must catch the exception, point sys.stderr back at the real stream, and then raise the error again.
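import sys
import io
import pandas as pd

buff = io.StringIO()
sys.stderr = buff  # capture everything written to stderr
try:
    df = pd.read_csv('data.csv', dtype=str, sep='|', on_bad_lines='warn')
finally:
    sys.stderr = sys.__stderr__  # always restore stderr, even if read_csv raises
df.head(5)
print(buff.getvalue())
# drop the leading b' and the trailing \n' (plus newline), then split on the literal \n
lines = buff.getvalue()[2:-4].split('\\n')
for line in lines:
    print(line)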
b'Skipping line 3: expected 2 fields, saw 4\nSkipping line 4: expected 2 fields, saw 3\n'
Skipping line 3: expected 2 fields, saw 4
Skipping line 4: expected 2 fields, saw 3