I am trying to fix an issue I'm having with null bytes in a CSV file.
The csv_file object is being passed in from a different function in my Flask application:
stream = codecs.iterdecode(csv_file.stream, "utf-8-sig", errors="strict")
dict_reader = csv.DictReader(stream, skipinitialspace=True, restkey="INVALID")
for row in dict_reader:  # Error is thrown here
    ...
The error thrown in the console is _csv.Error: line contains NULL byte.
So far, I have tried:
- different encoding types (I checked the encoding type and it is utf-8-sig)
- using .replace('\x00', ''), but I can't seem to get these null bytes to be removed.
I would like to remove the null bytes and replace them with empty strings, but I would also be okay with skipping over any row that contains null bytes. I am unable to share my CSV file.
EDIT: The solution I reached:
import codecs
import csv
import io
content = csv_file.read()
# Converting the above object into an in-memory byte stream
csv_stream = io.BytesIO(content)
# Iterating through the lines and replacing null bytes with an empty byte string
fixed_lines = (line.replace(b'\x00', b'') for line in csv_stream)
# Below remains unchanged, just passing in fixed_lines instead of csv_stream
stream = codecs.iterdecode(fixed_lines, 'utf-8-sig', errors='strict')
dict_reader = csv.DictReader(stream, skipinitialspace=True, restkey="INVALID")
CodePudding user response:
I think your question definitely needs to show a sample of the stream of bytes you expect from csv_file.stream.
I like pushing myself to learn more about Python's approach to IO, encoding/decoding, and CSV, so I've worked this much out for myself, but you probably shouldn't expect others to do the same.
import csv
from codecs import iterdecode
import io
# Flask's file.stream is probably BytesIO, see https://stackoverflow.com/a/18246385
# and the Gist in the comment, https://gist.github.com/lost-theory/3772472?permalink_comment_id=1983064#gistcomment-1983064
csv_bytes = b'''\xef\xbb\xbf C1, C2
r1c1, r1c2
r2c1, r2c2, r2c3\x00'''
# This is what Flask is probably giving you
csv_stream = io.BytesIO(csv_bytes)
# fixed_lines is a lazy generator, `(line.repl...)`, not a list, `[line.repl...]`
fixed_lines = (line.replace(b'\x00', b'') for line in csv_stream)
decoded_lines = iterdecode(fixed_lines, 'utf-8-sig', errors='strict')
reader = csv.DictReader(decoded_lines, skipinitialspace=True, restkey="INVALID")
for row in reader:
    print(row)
and I get:
{'C1': 'r1c1', 'C2': 'r1c2'}
{'C1': 'r2c1', 'C2': 'r2c2', 'INVALID': ['r2c3']}
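The question also mentions that skipping rows containing null bytes would be acceptable. Here is a minimal sketch of that variant, using the same iterator-chaining idea. The sample bytes are made up for illustration, and it assumes no quoted field spans several physical lines, because the filter works on raw lines before the CSV parser sees them. A null byte in the header line would drop the header too.

```python
import csv
import io
from codecs import iterdecode

# Hypothetical sample data: the second data row contains a null byte.
csv_bytes = b"\xef\xbb\xbfC1,C2\nr1c1,r1c2\nr2c1,r2\x00c2\nr3c1,r3c2\n"

# Stand-in for the byte stream Flask is assumed to provide.
csv_stream = io.BytesIO(csv_bytes)

# Keep only the raw lines that contain no null byte (lazy generator).
clean_lines = (line for line in csv_stream if b"\x00" not in line)

# Decoding and parsing stay the same as in the approach above.
decoded_lines = iterdecode(clean_lines, "utf-8-sig", errors="strict")
reader = csv.DictReader(decoded_lines, skipinitialspace=True, restkey="INVALID")

rows = list(reader)
for row in rows:
    print(row)
```

With this sample, the row holding r2c1 is dropped entirely instead of being repaired. Which behavior is preferable depends on whether a partially cleaned row is still trustworthy for your data.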