How to deal with "_csv.Error: line contains NULL byte"?

I am trying to fix an issue I'm having with null bytes in a CSV file.

The csv_file object is being passed in from a different function in my Flask application:

import codecs
import csv

stream = codecs.iterdecode(csv_file.stream, "utf-8-sig", errors="strict")
dict_reader = csv.DictReader(stream, skipinitialspace=True, restkey="INVALID")

for row in dict_reader:  # Error is raised here
    ...

The error thrown in the console is _csv.Error: line contains NULL byte.

So far, I have tried:

different encoding types (I checked the encoding type and it is utf-8-sig)
using .replace('\x00', '')

but I can't seem to get these null bytes to be removed.

I would like to strip the null bytes out (replacing them with empty strings), but I would also be okay with skipping any row that contains them. I am unable to share my CSV file.

EDIT: The solution I reached:

    content = csv_file.read()

    # Converting the above object into an in-memory byte stream
    csv_stream = io.BytesIO(content)

    # Iterating through the lines and replacing null bytes with an empty byte string
    fixed_lines = (line.replace(b'\x00', b'') for line in csv_stream)


    # Below remains unchanged, just passing in fixed_lines instead of csv_file.stream

    stream = codecs.iterdecode(fixed_lines, 'utf-8-sig', errors='strict')

    dict_reader = csv.DictReader(stream, skipinitialspace=True, restkey="INVALID")

CodePudding user response：

I think your question definitely needs to show a sample of the stream of bytes you expect from csv_file.stream.

I like pushing myself to learn more about Python's approach to IO, encoding/decoding, and CSV, so I've worked this much out for myself, but you probably shouldn't expect others to do the same.

import csv
from codecs import iterdecode
import io

# Flask's file.stream is probably BytesIO, see https://stackoverflow.com/a/18246385 
# and the Gist in the comment, https://gist.github.com/lost-theory/3772472?permalink_comment_id=1983064#gistcomment-1983064

csv_bytes = b'''\xef\xbb\xbf C1, C2
 r1c1, r1c2
 r2c1, r2c2, r2c3\x00'''

# This is what Flask is probably giving you
csv_stream = io.BytesIO(csv_bytes)

# fixed_lines is a lazy generator, `(line.repl...)`, not a list, `[line.repl...]`,
# so each line is cleaned only when the decoder asks for it
fixed_lines = (line.replace(b'\x00', b'') for line in csv_stream)

decoded_lines = iterdecode(fixed_lines, 'utf-8-sig', errors='strict')

reader = csv.DictReader(decoded_lines, skipinitialspace=True, restkey="INVALID")

for row in reader:
    print(row)

and I get:

{'C1': 'r1c1', 'C2': 'r1c2'}
{'C1': 'r2c1', 'C2': 'r2c2', 'INVALID': ['r2c3']}
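
The question also says that skipping the rows with null bytes would be acceptable. Here is a minimal sketch of that variant, built the same way as the code above. The helper name rows_without_nulls and the sample bytes are made up for illustration, and I'm still assuming Flask hands you a bytes stream. It drops physical lines, not parsed CSV rows, so it would misbehave if a quoted field spans several lines or if the null byte sits in the header line.

```python
import csv
import io
from codecs import iterdecode


def rows_without_nulls(byte_stream):
    """Yield dict rows, dropping every physical line that contains a null byte.

    Hypothetical helper: byte_stream is any iterable of bytes lines,
    e.g. io.BytesIO or (assumed) Flask's csv_file.stream.
    """
    # Filter on the raw bytes, before decoding and before csv sees anything
    clean_lines = (line for line in byte_stream if b'\x00' not in line)
    decoded_lines = iterdecode(clean_lines, 'utf-8-sig', errors='strict')
    yield from csv.DictReader(decoded_lines, skipinitialspace=True, restkey="INVALID")


# Made-up sample: the second data line contains a null byte
sample = io.BytesIO(b'\xef\xbb\xbf C1, C2\n r1c1, r1c2\n r2c1, r2\x00c2\n r3c1, r3c2\n')

for row in rows_without_nulls(sample):
    print(row)
```

With that sample, only the r1 and r3 rows should come out; the r2 line never reaches the CSV reader.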