Suppose I have the following csv file:
Column1;Column2
AB; Test
Which has a NUL byte character in it:
I want to read it in with pandas, manipulate it and export it again. I want to remove NUL byte characters, my code is as follows:
import pandas as pd
df=pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8")
df=df.replace('\x00','', regex=True)
df.to_csv(r'C:\folder\testoutput.csv', sep=";", index=False, encoding="utf-8")
I expect that only the NUL byte character is removed, however, more is removed:
So it removes the whole "cell", also the "Test". But I don't want this.
Why, where is my mistake? How can I only remove the NUL byte character?
CodePudding user response:
The issue is not coming from the replace
, but rather from the fact that the NULL character is handled as a line terminator by read_csv
.
What you can do it manually remove the NULL before passing it to read_csv
:
import io
f = io.StringIO(open(r'C:\folder\test.csv').read().replace('\x00', ''))
df = pd.read_csv(f, sep=";", encoding="utf-8")
Output:
Column1 Column2
0 AB Test
Alternatively, if you just want to remove the NULL without pandas:
with open(r'C:\folder\test.csv', 'r') as f1, open(r'C:\folder\test_clean.csv', 'w') as f2:
for line in f1:
f2.write(line.replace('\x00', ''))