Home > Software design >  Replace function in dataframe removes more than expected
Replace function in dataframe removes more than expected

Time:01-05

Suppose I have the following csv file:

Column1;Column2
AB; Test

Which has a NUL byte character in it:

n1

I want to read it in with pandas, manipulate it and export it again. I want to remove NUL byte characters, my code is as follows:

import pandas as pd
df=pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8")
df=df.replace('\x00','', regex=True) 
df.to_csv(r'C:\folder\testoutput.csv', sep=";", index=False, encoding="utf-8")

I expect that only the NUL byte character is removed, however, more is removed:

n2

So it removes the whole "cell", also the "Test". But I don't want this.

Why, where is my mistake? How can I only remove the NUL byte character?

CodePudding user response:

The issue is not coming from the replace, but rather from the fact that the NULL character is handled as a line terminator by read_csv.

What you can do it manually remove the NULL before passing it to read_csv:

import io

f = io.StringIO(open(r'C:\folder\test.csv').read().replace('\x00', ''))

df = pd.read_csv(f, sep=";", encoding="utf-8")

Output:

  Column1 Column2
0      AB    Test

Alternatively, if you just want to remove the NULL without pandas:

with open(r'C:\folder\test.csv', 'r') as f1, open(r'C:\folder\test_clean.csv', 'w') as f2:
    for line in f1:
        f2.write(line.replace('\x00', ''))
  • Related