Remove certian character from line by reading a file and save it to the file-CodePudding

I have a problem. I have a corrupted csv file. The last column is a free text and my seperator is ; unfortunately some users use ; in the free text, e.g. This is a longer text and;ups that should not be. I want to read now the file line by line and after the second ; everything should be replaced to a ,. I print out which line of this csv file corrupted. How could read the file and replace it at the same time? Or should I save the lines the output and replace it after?

Unfortunately I do not know how could I solve this kind of problem.

import pandas as pd

with open("sample.csv", encoding="UTF-8") as file:
    for i, line in enumerate(file):
      x = line.split(";")
      if(len(x) > 3):
        print(i, ": ", line)
        cleaned_x = (', '.join(x[2:]))
        # Add cleaned_x to x
        new_line = x[0]   ";"   x[1]    ";"   cleaned_x
        print(new_line)

df = pd.read_csv("file.csv", encoding="utf-8", sep=";")

What I have

customerId;name;text
1;Josey;I want to go at 05pm
2;Mike;Check this out --> öl
2;Frank;This is a longer text and;ups that should not be
2;Max;okay;
3;Josey;here is everythink good

What I want

customerId;name;text
1;Josey;I want to go at 05pm
2;Mike;Check this out --> öl
2;Frank;This is a longer text and,ups that should not be
2;Max;okay,
3;Josey;here is everythink good

CodePudding user response：

Define a custom func to read the csv file, then create a new dataframe from rows and cols:

def read_csv(path):
    with open(path) as file:
        for line in file:
            *v, t = line.strip().split(';', 2)
            yield [*v, t.replace(';', ',')]

cols, *rows = read_csv('sample.csv')
df = pd.DataFrame(rows, columns=cols)

print(df)
  customerId   name                                              text
0          1  Josey                              I want to go at 05pm
1          2   Mike                            Check this out --> Ã¶l
2          2  Frank  This is a longer text and,ups that should not be
3          2    Max                                             okay,
4          3  Josey                           here is everythink good

CodePudding user response：

FYI, if you use Python's csv library to write the initial file it will handle embeddded ; correctly

import csv

with open("test.csv", "w") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["hello", "world", "hello;world"])

# test.csv contains hello;world;"hello;world"
# which will be read as three fields using csv.reader

Here's how to fix your issue. I'll write out to a new file. It is possible to open a file in read/write mode but it's more complicated since you need to read a line, move the position in the file, write the new data while making sure you don't overwrite the bytes of the next line... It's much easier to use a new file then rename it.

import csv

with open("input.csv") as in_file, open("output.csv", "w") as out_file:

    reader = csv.reader(in_file, delimiter=";")
    writer = csv.writer(out_file, delimiter=";")

    for line in reader:  # line is a list containing the fields
        if len(line) > 3:
            line = line[:2]   [", ".join(line[2:])]
        writer.writerow(line)

If you don't need to save the fixed file you don't need to open "output.csv" or create the writer. Just print line after it's been corrected to display the list of fields ["hello", "world", "hello;world"]

If you prefer to print the string that would end up in the file you will need to wrap fields containing semicolons in quotes.

line = [f"\"{item}\"" if ";" in item else item for item in line]
print(";".join(line))
# hello;world;"hello;world"

CodePudding user response：

Pandas (version >= 1.3.0) allows a function to be called to process a bad line when encountering one with the on_bad_lines parameter:

callable, function with signature (bad_line: list[str]) -> list[str] | None that will process a single bad line. bad_line is a list of strings split by the sep. If the function returns None, the bad line will be ignored. If the function returns a new list of strings with more elements than expected, a ParserWarning will be emitted while dropping extra elements. Only supported when engine="python"

So you could simply read the file with:

df = pd.read_csv('sample.csv', sep=';', engine='python', on_bad_lines=lambda x: x[:2]   [';'.join(x[2:])])

Then save it to whatever format you prefer. Or to achieve the output defined in the question:

df['text'] = df['text'].str.replace(';', ',')
df.to_csv('output.csv', sep=';')

CodePudding user response：

You can save the lines in an array and create a new file.

import csv

new_sample = []
with open("sample.csv", encoding="UTF-8") as file:
for i, line in enumerate(file):
    x = line.split(";")
    if(len(x) > 3):
        print(i, ": ", line)
        cleaned_x = (', '.join(x[2:]))
        # Add cleaned_x to x
        new_line = x[0]   ";"   x[1]    ";"   cleaned_x
        print(new_line)
        new_sample.append(new_line)
    else:
        new_sample.append(line)

with open("new_sample.csv", "w", encoding="UTF-8") as new_file:  # Open in write mode.
    writer = csv.writer(new_file)
    for row in new_sample:
        writer.writerow(row)