I have a problem: a corrupted CSV file. The last column is free text and my separator is ;. Unfortunately some users type ; inside the free text, e.g. This is a longer text and;ups that should not be. I now want to read the file line by line and replace every ; after the second one with a ,. I already print out which lines of the CSV file are corrupted. How can I read the file and replace the separators at the same time? Or should I collect the lines and write them out afterwards?
Unfortunately I do not know how to solve this kind of problem.
import pandas as pd

with open("sample.csv", encoding="UTF-8") as file:
    for i, line in enumerate(file):
        x = line.split(";")
        if len(x) > 3:
            print(i, ": ", line)
            cleaned_x = ', '.join(x[2:])
            # Add cleaned_x to x
            new_line = x[0] + ";" + x[1] + ";" + cleaned_x
            print(new_line)

df = pd.read_csv("file.csv", encoding="utf-8", sep=";")
What I have
customerId;name;text
1;Josey;I want to go at 05pm
2;Mike;Check this out --> öl
2;Frank;This is a longer text and;ups that should not be
2;Max;okay;
3;Josey;here is everythink good
What I want
customerId;name;text
1;Josey;I want to go at 05pm
2;Mike;Check this out --> öl
2;Frank;This is a longer text and,ups that should not be
2;Max;okay,
3;Josey;here is everythink good
CodePudding user response:
You can save the lines in an array and create a new file.
new_sample = []

with open("sample.csv", encoding="UTF-8") as file:
    for i, line in enumerate(file):
        x = line.split(";")
        if len(x) > 3:
            print(i, ": ", line)
            cleaned_x = ', '.join(x[2:])
            # Rebuild the line with the extra separators replaced
            new_line = x[0] + ";" + x[1] + ";" + cleaned_x
            print(new_line)
            new_sample.append(new_line)
        else:
            new_sample.append(line)

with open("new_sample.csv", "w", encoding="UTF-8") as new_file:
    # the collected lines already end with a newline, so write them as-is
    new_file.writelines(new_sample)
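Once the new file exists you can read it back in as a normal CSV; a minimal check, assuming the cleaned file is named new_sample.csv as above:

import pandas as pd

# every row of the repaired file now has exactly three fields
df = pd.read_csv("new_sample.csv", encoding="UTF-8", sep=";")
print(df.head())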
CodePudding user response:
Define a custom function to read the CSV file, then create a new dataframe from rows and cols:
import pandas as pd

def read_csv(path):
    with open(path) as file:
        for line in file:
            # split on at most the first two ';' and keep the rest as the text field
            *v, t = line.strip().split(';', 2)
            yield [*v, t.replace(';', ',')]

cols, *rows = read_csv('sample.csv')
df = pd.DataFrame(rows, columns=cols)
print(df)
  customerId   name                                              text
0          1  Josey                              I want to go at 05pm
1          2   Mike                             Check this out --> öl
2          2  Frank  This is a longer text and,ups that should not be
3          2    Max                                             okay,
4          3  Josey                           here is everythink good
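If you also need the repaired data on disk, you can write the dataframe back out; a short sketch, with the file name sample_clean.csv chosen here just for illustration:

# index=False keeps the output shaped like the original file
df.to_csv('sample_clean.csv', sep=';', index=False)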
CodePudding user response:
FYI, if you use Python's csv library to write the initial file, it will handle embedded ; correctly:
import csv

with open("test.csv", "w") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["hello", "world", "hello;world"])

# test.csv contains: hello;world;"hello;world"
# which will be read back as three fields by csv.reader
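As a quick check (a sketch, assuming the test.csv written above), reading the file back with csv.reader returns the three original fields:

import csv

with open("test.csv", newline="") as f:
    for row in csv.reader(f, delimiter=";"):
        print(row)  # ['hello', 'world', 'hello;world']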
Here's how to fix your issue; I'll write out to a new file. It is possible to open a file in read/write mode, but that is more complicated: you need to read a line, move the position in the file, and write the new data without overwriting the bytes of the next line. It's much easier to write a new file and then rename it.
import csv

with open("input.csv") as in_file, open("output.csv", "w") as out_file:
    reader = csv.reader(in_file, delimiter=";")
    writer = csv.writer(out_file, delimiter=";")
    for line in reader:  # line is a list containing the fields
        if len(line) > 3:
            # merge everything after the second field back into one text field
            line = line[:2] + [", ".join(line[2:])]
        writer.writerow(line)
If you don't need to save the fixed file, you don't need to open "output.csv" or create the writer. Just print line after it has been corrected to display the list of fields, e.g. ["hello", "world", "hello;world"].
If you prefer to print the string that would end up in the file, you will need to wrap fields containing semicolons in quotes.
line = [f"\"{item}\"" if ";" in item else item for item in line]
print(";".join(line))
# hello;world;"hello;world"
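Alternatively, just as a sketch, you can let csv.writer do the quoting for you by pointing it at sys.stdout instead of quoting fields by hand:

import csv
import sys

# csv.writer quotes any field that contains the delimiter
writer = csv.writer(sys.stdout, delimiter=";")
writer.writerow(["hello", "world", "hello;world"])
# prints: hello;world;"hello;world"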
CodePudding user response:
Pandas (version >= 1.3.0) allows a function to be called to process a bad line when one is encountered, via the on_bad_lines parameter:
callable, function with signature (bad_line: list[str]) -> list[str] | None that will process a single bad line. bad_line is a list of strings split by the sep. If the function returns None, the bad line will be ignored. If the function returns a new list of strings with more elements than expected, a ParserWarning will be emitted while dropping extra elements. Only supported when engine="python"
So you could simply read the file with:
df = pd.read_csv('sample.csv', sep=';', engine='python', on_bad_lines=lambda x: x[:2] + [';'.join(x[2:])])
Then save it to whatever format you prefer. Or to achieve the output defined in the question:
df['text'] = df['text'].str.replace(';', ',')
df.to_csv('output.csv', sep=';', index=False)
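If you also want to keep printing which lines are corrupted (as in the question), the callable can do that before returning the repaired row. A sketch along the same lines; the handler name bad_line_handler is just chosen for illustration:

import pandas as pd

def bad_line_handler(fields):
    # fields is the raw line already split on ';'
    print("corrupted line:", ";".join(fields))
    # keep the first two fields, merge the rest back into one text field
    return fields[:2] + [";".join(fields[2:])]

df = pd.read_csv('sample.csv', sep=';', engine='python', on_bad_lines=bad_line_handler)
df['text'] = df['text'].str.replace(';', ',')
df.to_csv('output.csv', sep=';', index=False)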