Home > database >  clean csv file with python
clean csv file with python

Time:04-19


I have a csv file that i'm trying to clean up with python.
it has lines separated by << \n >> or empty lines.
I would like each line that does not end with << " >> to be cut/pasted to the previous line.
here is a concrete example to be more explicit!\

CSV FILE I HAVE

*"id","name","age","city","remark"\
"1","kevin","27","paris","This is too bad"\
"8","angel","18","london","Incredible !!!"\
"14","maria","33","madrid","i can't believe it."\
"16","john","28","new york","hey men,\
\nhow do you did this"\
"22","naima","35","istanbul","i'm sure it's false,\
\
\nit can't be real"
"35","marco","26","roma","you'r my hero!"\
"39","lili","37","tokyo","all you need to knows.\
\n\nthe best way to upgrade easely"\
...*

CSV FILE I WOULD LIKE TO HAVE

*"id","name","age","city","remark"\
"1","kevin","27","paris","This is too bad"\
"8","angel","18","london","Incredible !!!"\
"14","maria","33","madrid","i can't believe it."\
"16","john","28","new york","hey men,how do you did this"\
"22","naima","35","istanbul","i'm sure it's false, it can't be real"\
"35","marco","26","roma","you'r my hero!"\
"39","lili","37","tokyo","all you need to knows. the best way to upgrade easely"\
...*

someone would be how to do?
thank you in advance for your help !

i'm actually try this python code -->

text = open("input.csv", "r", encoding='utf-8') 
  
text = ''.join([i for i in text])  
  
text = text.replace("\\n", "")
 
x = open("output.csv","w") 
  
x.writelines(text) 
x.close()

CodePudding user response:

for this_row in read_file.readlines():
    if not this_row.startswith('"'):
        prev_row = prev_row.rstrip('\n')   this_row
    else:
        write_file.write(prev_row)
        prev_row = this_row

Just a draft. You can use str.join with list-cache to gain enhancement

CodePudding user response:

There are a few points to make here:

  1. Your CSV files contains , characters inside your remarks. This means the field must be enclosed in quotes (which it is).

  2. A CSV file is allowed to contain a newline within a single field. This does not result in an extra data row, but it does make the file weird to read for a human.

  3. Python's CSV reader will automatically handle newlines in fields.

  4. Finally, your data appears to be encoded strangely and you wish to remove all the extra newlines. Each row also has a trailing backslash character that should not be there.

I suggest the following approach:

  1. Use Python's CSV reader to correctly read one row at a time (you have 7 rows a header).
  2. Remove any newlines from the remark field.

For example:

import csv

with open('input.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    for row in csv_input:
        if len(row) == 5:       # skip blank lines
            row[4] = row[4].replace('\n', '').replace('\\n', ' ').replace('\\', '')
            csv_output.writerow(row)

This would give you:

id,name,age,city,remark\
1,kevin,27,paris,This is too bad
8,angel,18,london,Incredible !!!
14,maria,33,madrid,i can't believe it.
16,john,28,new york,"hey men, how do you did this"
22,naima,35,istanbul,"i'm sure it's false, it can't be real"
35,marco,26,roma,you'r my hero!
39,lili,37,tokyo,all you need to knows.  the best way to upgrade easely

CodePudding user response:

input.csv file content:

"id","name","age","city","remark"
"1","kevin","27","paris","This is too bad"
"8","angel","18","london","Incredible !!!"
"14","maria","33","madrid","i can't believe it."
"16","john","28","new york","hey men,
how do you did this"
"22","naima","35","istanbul","i'm sure it's false,

nit can't be real"
"35","marco","26","roma","you'r my hero!"
"39","lili","37","tokyo","all you need to knows.

the best way to upgrade easely"

Possible solution is the following:

with open('input.csv', 'r', encoding='utf-8') as file:
    data = file.read()
    
clean_data = data.replace('"\n"', '"||"').replace("\n", "").replace('"||"', '"\n"')
    
with open('output.csv', 'w', encoding='utf-8') as file:
    file.write(clean_data)

Returns output.csv content:

"id","name","age","city","remark"
"1","kevin","27","paris","This is too bad"
"8","angel","18","london","Incredible !!!"
"14","maria","33","madrid","i can't believe it."
"16","john","28","new york","hey men,how do you did this"
"22","naima","35","istanbul","i'm sure it's false,nit can't be real"
"35","marco","26","roma","you'r my hero!"
"39","lili","37","tokyo","all you need to knows.the best way to upgrade easely"
  • Related