Cleaning a CSV string to remove unwanted rows in memory

Time:07-08

I have a semicolon-separated CSV that uses double quotes as the quote character. I get this CSV from an API, and the raw data looks like this:

1;"[email protected]";"Toto";"Tata"
2;"BAD_DATA"
"BAD_DATA"
3;"alice@alice;com";"Alice";"Dupont"
4;"[email protected]";"Bob";"Morane"

The second and third lines are malformed because of encoding problems at the source, which I cannot change.

I would like to change the string to this :

1;"[email protected]";"Toto";"Tata"
3;"alice@alice;com";"Alice";"Dupont"
4;"[email protected]";"Bob";"Morane"

In other words, every line with the wrong number of fields should be deleted. Since my code will run on a server, I would like, if possible, to do all of this in memory, without writing to a file on the machine.

As you can see, some of the fields can contain semicolons so I need to protect those with double quotes around them. It adds a little complexity.

So far, I have this :

import csv

myString = f'''1;"[email protected]";"Toto";"Tata"
2;"BAD_DATA"
"BAD_DATA"
3;"alice@alice;com";"Alice";"Dupont"
4;"[email protected]";"Bob";"Morane"'''

lines =  myString.splitlines()

reader = csv.reader(lines, delimiter=';', quotechar='"')

for row in reader:
    if len(row) == 4:
        # I don't know what to do here
        pass

If anybody knows in what direction I can try to advance, that would be very nice :)

CodePudding user response:

Your code already correctly filters out rows that don't contain exactly 4 columns. You can use a csv.writer to write the rows that pass the filter to an io.StringIO file object, then get the output as a string with the object's getvalue method. However, since the first column of your expected output is unquoted and is the only numeric column, you should configure your writer to use the csv.QUOTE_NONNUMERIC quoting option, and convert the first column to an integer before passing the rows to the writer's writerows method for output:

import csv
from io import StringIO

myString = f'''1;"[email protected]";"Toto";"Tata"
2;"BAD_DATA"
"BAD_DATA"
3;"alice@alice;com";"Alice";"Dupont"
4;"[email protected]";"Bob";"Morane"'''

output = StringIO()
reader = csv.reader(StringIO(myString), delimiter=';')
writer = csv.writer(output, delimiter=';', quoting=csv.QUOTE_NONNUMERIC)
writer.writerows((int(id), *rest) for id, *rest in reader if len(rest) == 3)
print(output.getvalue())

This outputs:

1;"[email protected]";"Toto";"Tata"
3;"alice@alice;com";"Alice";"Dupont"
4;"[email protected]";"Bob";"Morane"
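The same idea can be wrapped in a reusable function. The following is only a sketch under a few assumptions that are not in the answer above: the function name keep_rows_with_width and its parameters are hypothetical, the sample address is a placeholder, blank lines are treated as bad rows, and a row whose first column is not an integer is also dropped. It passes lineterminator='\n' because csv.writer otherwise ends each row with '\r\n'.

```python
import csv
from io import StringIO


def keep_rows_with_width(text, width=4, delimiter=';'):
    """Return text with every CSV row whose field count differs from width removed."""
    output = StringIO()
    reader = csv.reader(StringIO(text), delimiter=delimiter, quotechar='"')
    # lineterminator='\n' replaces the '\r\n' that csv.writer emits by default.
    writer = csv.writer(output, delimiter=delimiter,
                        quoting=csv.QUOTE_NONNUMERIC, lineterminator='\n')
    for row in reader:
        if len(row) != width:
            continue  # also skips blank lines, which csv.reader yields as []
        try:
            first = int(row[0])
        except ValueError:
            continue  # assumption: a non-integer first column marks a bad row
        # Integers are written bare; every string field is quoted.
        writer.writerow([first, *row[1:]])
    return output.getvalue()


# Placeholder data modelled on the question; the address is made up.
sample = ('1;"toto@example.com";"Toto";"Tata"\n'
          '2;"BAD_DATA"\n'
          '"BAD_DATA"\n'
          '\n'
          '3;"alice@alice;com";"Alice";"Dupont"\n')
cleaned = keep_rows_with_width(sample)
print(cleaned)
```

Everything stays in memory: both the input and the output live in StringIO objects, so nothing is written to disk.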

CodePudding user response:

You can use a generator expression and the .join method of str:

result_str = ''
for row in reader:
    if len(row) == 4:
        result_str += ';'.join(str(x) for x in row) + '\n'

print(result_str.strip())

which prints:

1;[email protected];Toto;Tata
3;alice@alice;com;Alice;Dupont
4;[email protected];Bob;Morane
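Note that this output has lost its double quotes, so the row with id 3 would now be read back as five fields instead of four. If the quotes need to survive, one option is to re-add them while joining. This is a rough sketch, not part of the answer above: quote_field and filter_rows are hypothetical helper names, and it assumes the first column should stay bare while every other column is quoted.

```python
import csv


def quote_field(value):
    # Double any embedded quote characters, then wrap the field in quotes.
    return '"' + value.replace('"', '""') + '"'


def filter_rows(text, width=4):
    kept = []
    for row in csv.reader(text.splitlines(), delimiter=';', quotechar='"'):
        if len(row) == width:
            # Leave the first column bare and quote the remaining ones.
            kept.append(';'.join([row[0]] + [quote_field(x) for x in row[1:]]))
    return '\n'.join(kept)


print(filter_rows('3;"alice@alice;com";"Alice";"Dupont"\n"BAD_DATA"'))
```

Letting csv.writer handle the quoting, as in the first answer, is usually less error-prone than quoting by hand.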