String with optional double quotes converted to a list of lists-CodePudding

I do have some possible string formats that I am trying to convert to a list but I am having some trouble when there are double quotes in the string , which is supposed to be converted to a list.
For the simple format of: 'test 3,9,tag1,G,comment 1'
can be easily converted with the command : values = line.split(',') and get the result of values = ['test 3','9','tag1','G','comment 1'] .

The problem starts when the format of the string looks like : 'test 4,77,tag2,Y,"comment 2, comment 3"' which should be converted to a list like : ['test 4','77','tag2','Y','"comment 2, comment 3"']

Or the format: 'test 6,1,"tag2, tag3, tag4",Y,"comment 2, comment 3"' which should be converted to ['test 6','1','"tag2, tag3, tag4"','Y','"comment 2, comment 3"'] .

Obviously, if I use the line.split(',') command to the above strings I will get the list ['test 6','1','"tag2', 'tag3', 'tag4"','Y','"comment 2', 'comment 3"'] which is not the correct one.

My next step is going to be converting the information inside the double quotes to a list like : ['tag2', 'tag3', 'tag4'] , but this will be very simple once I manage to convert them to a separate string .

I considered using a regex solution but I don't seem to get it right. Any help would be appreciated.

CodePudding user response：

Don't re-invent the wheel. There are lots of edge conditions for parsing CSVs that the existing csv module handles well. Use the csv.reader() functionality to read the columns correctly and then you can post-process columns that need it:

input.csv

test 3,9,tag1,G,comment 1
test 4,77,tag2,Y,"comment 2, comment 3"
test 6,1,"tag2, tag3, tag4",Y,"comment 2, comment 3"

test.py

import csv

with open('input.csv','r',newline='') as f:
    r = csv.reader(f)
    for line in r:
        line[1] = int(line[1])        # sample post-read processing
        line[2] = line[2].split(', ')
        line[4] = line[4].split(', ')
        print(line)

Output:

['test 3', 9, ['tag1'], 'G', ['comment 1']]
['test 4', 77, ['tag2'], 'Y', ['comment 2', 'comment 3']]
['test 6', 1, ['tag2', 'tag3', 'tag4'], 'Y', ['comment 2', 'comment 3']]

Edit per comment

Per csv.reader docs:

csvfile can be any object which supports the iterator protocol and returns a string each time its __next__() method is called — file objects and list objects are both suitable.

From the comment it appears you have a byte stream. Simply wrap it in a TextIOWrapper that decodes the stream so it returns strings instead of bytes:

import csv
import io

# This is a file-like object returning a byte stream,
# like OPs comment.
csv_file = open('input.csv','rb')

with io.TextIOWrapper(csv_file,encoding='utf8') as f:
    r = csv.reader(f)
    for line in r:
        line[1] = int(line[1])        # sample post-read processing
        line[2] = line[2].split(', ')
        line[4] = line[4].split(', ')
        print(line)

And if you already have the data in a string, wrap it in StringIO:

import csv
import io

data = '''\
test 3,9,tag1,G,comment 1
test 4,77,tag2,Y,"comment 2, comment 3"
test 6,1,"tag2, tag3, tag4",Y,"comment 2, comment 3"
'''

with io.StringIO(data) as f:
    r = csv.reader(f)
    for line in r:
        line[1] = int(line[1])        # sample post-read processing
        line[2] = line[2].split(', ')
        line[4] = line[4].split(', ')
        print(line)

csv.reader also supports iterating over lists of strings, so this works as well:

import csv

data = '''\
test 3,9,tag1,G,comment 1
test 4,77,tag2,Y,"comment 2, comment 3"
test 6,1,"tag2, tag3, tag4",Y,"comment 2, comment 3"
'''

r = csv.reader(data.splitlines())
for line in r:
    line[1] = int(line[1])        # sample post-read processing
    line[2] = line[2].split(', ')
    line[4] = line[4].split(', ')
    print(line)

CodePudding user response：

If I understood your question correctly. This might not be the best solution but it does the job.

def my_splitter(input_string):
    temp_list = input_string.split('"')
    temp_list = [item.split(',') for item in temp_list]
    output_list = []
    for item in temp_list:
        output_list  = [s for s in item if s]
    return output_list

input_string = 'test 6,1,"tag2, tag3, tag4",Y,"comment 2, comment 3"'
output_list =  my_splitter(input_string)

>>> output_list
['test 6', '1', 'tag2', 'tag3', 'tag4', 'Y', 'comment 2', 'comment 3']

In the function, first you split the string by ("). And you get the following list:

['test 6,1,', 'tag2, tag3, tag4', ',Y,', 'comment 2, comment 3', '']

Next, you split each element in that list by (,) and you get the following list:

[['test 6', '1', ''], ['tag2', ' tag3', ' tag4'], ['', 'Y', ''], ['comment 2', ' comment 3'], ['']]

Finally, you loop through each sub-list and strip the elements of wite space in the start and end, and only append the items if they are not just white space. This is the final output:

['test 6', '1', 'tag2', 'tag3', 'tag4', 'Y', 'comment 2', 'comment 3']