Regex in Python: Can you split a string on a delimiter with exceptions?

Time:07-09

I am trying to parse a long string of 'objects' that are enclosed in quotes and delimited by commas. Example:

s='"12345","X","description of x","X,Y",,,"345355"'

output=['"12345"','"X"','"description of x"','"X,Y"','','','"345355"']

I am using split to break the string on commas:


s='"12345","X","description of x","X,Y",,,"345355"'
s.split(',')

This almost works, but for the segment ...,"X,Y",... the split cuts the quoted data into "X and Y". I need the split to ignore commas inside quotes.

Is there a way I can split on commas except for those inside quotes?

I tried using a regex, but it ignores the ...,,,... in the data because blank fields are not quoted in the file I'm parsing. I am not an expert with regex; the sample I used comes from "Python split string on quotes". I understand what that example is doing, but I am not sure how to modify it so that it also parses data that is not enclosed in quotes.

Thanks!

CodePudding user response:

This should work:

In [1]: import re

In [2]: s = '"12345","X","description of x","X,Y",,,"345355"'

In [3]: pattern = r"(?<=[\",]),(?=[\",])"

In [4]: re.split(pattern, s)
Out[4]: ['"12345"', '"X"', '"description of x"', '"X,Y"', '', '', '"345355"']

Explanation:

  • (?<=...) is a "positive lookbehind assertion". It causes your pattern (in this case, just a comma, ",") to match commas in the string only if they are preceded by the pattern given by .... Here, ... is [\",], which means "either a quotation mark or a comma".
  • (?=...) is a "positive lookahead assertion". It causes your pattern to match commas in the string only if they are followed by the pattern specified as ... (again, [\",]: either a quotation mark or a comma).
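  • Since both of these assertions must be satisfied for the pattern to match, a comma inside one of your 'objects' is left alone unless it sits directly between two characters that are each a quotation mark or a comma. Most 'objects' that begin or end with a comma therefore still work, but an 'object' such as ",,x" or "," would be split.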
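
As a minimal sketch, the pattern can be wrapped in a small helper and checked against the sample string. The name split_quoted is made up for illustration, and the sketch assumes every non-empty field is quoted and contains no quote characters. One thing it makes visible: an empty field at the very start or end of the line is not separated, because the lookbehind or lookahead has no character to match there.

```python
import re

# Split only on commas that have a quote or a comma on both sides.
FIELD_SEP = re.compile(r'(?<=[",]),(?=[",])')

def split_quoted(line):
    """Hypothetical helper: split a line of quoted, comma-separated fields."""
    return FIELD_SEP.split(line)

s = '"12345","X","description of x","X,Y",,,"345355"'
print(split_quoted(s))
# ['"12345"', '"X"', '"description of x"', '"X,Y"', '', '', '"345355"']

# A trailing empty field is NOT split off: nothing follows the last comma,
# so the lookahead fails.
print(split_quoted('"a",'))
# ['"a",']
```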

CodePudding user response:

You can replace all quotes with an empty string and then split on commas.

s='"12345","X","description of x","X,Y",,,"345355"'
s=s.replace('"','')
splits = s.split(",")
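
A quick check of this approach on the sample string, as a sketch: because the quotes are removed before splitting, nothing protects the comma in X,Y any more, so that field ends up as two items.

```python
s = '"12345","X","description of x","X,Y",,,"345355"'

# Strip the quotes first, then split on every comma.
splits = s.replace('"', '').split(',')
print(splits)
# ['12345', 'X', 'description of x', 'X', 'Y', '', '', '345355']
```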

CodePudding user response:

Split by " (quote) instead of by , (comma). This splits the string into a list that also contains the leftover commas as separate elements; you can then remove every element that consists only of commas:

s='"12345","X","description of x","X,Y",,,"345355"'

temp = s.split('"')
print(temp)
#> ['', '12345', ',', 'X', ',', 'description of x', ',', 'X,Y', ',,,', '345355', '']
values_to_remove = ['', ',', ',,,']

result = list(filter(lambda val: val not in values_to_remove, temp))

print(result)
#> ['12345', 'X', 'description of x', 'X,Y', '345355']
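
An alternative that none of the answers above mention, offered as a hedged sketch: the standard library's csv module already understands quoted fields, so it keeps the comma in "X,Y" and preserves the two empty fields that the quote-splitting approach drops. It returns the values without their surrounding quotes, which differs slightly from the output asked for in the question. The helper name parse_line is made up for illustration.

```python
import csv

def parse_line(line):
    """Hypothetical helper: parse one comma-separated line with quoted fields."""
    # csv.reader accepts any iterable of strings, so wrap the line in a list.
    return next(csv.reader([line]))

s = '"12345","X","description of x","X,Y",,,"345355"'
print(parse_line(s))
# ['12345', 'X', 'description of x', 'X,Y', '', '', '345355']
```

If the quoted form is needed, the quotes can be added back to the non-empty values afterwards.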