REGEX for complex strings

Time:11-03

I have the lines from a CSV file:

  1. 315,"Misérables, Les (1995)",Drama|War

  2. 315,Big Bully (1996),Comedy|Drama

I want to split each line into a list of 3 elements, so I need a general REGEX that splits on ','. Since the title may itself contain a comma (as in the first line), commas inside the title must not be treated as separators. A title that contains commas is also wrapped in quotation marks, but the expression has to work for both cases. Is it possible to do this with REGEX?

I'm trying to learn REGEX by myself and I'm having difficulty understanding some cases. I would really appreciate your help!

CodePudding user response:

If you're trying to parse a .csv file, don't do it by hand; Python already has libraries that will do it for you, such as the built-in csv module.

Otherwise, if your string has quotation marks whenever there is a comma inside the title, and has none when there is not, you can do it like this:

>>> x = '315,"Misérables, Les (1995)",Drama|War'
>>> y = '315,Big Bully (1996),Comedy|Drama'
>>> x
'315,"Misérables, Les (1995)",Drama|War'
>>> y
'315,Big Bully (1996),Comedy|Drama'

>>> x.split('"') if len(x.split('"')) == 3 else x.split(',')
['315,', 'Misérables, Les (1995)', ',Drama|War']
>>> y.split('"') if len(y.split('"')) == 3 else y.split(',')
['315', 'Big Bully (1996)', 'Comedy|Drama']

When the line is split on the quotation marks, this leaves a comma at the end of the first part and at the start of the last part, so you will have to remove them afterwards.
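The cleanup step could look something like the sketch below. The helper name split_line is made up for illustration, and it assumes each line has at most one quoted field and no escaped quotes inside it:

```python
def split_line(line):
    """Split one line into [id, title, genres] (hypothetical helper)."""
    parts = line.split('"')
    if len(parts) == 3:
        # Quoted title: drop the comma left at the end of the first
        # part and at the start of the last part.
        return [parts[0].rstrip(','), parts[1], parts[2].lstrip(',')]
    # No quotes, so the title has no comma and a plain split is enough.
    return line.split(',')


x = '315,"Misérables, Les (1995)",Drama|War'
y = '315,Big Bully (1996),Comedy|Drama'
print(split_line(x))
print(split_line(y))
```

Both calls should give a list of 3 elements with no stray commas or quotation marks.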

CodePudding user response:

Actually, you do not need REGEX for this problem. The quoting support in Python's csv module will solve it.

For example:

import csv
filereader = csv.reader(csv_input_file, delimiter=',', quotechar='"')  # csv_input_file is an already opened file
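To check this against the two lines from the question without a file on disk, here is a minimal sketch that feeds them to csv.reader through io.StringIO. The variable names sample and rows are only for illustration:

```python
import csv
import io

sample = (
    '315,"Misérables, Les (1995)",Drama|War\n'
    '315,Big Bully (1996),Comedy|Drama\n'
)

# io.StringIO stands in for an open file here; with a real file,
# pass the object returned by open(path, newline='') instead.
rows = list(csv.reader(io.StringIO(sample), delimiter=',', quotechar='"'))
for row in rows:
    print(row)
```

Each row comes back as a list of 3 strings, with the quotation marks removed and the comma inside the first title kept.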

Give it a try; it should solve your problem.
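If you still want a REGEX for learning purposes, one possible sketch is to split only on commas that are followed by an even number of quotation marks, which means the comma is outside any quoted title. The names OUTSIDE_QUOTES and regex_split are made up here, and the pattern assumes the quotes in each line are balanced:

```python
import re

# A comma counts as a separator only if the rest of the line contains
# an even number of quotation marks (checked by the lookahead).
OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def regex_split(line):
    """Split on commas outside quotes, then drop the surrounding quotes."""
    return [field.strip('"') for field in OUTSIDE_QUOTES.split(line)]


print(regex_split('315,"Misérables, Les (1995)",Drama|War'))
print(regex_split('315,Big Bully (1996),Comedy|Drama'))
```

For real CSV files the csv module remains the safer choice, since it also handles cases such as escaped quotes inside a field.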
