Splitting up a string by regex with multiple capture groups and OR | operators

Time:12-08

import re
splitRegex = r"(Personal Info|Personal|Personal Information)|(Work Experience|Work)|(Education|School|Certificates)"

text = "Personal Info\nText\nText\nText\nText\nWork Experience\nText\nText\nText\nText\nEducation\nText\nText\nText\nText\nText"

x = [tuple(i.splitlines()) for i in re.split(splitRegex, text) if i != ""]
d = dict([("".join(x[i]), x[i + 1]) for i in range(0, len(x) // 2, 2)])
print(d)

In the example code above, I want to split the text on section titles. I want to match these titles with a regex, since there can be synonyms. However, the list returned by re.split often contains None entries, which obviously causes errors. If I add a condition that skips items that are None, the errors disappear, but the dictionary ends up missing a lot of data.

Would anyone know of a way to fix this or of a way to achieve the same thing?

Keep in mind that the above is just an example. I need to use this for CVs/résumés, so the layout and titles can differ slightly depending on which CV is used.

CodePudding user response:

You're getting None for all the groups that don't match, since re.split() includes all capture groups in the resulting list.
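A small made-up input shows the effect. The pattern and string below are illustrative only and do not come from the question:

```python
import re

# Two separate capture groups joined by |: at most one of them
# participates in any given match.
pattern = r"(Work)|(School)"

parts = re.split(pattern, "aWorkbSchoolc")

# For every match, re.split emits one item per capture group,
# using None for the group that did not take part in the match.
print(parts)
```

Each delimiter contributes two items to the list here, one of which is always None.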

You should put each list of alternatives in a non-capturing group, and then put all of them in a single capturing group so you just get the matching label.

splitRegex = r"((?:Personal Info|Personal|Personal Information)|(?:Work Experience|Work)|(?:Education|School|Certificates))"
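As a rough sketch of how this could be used end to end: with a single outer capturing group, the split result alternates between titles and bodies, so the two can be paired by slicing. The variable names and the slicing approach below are my own assumptions, not part of the question. I have also moved "Personal Information" ahead of "Personal Info", because alternatives are tried left to right and the shorter one would otherwise match first and leave "rmation" in the body.

```python
import re

# One outer capturing group, non-capturing groups inside.
# Longer alternatives come first (reordering is an assumption of this sketch).
split_regex = (
    r"((?:Personal Information|Personal Info|Personal)"
    r"|(?:Work Experience|Work)"
    r"|(?:Education|School|Certificates))"
)

text = "Personal Info\nText\nText\nText\nText\nWork Experience\nText\nText\nText\nText\nEducation\nText\nText\nText\nText\nText"

parts = re.split(split_regex, text)

# parts[0] is whatever precedes the first title (an empty string here);
# after that the list alternates title, body, title, body, ...
titles = parts[1::2]
bodies = parts[2::2]

sections = {
    title: body.strip().splitlines()
    for title, body in zip(titles, bodies)
}
print(sections)
```

Note that any of these words appearing inside a body would also act as a split point, so for real CVs it may be worth anchoring the pattern to line starts.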

CodePudding user response:

You can filter out the None values in the list comprehension itself:

    x = [tuple(i.splitlines()) for i in re.split(splitRegex, text) if i is not None and i != ""]

This way x will be something like:

    [('Personal Info',), ('', 'Text', 'Text', 'Text', 'Text'), 
    ('Work Experience',), ('', 'Text', 'Text', 'Text', 'Text'), 
    ('Education',), ('', 'Text', 'Text', 'Text', 'Text', 'Text')]

which will probably make your dict constructor happy.
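One caveat: with six items in x, `range(0, len(x) // 2, 2)` from the question only visits indices 0 and 2, so the last section would still be missing from the dictionary. A possible way to pair the items instead, sketched under the assumption that the text always starts with a title (otherwise the title/body pairing would be shifted):

```python
import re

split_regex = r"(Personal Info|Personal|Personal Information)|(Work Experience|Work)|(Education|School|Certificates)"

text = "Personal Info\nText\nText\nText\nText\nWork Experience\nText\nText\nText\nText\nEducation\nText\nText\nText\nText\nText"

# `if i` drops both None and empty strings.
x = [tuple(i.splitlines()) for i in re.split(split_regex, text) if i]

# Even positions hold the titles, odd positions hold the bodies.
d = {"".join(title): body for title, body in zip(x[::2], x[1::2])}
print(d)
```

The bodies keep the leading empty string produced by the newline after each title; it can be dropped with `body[1:]` if it is not wanted.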
