Splitting up a string by regex with multiple capture groups and OR

import re
splitRegex = r"(Personal Info|Personal|Personal Information)|(Work Experience|Work)|(Education|School|Certificates)"

text = "Personal Info\nText\nText\nText\nText\nWork Experience\nText\nText\nText\nText\nEducation\nText\nText\nText\nText\nText"

x = [tuple(i.splitlines()) for i in re.split(splitRegex, text) if i != ""]
d = dict([("".join(x[i]), x[i + 1]) for i in range(0, len(x) // 2, 2)])
print(d)

In the above example code I want to split up the text based on titles. I want to determine these titles by regex (as there can be synonyms). However, the list returned by re.split often contains None, which causes errors. If I add an if condition to skip entries that are None, the errors disappear, but the dictionary ends up missing a lot of data.

Would anyone know of a way to fix this or of a way to achieve the same thing?

Keep in mind that the above is just an example. I need to use this for CVs/résumés, so the layout and titles can differ slightly from one CV to the next.

CodePudding user response：

You're getting None for all the groups that don't match, since re.split() includes all capture groups in the resulting list.
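To make that behaviour visible, here is a minimal sketch with a shortened, made-up pattern and input (only two of the question's titles are used). Each match contributes one list entry per capture group, and the group that did not take part in the match shows up as None.

```python
import re

# Two capture groups joined by "|": only one of them can take part in any match.
pattern = r"(Work)|(School)"

parts = re.split(pattern, "Work\nText\nSchool\nText")

# Every match adds one entry per group; the non-matching group is None.
print(parts)
# ['', 'Work', None, '\nText\n', None, 'School', '\nText']
```

The leading empty string is the text before the first match, which is empty here because the input starts with a title.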

You should put each list of alternatives in a non-capturing group, and then put all of them in a single capturing group so you just get the matching label.

splitRegex = r"((?:Personal Info|Personal|Personal Information)|(?:Work Experience|Work)|(?:Education|School|Certificates))"
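A sketch of how the single-group pattern could be used end to end follows. The names split_regex, headings, bodies and sections, and the shortened sample text, are my own. I have also moved "Personal Information" ahead of "Personal Info", which is an assumption about what is wanted: alternation tries alternatives left to right, so with the original order "Personal Information" would be cut off after "Personal Info".

```python
import re

# One capturing group around everything, non-capturing groups inside.
# Longer alternatives come first so they are not shadowed by their prefixes.
split_regex = (
    r"((?:Personal Information|Personal Info|Personal)"
    r"|(?:Work Experience|Work)"
    r"|(?:Education|School|Certificates))"
)

text = "Personal Info\nText\nText\nWork Experience\nText\nText\nEducation\nText\nText\nText"

# With exactly one capturing group the result alternates:
# text before first title, title, body, title, body, ...
parts = re.split(split_regex, text)

headings = parts[1::2]  # every matched title
bodies = parts[2::2]    # the text following each title

# strip() removes the newlines around each body before splitting it into lines.
sections = {h: tuple(b.strip().splitlines()) for h, b in zip(headings, bodies)}
print(sections)
```

Pairing by slices avoids index arithmetic, and parts[0] (anything before the first title) is ignored. Note that the pattern is not anchored, so a body line that happens to contain "Work" or "School" would also be treated as a title.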

CodePudding user response：

You can filter out the None entries in the list comprehension itself:

    x = [tuple(i.splitlines()) for i in re.split(splitRegex, text) if i != "" and i is not None]

This way x will be something like:

    [('Personal Info',), ('', 'Text', 'Text', 'Text', 'Text'), 
    ('Work Experience',), ('', 'Text', 'Text', 'Text', 'Text'), 
    ('Education',), ('', 'Text', 'Text', 'Text', 'Text', 'Text')]

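which no longer contains None, so the dict constructor will not raise. Note that range(0, len(x) // 2, 2) in the question only visits indices 0 and 2 for this six-element list, so the Education pair is skipped; range(0, len(x) - 1, 2) visits every title.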
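A complete version of this filtering approach might look like the sketch below. It keeps the question's original three-group pattern and assumes the text starts with a title, so that x strictly alternates title, body. The variable names, the shortened sample text, and the choice to drop empty lines from each body are mine.

```python
import re

split_regex = r"(Personal Info|Personal|Personal Information)|(Work Experience|Work)|(Education|School|Certificates)"
text = "Personal Info\nText\nText\nWork Experience\nText\nText\nEducation\nText\nText\nText"

# "if part" drops both None (groups that did not match) and empty strings.
x = [tuple(part.splitlines()) for part in re.split(split_regex, text) if part]

# x alternates title, body, title, body, ... so walk it two items at a time.
# Each body starts with "" (the newline after the title); empty lines are dropped.
d = {
    x[i][0]: tuple(line for line in x[i + 1] if line)
    for i in range(0, len(x) - 1, 2)
}
print(d)
```

If a document has text before its first title, the alternation shifts by one and the pairs no longer line up, which is one reason the single-capturing-group pattern is easier to work with.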