import re
splitRegex = r"(Personal Info|Personal|Personal Information)|(Work Experience|Work)|(Education|School|Certificates)"
text = "Personal Info\nText\nText\nText\nText\nWork Experience\nText\nText\nText\nText\nEducation\nText\nText\nText\nText\nText"
x = [tuple(i.splitlines()) for i in re.split(splitRegex, text) if i != ""]
d = dict([("".join(x[i]), x[i + 1]) for i in range(0, len(x) // 2, 2)])
print(d)
In the example code above I want to split the text into sections based on their titles. I want to match these titles with a regex (as there can be synonyms). However, the list returned by re.split often contains None entries, which causes errors. If I add an if condition to check that i is not None, the errors disappear, but the dictionary ends up missing a lot of data.
Would anyone know of a way to fix this, or another way to achieve the same thing?
Keep in mind that the above is just an example. I need to use this for CVs/résumés, so the layout and titles can differ slightly depending on which CV is used.
CodePudding user response:
You're getting None for all the groups that don't match, since re.split() includes all capture groups in the resulting list.
You should put each list of alternatives in a non-capturing group, and then put all of them in a single capturing group so you just get the matching label.
splitRegex = r"((?:Personal Info|Personal|Personal Information)|(?:Work Experience|Work)|(?:Education|School|Certificates))"
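As a rough sketch of how that pattern could be used end to end (the names split_regex, parts, headings, bodies and sections are only illustrative, and it assumes every heading is followed by its own body text): with a single capturing group, re.split() returns the text before the first heading, then headings and bodies in alternation, so they can be paired by slicing.

```python
import re

# One capturing group around non-capturing alternatives: re.split() keeps
# the matched heading in the result and never inserts None placeholders.
split_regex = r"((?:Personal Info|Personal|Personal Information)|(?:Work Experience|Work)|(?:Education|School|Certificates))"

text = "Personal Info\nText\nText\nText\nText\nWork Experience\nText\nText\nText\nText\nEducation\nText\nText\nText\nText\nText"

parts = re.split(split_regex, text)

# parts[0] is whatever precedes the first heading (an empty string here).
# After that, headings sit at odd indices and their bodies at even indices.
headings = parts[1::2]
bodies = parts[2::2]

# strip() removes the newline left over after each heading line.
sections = {h: tuple(b.strip().splitlines()) for h, b in zip(headings, bodies)}
print(sections)
```

One thing to watch with real CVs: regex alternatives are tried left to right, so "Personal Info" wins before "Personal Information" is ever tried. Listing the longer synonyms first would probably be safer.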
CodePudding user response:
You can filter out the None values in the result at the loop level:
x = [tuple(i.splitlines()) for i in re.split(splitRegex, text) if (i != "" and i is not None)]
This way x will be something like:
[('Personal Info',), ('', 'Text', 'Text', 'Text', 'Text'),
('Work Experience',), ('', 'Text', 'Text', 'Text', 'Text'),
('Education',), ('', 'Text', 'Text', 'Text', 'Text', 'Text')]
which will probably make your dict constructor happy.
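One caveat, offered as a sketch rather than a definitive fix: the range(0, len(x) // 2, 2) in the question visits only indices 0 and 2 when x has six items, so the last section would still be dropped even with the None values gone. Assuming the filtered list strictly alternates heading, body, heading, body, the pairs could be built with zip instead (split_regex is just a renamed copy of the question's pattern):

```python
import re

# The question's original pattern, with three separate capture groups.
split_regex = r"(Personal Info|Personal|Personal Information)|(Work Experience|Work)|(Education|School|Certificates)"
text = "Personal Info\nText\nText\nText\nText\nWork Experience\nText\nText\nText\nText\nEducation\nText\nText\nText\nText\nText"

# `if i` is falsy for both None and "", so one test drops both.
x = [tuple(i.splitlines()) for i in re.split(split_regex, text) if i]

# Even positions hold headings, odd positions hold the matching bodies.
d = {"".join(heading): body for heading, body in zip(x[::2], x[1::2])}
print(d)
```

Each body tuple still starts with an empty string, left over from the newline that follows the heading; it can be sliced off with body[1:] if it gets in the way.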