Returning empty string for missing capture group Python regex

Time:02-22

I'm parsing text that contains information on university, year, degree field, and whether or not a person graduated. Here are two examples:

ex1 = 'BYU: 1990 Bachelor of Arts Theater (Graduated):BYU: 1990 Bachelor of Science Mathematics (Graduated):UNIVERSITY OF VIRGINIA: 1995 Master of Science Mechanical Engineering (Graduated):MICHIGAN STATE UNIVERSITY: 2008 Master of Fine Arts INDUSTRIAL DESIGN (Graduated)'

ex2 = 'UCSD: 2001 Bachelor of Arts English:UCLA: 2005 Bachelor of Science Economics (Graduated):UCSD 2010 Master of Science Economics'

What I am struggling to accomplish is to have an entry for each school experience regardless of whether specific information is missing. In particular, imagine I wanted to pull whether each degree was finished from ex1 and ex2 above. When I try to use re.findall I end up with something like the following for ex1:

# Code:
re.findall(r'[A-Z ]+: \d+ Bachelor [^:]+\((Graduated)', ex1)

# Output:
['Graduated', 'Graduated']

which is what I want: two entries for two Bachelor's degrees. In ex2, however, one of the Bachelor's degrees was unfinished, so its text does not contain "(Graduated)" and the output is the following:

# Code:
re.findall(r'[A-Z ]+: \d+ Bachelor [^:]+\((Graduated)', ex2)

# Output:
['Graduated']

# Desired Output:
['', 'Graduated']

I have tried making the capture group optional and including the colon after "Graduated", but I am not making much headway. My example uses the "Graduated" information, but the more general question is how to handle a degree that is identifiable yet missing one or two pieces of information (such as graduation year or university). Ultimately I want a complete record for each degree, with missing pieces shown as empty. Thank you for any help you can provide!

CodePudding user response:

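You can use the ? quantifier to make both the opening parenthesis and the "Graduated" group optional, so that each matches zero or one time. The character class before them is changed to [^:()]* so that it stops at the parenthesis. When an optional group does not take part in a match, re.findall returns an empty string for it.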

re.findall(r'[A-Z ]+: \d+ Bachelor [^:()]*\(?(Graduated)?', ex2)

Output:

>>> re.findall(r'[A-Z ]+: \d+ Bachelor [^:()]*\(?(Graduated)?', ex2)
['', 'Graduated']
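
As a small check of the behaviour described above, the sketch below runs the same pattern through both re.findall and re.finditer. The variable names (PATTERN, via_findall, via_finditer) are only illustrative. It shows that the empty string comes from re.findall itself; a match object reports the same unmatched group as None.

```python
import re

ex2 = ('UCSD: 2001 Bachelor of Arts English:'
       'UCLA: 2005 Bachelor of Science Economics (Graduated):'
       'UCSD 2010 Master of Science Economics')

# Same pattern as in the answer; both "(" and the capture group are optional.
PATTERN = re.compile(r'[A-Z ]+: \d+ Bachelor [^:()]*\(?(Graduated)?')

# findall substitutes '' for a group that did not participate in the match.
via_findall = PATTERN.findall(ex2)

# A match object returns None for the same group instead.
via_finditer = [m.group(1) for m in PATTERN.finditer(ex2)]

print(via_findall)   # ['', 'Graduated']
print(via_finditer)  # [None, 'Graduated']
```

If you switch to re.finditer later (for example to read several groups at once), remember to convert None to '' yourself.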

CodePudding user response:

How about this?

[re.sub(r'[(:)]', '', t) for t in [re.sub(r'^[^\(]+', '', s) for s in re.findall(r'[A-Z ]+: \d+ Bachelor [^:]+:', ex1)]]
# output ['Graduated', 'Graduated']

[re.sub(r'[(:)]', '', t) for t in [re.sub(r'^[^\(]+', '', s) for s in re.findall(r'[A-Z ]+: \d+ Bachelor [^:]+:', ex2)]]
# output ['', 'Graduated']
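
For the more general case raised in the question, where any field might be missing, one possible approach is to split the string into records first and then parse each record with optional named groups. This is only a sketch: the names RECORD and parse_degrees, the field names, and the assumption that records are separated by a colon with no space after it are all inferred from the two sample strings, not from any known specification of the data.

```python
import re

ex2 = ('UCSD: 2001 Bachelor of Arts English:'
       'UCLA: 2005 Bachelor of Science Economics (Graduated):'
       'UCSD 2010 Master of Science Economics')

# Hypothetical record layout, guessed from the samples in the question.
RECORD = re.compile(
    r'(?P<school>[A-Z]+(?: [A-Z]+)*):?\s+'  # upper-case school name, colon optional
    r'(?:(?P<year>\d{4})\s+)?'              # optional four-digit year
    r'(?P<degree>[^():]+?)\s*'              # degree text, up to an optional "(...)"
    r'(?:\((?P<status>[^)]*)\))?'           # optional status such as "(Graduated)"
)


def parse_degrees(text):
    """Return one dict per record, with '' for every missing field."""
    rows = []
    # Assumption: a colon directly followed by a non-space character separates
    # records; the colon after a school name is followed by a space.
    for record in re.split(r':(?=\S)', text):
        m = RECORD.fullmatch(record.strip())
        # Keep an all-empty row for records that do not fit the layout.
        fields = m.groupdict() if m else dict.fromkeys(RECORD.groupindex)
        rows.append({key: value or '' for key, value in fields.items()})
    return rows


rows = parse_degrees(ex2)
print([row['status'] for row in rows])
```

Because every record produces a row, the result has one entry per school experience, and a missing year or status shows up as an empty string instead of dropping the entry.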
