I've got the following 2 records:
Input |
---|
Marvel Comics Presents12 (1982) #125 |
Marvel Comics Presents #1427 (1988) |
I want to parse it into the following format using RegEx:
Title | Year | Serial Number |
---|---|---|
Marvel Comics Presents12 | (1982) | #125 |
Marvel Comics Presents | (1988) | #1427 |
I do know basic RegEx but feel like I'm a little lackluster here. Is there a specific topic within RegEx that helps with this type of problem?
CodePudding user response:
Try creating match groups for what's inside the parentheses and for the number after the #, then use the same RegEx again to replace that text with nothing. Like this:
import re

def extract(el):
    # Capture the year inside the parentheses, then strip "(year)" from the text.
    year = int(re.search(r'\((.*)\)', el).group(1))
    el = re.sub(r'\(.*\)', '', el)
    # Capture the digits after '#', then strip "#digits" from the text.
    serial = int(re.search(r'#(\d*)', el).group(1))
    el = re.sub(r'#\d*', '', el)
    # Whatever is left, minus surrounding whitespace, is the title.
    return {'year': year, 'serial': serial, 'title': el.strip()}

data = ['Marvel Comics Presents12 (1982) #125', 'Marvel Comics Presents #1427 (1988)']
data = [extract(el) for el in data]
print(data)  # => [{'year': 1982, 'serial': 125, 'title': 'Marvel Comics Presents12'}, {'year': 1988, 'serial': 1427, 'title': 'Marvel Comics Presents'}]
The RegExs here are:
\((.*)\) to match what is inside the parentheses
#(\d*) to match the number after the # symbol
I removed the match groups from the RegExs that replace text because they are not needed and might speed up the code a bit.
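One caveat with the patterns above: \((.*)\) is greedy, so a title that itself contains parentheses would make it capture too much, and #(\d*) also matches a bare # with no digits, in which case int('') raises a ValueError. If that matters for your data, a tighter variant could look like the sketch below. It assumes the year is always exactly four digits, and the names YEAR_RE, SERIAL_RE and parse_record are just illustrative, not anything from the code above.

```python
import re

# Hypothetical names, chosen for this sketch only.
YEAR_RE = re.compile(r'\((?P<year>\d{4})\)')  # exactly four digits inside parentheses
SERIAL_RE = re.compile(r'#(?P<serial>\d+)')   # one or more digits after '#'

def parse_record(text):
    year = YEAR_RE.search(text)
    serial = SERIAL_RE.search(text)
    if year is None or serial is None:
        return None  # assumption: records missing either field are skipped
    # Remove the first "(year)" and the first "#serial", in whichever order they appear.
    title = SERIAL_RE.sub('', YEAR_RE.sub('', text, count=1), count=1)
    return {
        'title': ' '.join(title.split()),  # collapse leftover whitespace
        'year': int(year.group('year')),
        'serial': int(serial.group('serial')),
    }

records = [
    'Marvel Comics Presents12 (1982) #125',
    'Marvel Comics Presents #1427 (1988)',
]
parsed = [parse_record(r) for r in records]
print(parsed)
```

Named groups like (?P<year>...) are optional here; numbered groups work just as well, the names only make the lookups easier to read.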