Regex: getting the following data into groups

Time:01-21

I've got the following 2 records:

Input
Marvel Comics Presents12 (1982) #125
Marvel Comics Presents #1427 (1988)

I want to parse it into the following format using RegEx:

Title                     Year    Serial Number
Marvel Comics Presents12  (1982)  #125
Marvel Comics Presents    (1988)  #1427

I know basic RegEx, but I feel a little out of my depth here. Is there a specific topic within RegEx that helps with this type of problem?

CodePudding user response:

Try creating match groups (capture groups) for what's inside the parentheses and for the number after the #, then use the same RegEx again to replace that text with nothing, so that only the title remains. Like this:

import re


def extract(el):
    # Capture the text inside the parentheses as the year.
    year = int(re.search(r'\((.*)\)', el).group(1))
    # Strip the parenthesized year from the string.
    el = re.sub(r'\(.*\)', '', el)
    # Capture the digits that follow '#' as the serial number.
    serial = int(re.search(r'#(\d*)', el).group(1))
    # Strip the serial; what is left, minus surrounding whitespace, is the title.
    el = re.sub(r'#\d*', '', el)
    return {'year': year, 'serial': serial, 'title': el.strip()}


data = ['Marvel Comics Presents12 (1982) #125', 'Marvel Comics Presents #1427 (1988)']
data = [extract(el) for el in data]
print(data)
# => [{'year': 1982, 'serial': 125, 'title': 'Marvel Comics Presents12'},
#     {'year': 1988, 'serial': 1427, 'title': 'Marvel Comics Presents'}]

The RegExs here are:

  1. \((.*)\) to match what is inside the parentheses
  2. #(\d*) to match the number after the # symbol.
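To make the role of the match groups concrete, here is a small check of what each pattern returns for one of the records from the question. The variable names are only illustrative. group(0) is the whole match, including the parentheses or the #, while group(1) is just the captured part:

```python
import re

record = 'Marvel Comics Presents #1427 (1988)'

# Pattern 1: the outer \( \) match literal parentheses,
# the inner ( ) form the capture group.
year_match = re.search(r'\((.*)\)', record)
print(year_match.group(0))  # (1988)
print(year_match.group(1))  # 1988

# Pattern 2: '#' is matched literally, the digits after it are captured.
serial_match = re.search(r'#(\d*)', record)
print(serial_match.group(0))  # #1427
print(serial_match.group(1))  # 1427
```

Note that group(1) is always a string, which is why the answer wraps it in int().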

I removed the match groups from the RegExs that replace text because they are not needed and might speed up the code a bit.
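If the input might be messier than the two sample records, a slightly stricter variant could look like the sketch below. It is not part of the answer above: the function name parse_record, the named groups, and the assumption that a year is always four digits are all hypothetical choices made for illustration. It compiles the patterns once, requires at least one digit after the #, and returns None instead of raising when a field is missing:

```python
import re

# Assumption: the year is exactly four digits inside parentheses.
YEAR_RE = re.compile(r'\((?P<year>\d{4})\)')
# Assumption: the serial is one or more digits directly after '#'.
SERIAL_RE = re.compile(r'#(?P<serial>\d+)')


def parse_record(text):
    """Split a record into title, year and serial; return None if a field is missing."""
    year = YEAR_RE.search(text)
    serial = SERIAL_RE.search(text)
    if year is None or serial is None:
        return None
    # Remove the first year and the first serial, then collapse leftover whitespace.
    title = SERIAL_RE.sub('', YEAR_RE.sub('', text, count=1), count=1)
    title = ' '.join(title.split())
    return {
        'title': title,
        'year': int(year.group('year')),
        'serial': int(serial.group('serial')),
    }


records = ['Marvel Comics Presents12 (1982) #125', 'Marvel Comics Presents #1427 (1988)']
for record in records:
    print(parse_record(record))
```

Because the year and the serial are searched for independently, the order in which they appear in the record does not matter, which is what makes both sample records parse with the same code.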
