Using regex to parse data between a delimiter and ending at a specified substring

Time:09-02

I'm trying to parse out the names from a bunch of semi-unpredictable strings. More specifically, I'm using Ruby, but I don't think that should matter much. This is a contrived example, but some sample strings are:

Eagles vs Bears
NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN
NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN
Philadelphia Eagles vs Chicago Bears - NFL Match
Phil.Eagles vs Chic.Bears
3agles vs B3ars

The regex I've come up with is

/([0-9A-Z .]*) vs ([0-9A-Z .]*)(?:[ -:]*tune)?/i

but in the case of "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN" I'm receiving Chicago Bears TUNE as the second match. I'm trying to split off "tune in" so it's in its own group.

I thought that adding (?:[ -:]*tune)? would separate the ending portion of the expression the same way that having vs in the middle does, but that doesn't seem to be the case. If I remove the ? at the end, it matches correctly for the above example, but it no longer matches Eagles vs Bears.
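The behaviour described above can be reproduced with a minimal Ruby sketch. It assumes the pattern is written as a regex literal with a leading slash; the constant names OPTIONAL_TUNE and REQUIRED_TUNE and the helper second_team are made up for illustration.

```ruby
# The pattern from the question, with the trailing group optional.
OPTIONAL_TUNE = /([0-9A-Z .]*) vs ([0-9A-Z .]*)(?:[ -:]*tune)?/i
# The same pattern with the final ? removed, so "tune" becomes mandatory.
REQUIRED_TUNE = /([0-9A-Z .]*) vs ([0-9A-Z .]*)(?:[ -:]*tune)/i

# Hypothetical helper: returns capture group 2, or nil when nothing matches.
def second_team(pattern, title)
  m = pattern.match(title)
  m && m[2]
end

long_title = "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN"

# Group 2 is greedy and its character class also accepts the letters and
# spaces of the trailing text, so it consumes them; the optional group then
# matches nothing and the regex still succeeds.
p second_team(OPTIONAL_TUNE, long_title)

# With "tune" mandatory, group 2 is forced to backtrack and stop before it...
p second_team(REQUIRED_TUNE, long_title)

# ...but a title without "tune" can no longer match at all.
p second_team(REQUIRED_TUNE, "Eagles vs Bears")
```

In other words, an optional group placed after a greedy group is only used if the greedy group fails to swallow the text first. Note also that inside a character class [ -:] is a range running from the space character to the colon, so it covers digits and several punctuation marks, not just space, hyphen and colon.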

If anyone could help me, I would greatly appreciate it if you could break down your regex piece by piece.

CodePudding user response:

You can make the second group pattern lazy and capture up to a -, a :, or tune (each optionally preceded by whitespace), or up to the end of the line:

([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))

See the regex demo.

Details:

  • ([\w .]*) - Group 1: zero or more word, space or . chars, as many as possible
  • vs (with a space on each side) - the literal string vs
  • ([\w .]*?) - Group 2: zero or more word, space or . chars, as few as possible
  • (?=\s*(?:[:-]|tune|$)) - a positive lookahead that requires the following pattern to appear immediately to the right of the current location:
    • \s* - zero or more whitespaces
    • (?:[:-]|tune|$) - : or -, tune or end of a line.
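The breakdown above can be tried against the sample titles from the question with a short Ruby sketch. The i flag is carried over from the question's pattern so that VS and TUNE match. The constant TEAMS, the helper extract_teams and the strip calls (which drop the leading space group 1 can pick up after "NFL Matchup:") are illustrative additions, not part of the answer.

```ruby
# The answer's pattern as a Ruby literal, case-insensitive like the original.
TEAMS = /([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))/i

# Hypothetical helper: returns [team1, team2], or nil if the title has no " vs ".
def extract_teams(title)
  m = TEAMS.match(title)
  return nil unless m

  # strip is an extra cleanup step assumed here; the regex alone can leave
  # a leading space in group 1.
  [m[1].strip, m[2].strip]
end

samples = [
  "Eagles vs Bears",
  "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
  "NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
  "Philadelphia Eagles vs Chicago Bears - NFL Match",
  "Phil.Eagles vs Chic.Bears",
  "3agles vs B3ars"
]

samples.each { |title| p extract_teams(title) }
```

Because group 2 is lazy, it grows one character at a time and stops at the first position where the lookahead succeeds, which is why the trailing TUNE IN text never ends up in the capture.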