REGEX - Extraction between range and also ignore certain word (python)

Time:05-10

I need to extract the vehicle class value in a way that covers several scenarios, so I tried to extract the text between "Class" and "Date". In some of the sample data, however, unwanted words such as HOLDER and TOLDER appear before "Date" and need to be ignored. I also tried an OR condition, but I was unable to exclude those words.

Regexes I tried:

  1. (?<=Class\s)[a-z A-Z(-|\s|\)]*(?=Date|TOLDER)
  2. (?<=Class\s)[a-z A-Z(-|\s|\)]*(?=Date)

sample data 1 : Vehicle Class LMV MCWG Date of Issue

sample data 2 : Vehicle Class MCWG Date of issue

sample data 3 : Vehicle Class LMV MCWG Date of issue

sample data 4 : Vehicle Class LMV MCWOG TOLDER SIGNATURE Date of Issue

sample data 5 : Vehicle Class MCWG LMV LMV-GV PSVBUS Date of issue

sample data 6 : Vehicle Class LMY MCWG HOLDER SIGNATURE Date of Issue

Expected output: the value between Class and Date (e.g. LMV MCWG for sample data 1, and LMY MCWG for sample data 6, where HOLDER SIGNATURE should be ignored)

CodePudding user response:

You can use the pattern (MC[A-Z] ).*(LM[A-Z] )|(LM[A-Z] ).*(MC[A-Z] )
See https://regex101.com/r/08lN88/1

CodePudding user response:

You can match either HOLDER or TOLDER using a character class: [HT]OLDER. Instead of lookarounds, you can capture the data that you want in a capture group.

In your character class, \s already matches a space, so the separate literal space is redundant. If you want to match a pipe character, a single | is enough (note that | does not mean OR inside a character class).
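As a quick check of the two character-class points above, here is a minimal sketch. The variable names and test strings are made up for illustration and are not part of the question's data.

```python
import re

# Inside [...], "|" is a literal pipe character, not alternation,
# so this class matches "a", "|" or "b".
pipe_hits = re.findall(r"[a|b]", "a|b")
print(pipe_hits)

# \s already covers spaces, tabs and newlines, so no extra literal
# space is needed in the class to match "LMV MCWG\nLMV" in full.
whitespace_ok = re.fullmatch(r"[A-Z\s]+", "LMV MCWG\nLMV") is not None
print(whitespace_ok)
```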

To prevent partial word matches, you can add word boundaries (\b).
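To illustrate what the word boundary changes, the sketch below uses a hypothetical string containing "SubClass", which is not in the sample data. Without \b the pattern matches inside the longer word; with \b it does not.

```python
import re

# Hypothetical input where "Class" is only the tail of a longer word.
partial = "SubClass LMV"
whole = "Vehicle Class LMV"

# Without \b, "Class " is found inside "SubClass ".
no_boundary = re.search(r"Class\s", partial) is not None

# With \b, there is no boundary between "b" and "C", so nothing matches.
with_boundary = re.search(r"\bClass\s", partial) is not None

# A standalone "Class" still matches with the boundary in place.
standalone = re.search(r"\bClass\s", whole) is not None

print(no_boundary, with_boundary, standalone)
```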

\bClass\s([a-zA-Z(|)\s-]*?)\s*(?:Date|[HT]OLDER)\b

See a regex demo.

import re

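# Group 1 lazily captures the class values after "Class" and stops
# before the first "Date", "HOLDER" or "TOLDER".
pattern = r"\bClass\s([a-zA-Z(|)\s-]*?)\s*(?:Date|[HT]OLDER)\b"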

s = ("sample data 1 :\n"
            "Vehicle Class\n"
            "LMV\n"
            "MCWG\n"
            "Date of Issue\n\n"
            "sample data 2 :\n"
            "Vehicle Class MCWG\n"
            "Date of issue\n\n\n"
            "sample data 3 : \n"
            "Vehicle Class LMV MCWG\n"
            "Date of issue\n\n"
            "sample data 4 :\n"
            "Vehicle Class LMV MCWOG\n"
            "TOLDER SIGNATURE\n"
            "Date of Issue \n\n"
            "sample data 5 :\n"
            "Vehicle Class MCWG LMV LMV-GV PSVBUS\n"
            "Date of issue\n\n"
            "sample data 6 :\n"
            "Vehicle Class LMY MCWG\n"
            "HOLDER SIGNATURE\n"
            "Date of Issue ")

print(re.findall(pattern, s))

Output

['LMV\nMCWG', 'MCWG', 'LMV MCWG', 'LMV MCWOG', 'MCWG LMV LMV-GV PSVBUS', 'LMY MCWG']
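Note that the first result keeps the newline that separated LMV and MCWG in the input. If a single-line value is preferred, one option is to collapse the whitespace in each captured group. This is a minimal sketch, not part of the answer above: the helper name extract_vehicle_classes and the shortened sample text are assumptions for illustration.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
PATTERN = re.compile(r"\bClass\s([a-zA-Z(|)\s-]*?)\s*(?:Date|[HT]OLDER)\b")


def extract_vehicle_classes(text):
    """Return each captured class value with whitespace runs collapsed to one space."""
    return [" ".join(value.split()) for value in PATTERN.findall(text)]


# Shortened sample text based on sample data 1 and 6 from the question.
sample = (
    "Vehicle Class\nLMV\nMCWG\nDate of Issue\n\n"
    "Vehicle Class LMY MCWG\nHOLDER SIGNATURE\nDate of Issue"
)

classes = extract_vehicle_classes(sample)
print(classes)  # ['LMV MCWG', 'LMY MCWG']
```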