Python or PySpark Regular Expression for leading or trailing defined string

Time:02-05

I am working through a huge list of customer package names that need to be parsed to extract price information. Sample package names are as follows:

  1. Jan24_Package1_USD2_Rest_Of_String
  2. Jan25_Package2_2USD_Rest_Of_String
  3. Jan26_Package3_USD_2_Rest_Of_String
  4. Jan24_Package4_2_USD_Rest_Of_String

So in the first and third strings USD comes before the value 2, and in the other two USD comes after it. I am looking for a regular expression that returns 2 in all four cases.

I was trying to use group 3, (\d+), of the following pattern:

(USD)(_*)(\d+)(_*)

This works fine for string 1 and 3, but it doesn't work with string 2 and 4.
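The reason is that the pattern only accepts digits that come after USD, so names where the digits come first never match. A minimal sketch that reproduces this (the helper name group3_or_none is only for illustration):

```python
import re

# The pattern from the question: digits are only allowed AFTER "USD".
pattern = re.compile(r'(USD)(_*)(\d+)(_*)')

names = [
    'Jan24_Package1_USD2_Rest_Of_String',
    'Jan25_Package2_2USD_Rest_Of_String',
    'Jan26_Package3_USD_2_Rest_Of_String',
    'Jan24_Package4_2_USD_Rest_Of_String',
]

def group3_or_none(name):
    """Return capture group 3 of the first match, or None if nothing matches."""
    m = pattern.search(name)
    return m.group(3) if m else None

results = [group3_or_none(n) for n in names]
print(results)  # ['2', None, '2', None]
```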

Looking for a solution here. Thanks a lot.

CodePudding user response:

This can be solved by covering both cases in one alternation; the value ends up in capture group 2 (USD leading) or group 3 (USD trailing):

import re
strings = ['Jan24_Package1_USD2_Rest_Of_String', 
           'Jan25_Package2_2USD_Rest_Of_String', 
           'Jan26_Package3_USD_2_Rest_Of_String', 
           'Jan24_Package4_2_USD_Rest_Of_String']

for string in strings:
    # Group 2 captures digits after USD; group 3 captures digits before USD.
    match = re.search(r'.*_(USD_?(\d+)|(\d+)_?USD)', string)
    if match:
        # Print group 2, or group 3 if group 2 did not participate in the match.
        if match.group(2):
            print(match.group(2))
        else:
            print(match.group(3))
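
For a large list it may be handier to wrap this in a function that returns the number. The following is only a sketch of the same two-case idea: the names PRICE_RE and extract_usd_amount are made up for illustration, the pattern is compiled once, and the leading `.*` is dropped because re.search already scans the whole string.

```python
import re

# Hypothetical names; same two cases as above, using a non-capturing outer group.
# Group 1: digits after USD (USD2, USD_2). Group 2: digits before USD (2USD, 2_USD).
PRICE_RE = re.compile(r'_(?:USD_?(\d+)|(\d+)_?USD)')

def extract_usd_amount(name):
    """Return the USD amount in a package name as an int, or None if absent."""
    m = PRICE_RE.search(name)
    if m is None:
        return None
    # Exactly one of the two groups takes part in any match.
    return int(m.group(1) or m.group(2))

strings = [
    'Jan24_Package1_USD2_Rest_Of_String',
    'Jan25_Package2_2USD_Rest_Of_String',
    'Jan26_Package3_USD_2_Rest_Of_String',
    'Jan24_Package4_2_USD_Rest_Of_String',
]

amounts = [extract_usd_amount(s) for s in strings]
print(amounts)  # [2, 2, 2, 2]
```

Names without a price come back as None, so they can be filtered out afterwards instead of raising an error.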