Python - Use regex to extract substrings between two markers

Time:08-24

I have a problem that I need help with. I have the below strings and need to do the following:

  1. Extract everything between the equal sign after VALUE and either the END=STRING marker or the closing double quotation mark, whichever comes first.
  2. Group the extracted substrings into a single group.
  3. Do not show the starting and ending markers in the output.
  4. If possible, do not show the backslashes or newlines.

Two samples of the expected result:

STRING database file 2025.01 ABC_ONE ABC_TWO

STRING database file 2025.01 ABC_ONE:12.3456 ABC_TWO:12.3456 ABC_THREE:12.3456 ABC_FOUR:12.3456 ABC_THREE:12.3456 ABC_FOUR:12.3456 ABC_FIVE:12.3456 ABC_SIX:12.3456 ABC_SEVEN:12.3456 ABC_EIGHT:12.3456 ABC_NINE:12.3456 ABC_TEN:12.3456

I will use Python re.finditer to loop through the results I get from regex. Also, re.MULTILINE and re.IGNORECASE will be used.

Link to what I have on regex101: https://regex101.com/r/CwMaEZ/1

Feel free to suggest a different pattern but keep in mind the following:

  1. Groups are needed like how I show in my pattern.
  2. I want to iterate over the results in Python so I prefer re.finditer

Here is the regex I have so far:

(STRING)\s([a-zA-Z0-9/ ._-]+)\s([a-zA-Z0-9/ ._-]+)\s([a-zA-Z0-9/ ._-]+)\s([a-zA-Z0-9/ ._-]+)?\s?\\?\n?(.*VALUE=\s*\"?)

Here are the strings:

STRING database file 2025.01 \
     0123456789ABCD VALUE="ABC_ONE \
     ABC_TWO " END=STRING
     ST=

STRING database file 2025.01 \
     0123456789ABCD VALUE=ABC_ONE \
     ABC_TWO END=STRING
     ST=

STRING database file 2025.01 ABCDEFGH123456 \
     VALUE=ABC_ONE ABC_TWO END=STRING 

STRING database file 2025.01 \
    VALUE=ABC_ONE:12.3456 END=STRING \
    AAAA=ABCDEFGH1234

STRING database file 2025.01 \
    VALUE="ABC_ONE:12.3456 ABC_TWO:12.3456 \
    ABC_THREE:12.3456 ABC_FOUR:12.3456 " \
    END=STRING \

STRING database file 2025.01 \
    0123456789ABCD VALUE="ABC_ONE ABC_TWO " \
    END=STRING 

STRING database file 2025.01 VALUE="ABC_ONE \
    ABC_TWO ABC_THREE END=STRING

STRING database file 2025.01 \
    VALUE="ABC_ONE ABC_TWO ABC_THREE " END=STRING

STRING database file 2025.01 VALUE=
    "ABC_ONE ABC_TWO ABC_THREE " END=STRING \

STRING database file 2025.01 VALUE="ABC_ONE \
    ABC_TWO ABC_THREE " END=STRING

STRING database file 2025.01 VALUE="ABC_ONE ABC_TWO \
    ABC_THREE " END=STRING

STRING database file 2025.01 VALUE="ABC_ONE ABC_TWO ABC_THREE " \

STRING database file 2025.01 \
    VALUE="ABC_ONE:12.3456 ABC_TWO:12.3456 \
    ABC_THREE:12.3456 ABC_FOUR:12.3456 \
    ABC_THREE:12.3456 ABC_FOUR:12.3456 \
    ABC_FIVE:12.3456 ABC_SIX:12.3456 \
    ABC_SEVEN:12.3456 ABC_EIGHT:12.3456 \
    ABC_NINE:12.3456 ABC_TEN:12.3456 \
    ABC_ELEVEN:12.3456 ABC_TWELVE:12.3456 \
    END=STRING

CodePudding user response:

Question askers are usually expected to put in some effort toward a solution. Here is some code to help you get started:

import re
# s holds the input text; join continuation lines (backslash + newline) first
s = s.replace('\\\n', '')
re.findall(r'VALUE="(.*?)\s*(?: " END=STRING|END=STRING)', s, re.M)
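To go a bit further toward what the question asks for (re.finditer, separate header groups, one group for the value, no markers or backslashes in the output), something like the following might work. It is only a sketch: the pattern, the name RECORD, and the helper extract_records are my own assumptions, not anything from the question, and it assumes every record contains a VALUE= field. With re.DOTALL the lazy .*? will run into the next record if one is missing.

```python
import re

# Assumed pattern: four header fields, then the VALUE payload up to the
# first closing quote or END=STRING, whichever comes first.
RECORD = re.compile(
    r'^(STRING)\s+(\S+)\s+(\S+)\s+(\S+)'   # header fields, groups 1-4
    r'.*?VALUE=\s*"?'                      # skip to VALUE= and an optional quote
    r'(.*?)\s*(?:"|END=STRING)',           # payload in group 5, markers excluded
    re.MULTILINE | re.IGNORECASE | re.DOTALL,
)

def extract_records(text):
    """Return one cleaned-up line per record found in text."""
    results = []
    for m in RECORD.finditer(text):
        # Turn continuation backslashes into spaces, then collapse whitespace runs
        payload = " ".join(m.group(5).replace("\\", " ").split())
        results.append(" ".join(m.group(1, 2, 3, 4)) + " " + payload)
    return results

# Two of the records from the question, kept raw so the backslashes survive
sample = r'''STRING database file 2025.01 \
     0123456789ABCD VALUE="ABC_ONE \
     ABC_TWO " END=STRING
     ST=

STRING database file 2025.01 \
    VALUE=ABC_ONE:12.3456 END=STRING \
    AAAA=ABCDEFGH1234
'''

for line in extract_records(sample):
    print(line)
```

Cleaning the payload after the match, rather than calling s.replace first, keeps the original text untouched, so match offsets still point into the real input.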

CodePudding user response:

Firstly, the question isn't laid out very clearly.

That said, my first suggestion is to simply use \S instead of [a-zA-Z0-9/ ._-].
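As a rough illustration of that swap, here are the header groups written both ways against one of the header lines from the question. The + quantifiers are an assumption on my part (the pattern as posted shows none), and I have left the space out of the character class so that it lines up with \S.

```python
import re

# One header line from the question, ending in a continuation backslash
header_line = "STRING database file 2025.01 ABCDEFGH123456 \\"

# Explicit character class (quantifier assumed) versus the \S shorthand
explicit = re.compile(
    r'(STRING)\s([a-zA-Z0-9/._-]+)\s([a-zA-Z0-9/._-]+)\s([a-zA-Z0-9/._-]+)'
)
shorthand = re.compile(r'(STRING)\s(\S+)\s(\S+)\s(\S+)')

print(explicit.match(header_line).groups())
print(shorthand.match(header_line).groups())
```

Both give the same four groups on this line. Keep in mind that \S is more permissive: it also matches characters such as = and ", so if a header field were missing it would happily take something like VALUE="ABC_ONE as a field.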
