I need help with a regex problem. Given the strings below, I need to do the following:
- Extract every substring between the equal sign in VALUE= and either the "END=STRING" marker or the closing double quotation mark.
- Group the extracted substrings into a single group
- Do not show the starting and ending markers in the output
- If possible, do not show the backslashes or newlines
Two samples of the expected result:
STRING database file 2025.01 ABC_ONE ABC_TWO
STRING database file 2025.01 ABC_ONE:12.3456 ABC_TWO:12.3456 ABC_THREE:12.3456 ABC_FOUR:12.3456 ABC_THREE:12.3456 ABC_FOUR:12.3456 ABC_FIVE:12.3456 ABC_SIX:12.3456 ABC_SEVEN:12.3456 ABC_EIGHT:12.3456 ABC_NINE:12.3456 ABC_TEN:12.3456
I will use Python re.finditer to loop through the results I get from regex. Also, re.MULTILINE and re.IGNORECASE will be used.
Link to what I have on regex101: https://regex101.com/r/CwMaEZ/1
Feel free to suggest a different pattern but keep in mind the following:
- The capture groups must be kept as they appear in my pattern.
- I want to iterate over the results in Python, so I prefer re.finditer.
Here is the regex I have so far:
(STRING)\s([a-zA-Z0-9/ ._-]+)\s([a-zA-Z0-9/ ._-]+)\s([a-zA-Z0-9/ ._-]+)\s([a-zA-Z0-9/ ._-]+)?\s?\\?\n?(.*VALUE=\s*\"?)
Here are the strings:
STRING database file 2025.01 \
0123456789ABCD VALUE="ABC_ONE \
ABC_TWO " END=STRING
ST=
STRING database file 2025.01 \
0123456789ABCD VALUE=ABC_ONE \
ABC_TWO END=STRING
ST=
STRING database file 2025.01 ABCDEFGH123456 \
VALUE=ABC_ONE ABC_TWO END=STRING
STRING database file 2025.01 \
VALUE=ABC_ONE:12.3456 END=STRING \
AAAA=ABCDEFGH1234
STRING database file 2025.01 \
VALUE="ABC_ONE:12.3456 ABC_TWO:12.3456 \
ABC_THREE:12.3456 ABC_FOUR:12.3456 " \
END=STRING \
STRING database file 2025.01 \
0123456789ABCD VALUE="ABC_ONE ABC_TWO " \
END=STRING
STRING database file 2025.01 VALUE="ABC_ONE \
ABC_TWO ABC_THREE END=STRING
STRING database file 2025.01 \
VALUE="ABC_ONE ABC_TWO ABC_THREE " END=STRING
STRING database file 2025.01 VALUE=
"ABC_ONE ABC_TWO ABC_THREE " END=STRING \
STRING database file 2025.01 VALUE="ABC_ONE \
ABC_TWO ABC_THREE " END=STRING
STRING database file 2025.01 VALUE="ABC_ONE ABC_TWO \
ABC_THREE " END=STRING
STRING database file 2025.01 VALUE="ABC_ONE ABC_TWO ABC_THREE " \
STRING database file 2025.01 \
VALUE="ABC_ONE:12.3456 ABC_TWO:12.3456 \
ABC_THREE:12.3456 ABC_FOUR:12.3456 \
ABC_THREE:12.3456 ABC_FOUR:12.3456 \
ABC_FIVE:12.3456 ABC_SIX:12.3456 \
ABC_SEVEN:12.3456 ABC_EIGHT:12.3456 \
ABC_NINE:12.3456 ABC_TEN:12.3456 \
ABC_ELEVEN:12.3456 ABC_TWELVE:12.3456 \
END=STRING
CodePudding user response:
Question askers are usually expected to put in some effort toward a solution. Here is some code to help you get started:
import re
s = s.replace('\\\n', '')  # join backslash-continued lines
re.findall(r'VALUE="(.*?)\s*(?: " END=STRING|END=STRING)', s, re.M)
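Building on that, here is a minimal sketch that uses re.finditer, as the question asks, and also accepts unquoted values. The names VALUE_RE and extract_values are made up for illustration. The pattern assumes each value ends at the first closing quote or END=STRING, whichever comes first, and that values never contain a literal backslash that should be kept.

```python
import re

# Hypothetical pattern: one capture group holding the raw value.
VALUE_RE = re.compile(
    r'VALUE=\s*"?'           # start marker plus an optional opening quote
    r'(.*?)'                 # group 1: the value, matched lazily
    r'\s*(?:"|END=STRING)',  # stop at a closing quote or at END=STRING
    re.IGNORECASE | re.DOTALL,
)

def extract_values(text):
    """Return one cleaned value string per VALUE=... occurrence."""
    values = []
    for match in VALUE_RE.finditer(text):
        # Drop line-continuation backslashes, then collapse whitespace runs
        # (including newlines) into single spaces.
        cleaned = match.group(1).replace('\\', ' ')
        values.append(' '.join(cleaned.split()))
    return values
```

Because re.DOTALL lets the lazy group span newlines, the backslash replacement step is not needed up front. The markers stay outside group 1, so they never show up in the output.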
CodePudding user response:
Firstly, you haven't really laid out the question neatly.
But suggestion 1 is to just use \S instead of [a-zA-Z0-9/ ._-].
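To illustrate, here is a rough sketch with \S+ in the header groups. Note that \S never matches a space, while the original class does, so each \S+ captures exactly one whitespace-delimited token. RECORD_RE and summarize are names invented for this example. The sketch assumes every record starts a line with STRING followed by three tokens, and that each record contains VALUE= (otherwise the lazy skip could run into the next record).

```python
import re

# Hypothetical pattern: groups 1-4 are the header tokens, group 5 the value.
RECORD_RE = re.compile(
    r'^(STRING)\s+(\S+)\s+(\S+)\s+(\S+)'  # header: STRING plus three tokens
    r'.*?VALUE=\s*"?'                     # skip ahead to the start marker
    r'(.*?)'                              # group 5: the raw value, lazy
    r'\s*(?:"|END=STRING)',               # closing quote or END=STRING
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)

def summarize(text):
    """Return one 'header tokens + value tokens' line per record."""
    lines = []
    for m in RECORD_RE.finditer(text):
        # Remove continuation backslashes and split the value on whitespace.
        value_tokens = m.group(5).replace('\\', ' ').split()
        lines.append(' '.join([*m.group(1, 2, 3, 4), *value_tokens]))
    return lines
```

Anchoring with ^ under re.MULTILINE keeps the STRING inside END=STRING from being mistaken for the start of a new record.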