I feel I am close to a solution, but I am still struggling to extract some text from strings with a regex. The conditions are:
- The text can only contain uppercase letters or digits
- The text can contain underscores BUT not two consecutive ones
Examples:
test_TEST_TEST_1_TEST_13DAHA bfd --> TEST_TEST_1_TEST_13DAHA
test_TEST_TEST_1_TEST__13DAHA bfd --> TEST_TEST_1_TEST
test__TEST_TEST --> TEST_TEST
test_TEST__DHJF --> TEST
test_TEST__Ddsa --> TEST
test__TEST --> TEST
So far, I have
_([0-9A-Z_]+)
which works for the first example but not the second, and
_([0-9A-Z_]+)(?:__.*)
which works for the second example but not the first.
CodePudding user response:
You say you don't want the text after a double underscore once something has already matched. In other words, after capturing the expected pattern, you can consume and discard any letters, digits, or underscores that follow two or more underscores:
_([0-9A-Z]+(?:_[0-9A-Z]+)*)(?:__[0-9A-Z_]+)?
The optional (?:__[0-9A-Z_]+)? part consumes the text that comes right after your match, so the regex cannot start a new match inside it.
You need the Group 1 values rather than the whole match values, so re.findall suits best here, especially if you expect multiple matches. If you expect a single match per string, use re.search.
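To illustrate what the optional trailing group does, here is a minimal sketch that compares the pattern with and without it. The names `with_tail` and `without_tail` are made up for this comparison, and the second pattern is only a hypothetical variant, not something proposed in the question.

```python
import re

# Pattern from this answer: the optional group swallows a "__..." tail.
with_tail = re.compile(r'_([0-9A-Z]+(?:_[0-9A-Z]+)*)(?:__[0-9A-Z_]+)?')
# Hypothetical variant for comparison only: same pattern, no optional group.
without_tail = re.compile(r'_([0-9A-Z]+(?:_[0-9A-Z]+)*)')

text = "test_TEST__DHJF"
print(with_tail.findall(text))     # ['TEST']
print(without_tail.findall(text))  # ['TEST', 'DHJF']
```

Without the optional group, the scan resumes at the second underscore and picks up DHJF as a separate match, which is exactly what the question wants to avoid.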
See the Python demo:
import re

cases = [
    "test_TEST bfd",
    "test_TEST_TEST_1_TEST_13DAHA bfd",
    "test_TEST_TEST_1_TEST__13DAHA bfd",
    "test__TEST_TEST",
    "test_TEST__DHJF",
    "test_TEST__Ddsa",
]
pattern = re.compile(r'_([0-9A-Z]+(?:_[0-9A-Z]+)*)(?:__[0-9A-Z_]+)?')

# re.findall returns a list of all Group 1 values found in the string
for case in cases:
    matches = pattern.findall(case)
    print(matches)

# re.search returns only the first match; read Group 1 from it
for case in cases:
    match = pattern.search(case)
    if match:
        print(match.group(1))
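Output of the re.search loop (the re.findall loop prints the same values, each wrapped in a one-element list):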
TEST
TEST_TEST_1_TEST_13DAHA
TEST_TEST_1_TEST
TEST_TEST
TEST
TEST
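As a quick check, the pattern can be run against all six examples from the question, including test__TEST, which is not in the demo list. This is only a sketch: the helper name `extract` and the `expected` table are assumptions introduced here for illustration.

```python
import re

PATTERN = re.compile(r'_([0-9A-Z]+(?:_[0-9A-Z]+)*)(?:__[0-9A-Z_]+)?')

def extract(text):
    """Return the first Group 1 value, or None when nothing matches."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

# Input -> expected result, copied from the examples in the question.
expected = {
    "test_TEST_TEST_1_TEST_13DAHA bfd": "TEST_TEST_1_TEST_13DAHA",
    "test_TEST_TEST_1_TEST__13DAHA bfd": "TEST_TEST_1_TEST",
    "test__TEST_TEST": "TEST_TEST",
    "test_TEST__DHJF": "TEST",
    "test_TEST__Ddsa": "TEST",
    "test__TEST": "TEST",
}

for text, want in expected.items():
    got = extract(text)
    print(f"{text!r} -> {got!r}", "ok" if got == want else "MISMATCH")
```

Returning None instead of raising keeps the helper safe for strings that contain no underscore-prefixed uppercase token at all.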
CodePudding user response:
Use this pattern:
[A-Z0-9]+(?=_)(?:_[A-Z0-9]+)*
Sample script:
import re

inp = ["test_TEST_TEST_1_TEST_13DAHA bfd", "test_TEST_TEST_1_TEST__13DAHA bfd", "test__TEST_TEST", "test_TEST__DHJF"]
for i in inp:
    # findall returns whole matches here because the pattern has no capturing group
    matches = re.findall(r'[A-Z0-9]+(?=_)(?:_[A-Z0-9]+)*', i)
    print(i + " => " + matches[0])
This prints:
test_TEST_TEST_1_TEST_13DAHA bfd => TEST_TEST_1_TEST_13DAHA
test_TEST_TEST_1_TEST__13DAHA bfd => TEST_TEST_1_TEST
test__TEST_TEST => TEST_TEST
test_TEST__DHJF => TEST
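One thing to watch, assuming the pattern reads [A-Z0-9]+(?=_)(?:_[A-Z0-9]+)*: the lookahead (?=_) requires an underscore directly after the first uppercase/digit run, so an input such as test__TEST from the question produces no match and matches[0] would raise IndexError. The sketch below compares it with a lookbehind variant; that variant and the helper `first` are assumptions added here for illustration, not part of the answer.

```python
import re

# Pattern from this answer (lookahead after the first run).
lookahead = re.compile(r'[A-Z0-9]+(?=_)(?:_[A-Z0-9]+)*')
# Hypothetical alternative: require an underscore *before* the token instead.
lookbehind = re.compile(r'(?<=_)[A-Z0-9]+(?:_[A-Z0-9]+)*')

def first(pattern, text):
    """Return the first whole match, or None when there is none."""
    match = pattern.search(text)
    return match.group(0) if match else None

for text in ["test_TEST__DHJF", "test__TEST"]:
    print(text, "=>", first(lookahead, text), "|", first(lookbehind, text))
# test_TEST__DHJF => TEST | TEST
# test__TEST => None | TEST
```

Using re.search and checking for None avoids indexing into an empty findall result.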