Extract text before two underscores with regex

I feel I am not far from a solution, but I am still struggling to extract some text from strings with a regex. The conditions are:

The text can only contain uppercase letters or digits
The text can contain underscores, BUT not two consecutive ones

Examples:

test_TEST_TEST_1_TEST_13DAHA bfd     -->   TEST_TEST_1_TEST_13DAHA
test_TEST_TEST_1_TEST__13DAHA bfd    -->   TEST_TEST_1_TEST
test__TEST_TEST                      -->   TEST_TEST
test_TEST__DHJF                      -->   TEST
test_TEST__Ddsa                      -->   TEST
test__TEST                           -->   TEST

So far, I got

_([0-9A-Z_]+) works for the first one but not the second
_([0-9A-Z_]+)(?:__.*) works for the second one but not the first
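
To see why each attempt falls short, here is a minimal sketch that runs both against the first two examples. It assumes the quantifier in both attempts is `+`; the variable names are only illustrative.

```python
import re

first = "test_TEST_TEST_1_TEST_13DAHA bfd"
second = "test_TEST_TEST_1_TEST__13DAHA bfd"

attempt_1 = re.compile(r'_([0-9A-Z_]+)')
attempt_2 = re.compile(r'_([0-9A-Z_]+)(?:__.*)')

# Attempt 1: the character class includes "_", so the match runs
# straight through the double underscore instead of stopping before it.
over_match = attempt_1.search(second).group(1)
print(over_match)

# Attempt 2: the "__" part is mandatory, so a string without a
# double underscore cannot match at all.
no_match = attempt_2.search(first)
print(no_match)
```

So the underscore has to be taken out of the character class, and the double-underscore tail has to be optional.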

CodePudding user response：

You say that you don't want the text that follows a double underscore once something has already matched before it. In that case, capture the expected pattern first, then consume and discard whatever letters, digits and underscores come after the double underscore:

_([0-9A-Z]+(?:_[0-9A-Z]+)*)(?:__[0-9A-Z_]+)?

The optional (?:__[0-9A-Z_]+)? part consumes the text that comes right after your match, which keeps that text from being matched on its own.
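
As a rough illustration, the same pattern can be written with re.VERBOSE so each piece carries a comment. The second pattern, without_tail, is a hypothetical variant added here only to show what the optional tail prevents.

```python
import re

pattern = re.compile(r'''
    _                     # a single underscore just before the wanted text
    (                     # group 1: the text to extract
      [0-9A-Z]+           #   a run of uppercase letters or digits
      (?:_[0-9A-Z]+)*     #   then any number of "_" + run, so "__" never fits inside
    )
    (?:__[0-9A-Z_]+)?     # optionally swallow "__" and the uppercase text after it
''', re.VERBOSE)

# Hypothetical variant without the optional tail, for comparison only.
without_tail = re.compile(r'_([0-9A-Z]+(?:_[0-9A-Z]+)*)')

text = "test_TEST_TEST_1_TEST__13DAHA bfd"
with_tail_result = pattern.findall(text)
without_tail_result = without_tail.findall(text)
print(with_tail_result)      # only the part before "__"
print(without_tail_result)   # "13DAHA" shows up as a second match
```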

You need the Group 1 values rather than the whole match values, so re.findall suits best here (it returns the captured group when the pattern has exactly one), especially if you expect multiple matches. If you expect a single match per string, use re.search.

See the Python demo:

import re
cases = ["test_TEST bfd",
    "test_TEST_TEST_1_TEST_13DAHA bfd",
    "test_TEST_TEST_1_TEST__13DAHA bfd",
    "test__TEST_TEST",
    "test_TEST__DHJF",
    "test_TEST__Ddsa"
]
pattern = re.compile(r'_([0-9A-Z]+(?:_[0-9A-Z]+)*)(?:__[0-9A-Z_]+)?')
for case in cases:
    matches = pattern.findall(case)
    print(matches)

With re.search:

for case in cases:
    match = pattern.search(case)
    if match:
        print(match.group(1))

Output of the re.search loop:

TEST
TEST_TEST_1_TEST_13DAHA
TEST_TEST_1_TEST
TEST_TEST
TEST
TEST
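
As a quick sanity check, the sketch below compares the pattern against all six input/expected pairs listed in the question, including test__TEST, which is not in the demo list. The helper name extract and the dictionary expected are assumptions made for this example.

```python
import re

pattern = re.compile(r'_([0-9A-Z]+(?:_[0-9A-Z]+)*)(?:__[0-9A-Z_]+)?')

def extract(text):
    """Return Group 1 of the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None

# Input -> expected result, copied from the examples in the question.
expected = {
    "test_TEST_TEST_1_TEST_13DAHA bfd": "TEST_TEST_1_TEST_13DAHA",
    "test_TEST_TEST_1_TEST__13DAHA bfd": "TEST_TEST_1_TEST",
    "test__TEST_TEST": "TEST_TEST",
    "test_TEST__DHJF": "TEST",
    "test_TEST__Ddsa": "TEST",
    "test__TEST": "TEST",
}

results = {text: extract(text) for text in expected}
for text, got in results.items():
    status = "ok" if got == expected[text] else "MISMATCH"
    print(f"{text!r:40} -> {got!r:28} {status}")
```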

CodePudding user response：

Use this pattern:

[A-Z0-9]+(?=_)(?:_[A-Z0-9]+)*

Sample script:

import re

inp = ["test_TEST_TEST_1_TEST_13DAHA bfd", "test_TEST_TEST_1_TEST__13DAHA bfd", "test__TEST_TEST", "test_TEST__DHJF"]
for i in inp:
    # findall returns whole matches here because the pattern has no capturing group
    matches = re.findall(r'[A-Z0-9]+(?=_)(?:_[A-Z0-9]+)*', i)
    print(i + " => " + matches[0])

This prints:

test_TEST_TEST_1_TEST_13DAHA bfd => TEST_TEST_1_TEST_13DAHA
test_TEST_TEST_1_TEST__13DAHA bfd => TEST_TEST_1_TEST
test__TEST_TEST => TEST_TEST
test_TEST__DHJF => TEST
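
One caveat worth checking, assuming the pattern reads as above with `+` quantifiers: the lookahead (?=_) requires an underscore right after the first uppercase run, so an input such as test__TEST from the question yields no match, and indexing matches[0] would raise IndexError. A minimal guarded sketch follows; first_match is a hypothetical helper name.

```python
import re

pattern = re.compile(r'[A-Z0-9]+(?=_)(?:_[A-Z0-9]+)*')

def first_match(text):
    """Return the first whole match, or None when the pattern finds nothing."""
    found = pattern.search(text)
    return found.group(0) if found else None

with_underscore = first_match("test_TEST__DHJF")   # uppercase run followed by "_"
without_underscore = first_match("test__TEST")     # uppercase run ends the string
print(with_underscore)
print(without_underscore)
```

If inputs without a trailing underscore need to be handled, the pattern from the first answer covers them.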