Extract text before two underscores with regex

Time:01-18

I feel I am not far from a solution but I still struggle to extract some text from variables with Regex. The conditions are:

  • The text can only contain uppercase letters or digits
  • The text can contain underscores, BUT not two consecutive ones

Examples:

test_TEST_TEST_1_TEST_13DAHA bfd     -->   TEST_TEST_1_TEST_13DAHA
test_TEST_TEST_1_TEST__13DAHA bfd    -->   TEST_TEST_1_TEST
test__TEST_TEST                      -->   TEST_TEST
test_TEST__DHJF                      -->   TEST
test_TEST__Ddsa                      -->   TEST
test__TEST                           -->   TEST

So far, I got

_([0-9A-Z_]+) works for the first one but not the second
_([0-9A-Z_]+)(?:__.*) works for the second one but not the first

CodePudding user response:

You say that you don't want the text after two underscores once the expected pattern has already matched. That means you can capture the expected pattern first, then consume and discard any letters, digits, or underscores that follow a double underscore:

_([0-9A-Z]+(?:_[0-9A-Z]+)*)(?:__[0-9A-Z_]+)?

See the regex demo. The optional (?:__[0-9A-Z_]+)? part consumes the text that comes right after your match, so the regex engine does not try to match it again as a separate result.
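To make the role of each part easier to see, here is a minimal sketch that writes the same pattern with re.VERBOSE and compares it against a variant without the optional tail. The names PATTERN, NO_TAIL and sample are only illustrative, and NO_TAIL is a hypothetical variant used for comparison, not something from the question.

```python
import re

# The pattern from this answer, spread out with re.VERBOSE so each part can be commented.
PATTERN = re.compile(r"""
    _                      # single underscore that introduces the token
    (                      # group 1: the text to extract
        [0-9A-Z]+          #   a run of uppercase letters or digits
        (?:_[0-9A-Z]+)*    #   more runs, each joined by exactly one underscore
    )
    (?:__[0-9A-Z_]+)?      # optional tail after a double underscore, consumed but not captured
""", re.VERBOSE)

# Hypothetical variant without the optional tail, for comparison only.
NO_TAIL = re.compile(r'_([0-9A-Z]+(?:_[0-9A-Z]+)*)')

sample = "test_TEST__DHJF"
with_tail = PATTERN.findall(sample)
without_tail = NO_TAIL.findall(sample)
print(with_tail)     # ['TEST']
print(without_tail)  # ['TEST', 'DHJF']
```

Without the tail, findall resumes scanning at the double underscore and picks up DHJF as a second result, which is exactly what the question wants to avoid.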

You need to get Group 1 values rather than the whole match values, hence re.findall suits best here, especially if you expect multiple matches. If you expect a single match per string, use re.search.

See the Python demo:

import re
cases = ["test_TEST bfd",
    "test_TEST_TEST_1_TEST_13DAHA bfd",
    "test_TEST_TEST_1_TEST__13DAHA bfd",
    "test__TEST_TEST",
    "test_TEST__DHJF",
    "test_TEST__Ddsa"
]
pattern = re.compile(r'_([0-9A-Z]+(?:_[0-9A-Z]+)*)(?:__[0-9A-Z_]+)?')
for case in cases:
    matches = pattern.findall(case)
    print(matches)

With re.search:

for case in cases:
    match = pattern.search(case)
    if match:
        print(match.group(1))

Output:

TEST
TEST_TEST_1_TEST_13DAHA
TEST_TEST_1_TEST
TEST_TEST
TEST
TEST
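If you only ever need one value per string, the re.search variant can be wrapped in a small helper and checked against every example from the question. This is just a sketch: extract and examples are hypothetical names, and returning None when nothing matches is an assumption about what you would want in that case.

```python
import re

PATTERN = re.compile(r'_([0-9A-Z]+(?:_[0-9A-Z]+)*)(?:__[0-9A-Z_]+)?')

def extract(text):
    """Return Group 1 of the first match, or None if nothing matches."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

# Inputs and expected results copied from the question.
examples = {
    "test_TEST_TEST_1_TEST_13DAHA bfd": "TEST_TEST_1_TEST_13DAHA",
    "test_TEST_TEST_1_TEST__13DAHA bfd": "TEST_TEST_1_TEST",
    "test__TEST_TEST": "TEST_TEST",
    "test_TEST__DHJF": "TEST",
    "test_TEST__Ddsa": "TEST",
    "test__TEST": "TEST",
}

results = {text: extract(text) for text in examples}
for text, expected in examples.items():
    assert results[text] == expected, (text, results[text])
```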

CodePudding user response:

Use this pattern:

[A-Z0-9]+(?=_)(?:_[A-Z0-9]+)*

Sample script:

import re

inp = ["test_TEST_TEST_1_TEST_13DAHA bfd", "test_TEST_TEST_1_TEST__13DAHA bfd", "test__TEST_TEST", "test_TEST__DHJF"]
for i in inp:
    # findall returns whole matches here because the pattern has no capturing group
    matches = re.findall(r'[A-Z0-9]+(?=_)(?:_[A-Z0-9]+)*', i)
    print(i + " => " + matches[0])

This prints:

test_TEST_TEST_1_TEST_13DAHA bfd => TEST_TEST_1_TEST_13DAHA
test_TEST_TEST_1_TEST__13DAHA bfd => TEST_TEST_1_TEST
test__TEST_TEST => TEST_TEST
test_TEST__DHJF => TEST
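One thing to watch with this pattern: the (?=_) lookahead requires an underscore right after the first run of characters, so an input such as test__TEST from the question appears to produce no match at all, and matches[0] would then raise an IndexError. A minimal sketch of a guard, with first_match as a hypothetical helper name:

```python
import re

pattern = re.compile(r'[A-Z0-9]+(?=_)(?:_[A-Z0-9]+)*')

def first_match(text):
    """Return the first match, or None instead of raising IndexError."""
    matches = pattern.findall(text)
    return matches[0] if matches else None

print(first_match("test_TEST__DHJF"))  # TEST
print(first_match("test__TEST"))       # None
```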