Find strings only optionally containing a specific char at the end

Time:05-21

I want to find strings that do not contain the 。 character, except for an optional occurrence at the very end of the string.

I searched for tips and found the pattern below, but it didn't solve my problem.

^(?!\.)(?!.*\.$)(?!.*\.\.)[a-zA-Z0-9_.]+$
(?!\.) - don't allow . at start
(?!.*\.\.) - don't allow 2 consecutive dots
(?!.*\.$) - don't allow . at end

I tried:

import re

str_l = ["aaa。bbb。", "aaa。", "aaa"]
for str1 in str_l:
    res1 = re.search(r'(.*?!。*$)', str1)  # intended: True if 。 is not in the string
    res2 = re.search(r'(?<!(。)。$)', str1)  # intended: True if 。 appears only at the end of the string (does not work)
    print(res1, res2)

I want to combine res1 and res2 into a single regex, so that the three strings give False, True, True.

CodePudding user response:

You can use

import re
str_l  = ["aaa。bbb。","aaa。","aaa"]
for str1 in str_l:
  print(str1, '=>', bool(re.search(r'^[^。]*。?$', str1)))

Output:

aaa。bbb。 => False
aaa。 => True
aaa => True

Details:

  • ^ - start of string
  • [^。]* - zero or more characters other than 。
  • 。? - an optional 。
  • $ - end of string.
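
The same pattern can be wrapped in a small helper that works for any single character. This is only a sketch: the name only_at_end is made up for illustration, and it uses re.fullmatch instead of ^...$ because in Python $ also matches just before a trailing newline.

```python
import re

def only_at_end(text, ch="。"):
    """Return True if ch is absent from text or occurs only as its last character."""
    if len(ch) != 1:
        raise ValueError("ch must be a single character")
    c = re.escape(ch)  # keeps characters such as ']' or '^' safe inside the class
    # [^c]* : any run of characters other than ch; c? : one optional trailing ch
    return re.fullmatch(f"[^{c}]*{c}?", text) is not None

for s in ["aaa。bbb。", "aaa。", "aaa"]:
    print(s, "=>", only_at_end(s))
```

For the three strings from the question this prints False, True, True.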

To obtain the valid strings from the list using this regex, you can use

rx = re.compile(r'^[^。]*。?$')
print( list(filter(rx.search, str_l)) )
# => ['aaa。', 'aaa']

CodePudding user response:

This can be done with the following code.

import re

p = re.compile("^(?:(?!。).)*(。$)?(?!.*。).*$")

l = [
    "aaa。bbb。",
    "aaa bbb。",  # matches because only at end
    "aaa。bbb",
    "。aaa bbb",
    "aaa bbb",  # matches because none found
]

print([s for s in l if p.match(s)])

Which results in:

['aaa bbb。', 'aaa bbb']

The full explanation can be found here at regex101.com.

The only advantage of this expression over the much terser ^[^。]*。?$ is that it also works with multi-character strings, not just a single character. Say you need to match strings that may end with "foo", but "foo" must not appear anywhere earlier in the string. Then you could use ^(?:(?!foo).)*(foo$)?(?!.*foo).*$.
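
As a rough illustration of the "foo" case, the pattern can be built for an arbitrary token. The helper name trailing_only and the sample strings are assumptions made for this sketch; re.escape is used so that tokens containing regex metacharacters are treated literally.

```python
import re

def trailing_only(token):
    """Compile a pattern matching strings where token is absent or appears only at the end."""
    t = re.escape(token)
    return re.compile(f"^(?:(?!{t}).)*({t}$)?(?!.*{t}).*$")

p = trailing_only("foo")
samples = ["barfoo", "foobar", "bar", "foobarfoo"]

# Keep only the strings where "foo" is missing or is the final three characters.
print([s for s in samples if p.match(s)])
# => ['barfoo', 'bar']
```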

However, it is slower: in the timing below it takes about 60% longer than the terse pattern. Here are the test and the results:

import re
import timeit

a = re.compile("^(?:(?!。).)*(。$)?(?!.*。).*$")
b = re.compile("^[^。]*。?$")

l = [
    "aaa。bbb。",
    "aaa bbb。",  # matches because only at end
    "aaa。bbb",
    "。aaa bbb",
    "aaa bbb",  # matches because none found
]

print(
    timeit.timeit(
        "matches = [s for s in l if a.match(s)]",
        setup="from __main__ import (l, a)",
    )
)

print(
    timeit.timeit(
        "matches = [s for s in l if b.match(s)]",
        setup="from __main__ import (l, b)",
    )
)

Which gives:

2.6208932230000004
1.6510743480000003
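
The "about 60%" figure can be checked from the two timings above. This short calculation uses only the numbers printed in that run, which will differ from machine to machine; it comes out at roughly 59% extra time for the lookahead-based pattern.

```python
slow = 2.6208932230000004  # lookahead-based pattern (a)
fast = 1.6510743480000003  # negated character class pattern (b)

# Extra time taken by the slower pattern, relative to the faster one.
extra_pct = (slow / fast - 1) * 100
print(f"{extra_pct:.1f}% longer")
```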

CodePudding user response:

Another approach is to split on 。.

If you use split and the 。 is at the end of the string, the last item in the list will be an empty string.

If 。 does not occur at all, the list has exactly one item.

str_l = ["aaa。bbb。", "aaa。", "aaa", "。", "。  ", "。。"]

for str1 in str_l:
    lst = str1.split(r"。")
    nr = len(lst)
    print(f"'{str1}' -> {nr == 1 or nr == 2 and lst[1] == ''}")

Output:

'aaa。bbb。' -> False
'aaa。' -> True
'aaa' -> True
'。' -> True
'。  ' -> False
'。。' -> False
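
A closely related variant, offered here only as a sketch, uses str.partition instead of split. partition cuts at the first occurrence of the separator, so the part after it is empty exactly when the separator is missing or sits at the very end. The function name sep_only_at_end is hypothetical.

```python
def sep_only_at_end(text, sep="。"):
    """Return True if sep is absent from text or occurs only at the very end."""
    # partition returns (head, sep, tail); tail is '' when sep is absent
    # or when the first occurrence of sep is the end of the string.
    _head, _sep, tail = text.partition(sep)
    return tail == ""

str_l = ["aaa。bbb。", "aaa。", "aaa", "。", "。  ", "。。"]
for str1 in str_l:
    print(f"'{str1}' -> {sep_only_at_end(str1)}")
```

This gives the same True/False results as the split version for the six strings above, and it also works when sep is longer than one character.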

