Find strings only optionally containing a specific char at the end

I want to find strings that have no 。 char in them, with an optional occurrence of this character at the end of the string.

I searched for tips and found the following, but it didn't solve my problem.

^(?!\.)(?!.*\.$)(?!.*\.\.)[a-zA-Z0-9_.]+$
(?!\.) - don't allow . at start
(?!.*\.\.) - don't allow 2 consecutive dots
(?!.*\.$) - don't allow . at end

I tried to use

import re

str_l = ["aaa。bbb。","aaa。","aaa"]
for str1 in str_l:
  res1 = re.search(r'(.*?!。*$)', str1) #if 。not in string, return True
  res2 = re.search(r'(?<!(。)。$)',str1) # if 。 only appear at the end of string, return True, but not solved
  print(res1,res2)

I want to combine res1 and res2 into one regex, so that the results for these strings are False, True, True.

CodePudding user response：

You can use

import re
str_l  = ["aaa。bbb。","aaa。","aaa"]
for str1 in str_l:
  print(str1, '=>', bool(re.search(r'^[^。]*。?$', str1)))

Output:

aaa。bbb。 => False
aaa。 => True
aaa => True

See the Python demo. Details:

^ - start of string
[^。]* - zero or more chars other than 。
。? - an optional 。
$ - at the end of string.

To obtain the valid strings from the list using this regex, you can use

rx = re.compile(r'^[^。]*。?$')
print( list(filter(rx.search, str_l)) )
# => ['aaa。', 'aaa']
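
If the character to check varies, the same pattern can be built at runtime. The following is only a sketch: the helper name make_checker is made up for illustration, and it swaps ^...$ with re.search for re.fullmatch, which is not what the answer above uses.

```python
import re

def make_checker(ch):
    # Hypothetical helper: build the pattern for any single character.
    # re.escape keeps characters such as ^ or ] literal inside the class.
    esc = re.escape(ch)
    rx = re.compile(f"[^{esc}]*{esc}?")
    # fullmatch anchors at both ends, so ^ and $ are not needed here.
    return lambda s: rx.fullmatch(s) is not None

ok = make_checker("。")
print([ok(s) for s in ["aaa。bbb。", "aaa。", "aaa"]])
# => [False, True, True]
```

One difference to be aware of: $ also matches just before a trailing newline, while re.fullmatch requires the pattern to consume the whole string.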

CodePudding user response：

This can be done with the following code.

import re

p = re.compile("^(?:(?!。).)*(。$)?(?!.*。).*$")

l = [
    "aaa。bbb。",
    "aaa bbb。",  # matches because only at end
    "aaa。bbb",
    "。aaa bbb",
    "aaa bbb",  # matches because none found
]

print([s for s in l if p.match(s)])

Which results in:

['aaa bbb。', 'aaa bbb']

A full explanation of the pattern can be found at regex101.com.

The only advantage of this expression over the much more terse ^[^。]*。?$ is that it also works with multi-character strings, not just a single character. For example, say you need to match strings that may end with "foo", but "foo" must not appear anywhere earlier in the string. Then you could use ^(?:(?!foo).)*(foo$)?(?!.*foo).*$.
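
As a rough sketch of that generalization, the pattern can be assembled for any literal substring with re.escape. The helper name build_pattern and the sample strings are made up for illustration and are not part of the answer above.

```python
import re

def build_pattern(sub):
    # Hypothetical helper: plug an arbitrary literal substring into the
    # lookahead-based pattern; re.escape keeps regex metacharacters literal.
    esc = re.escape(sub)
    return re.compile(f"^(?:(?!{esc}).)*({esc}$)?(?!.*{esc}).*$")

p = build_pattern("foo")
samples = ["barfoo", "bar", "foobar", "foobarfoo", "barfoofoo"]
print([s for s in samples if p.match(s)])
# => ['barfoo', 'bar']
```

Since . does not match a newline by default, input containing line breaks would probably need re.DOTALL or a different approach.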

However, it is about 60% slower in the following test:

import re
import timeit

a = re.compile("^(?:(?!。).)*(。$)?(?!.*。).*$")
b = re.compile("^[^。]*。?$")

l = [
    "aaa。bbb。",
    "aaa bbb。",  # matches because only at end
    "aaa。bbb",
    "。aaa bbb",
    "aaa bbb",  # matches because none found
]

print(
    timeit.timeit(
        "matches = [s for s in l if a.match(s)]",
        setup="from __main__ import (l, a)",
    )
)

print(
    timeit.timeit(
        "matches = [s for s in l if b.match(s)]",
        setup="from __main__ import (l, b)",
    )
)

Which gives:

2.6208932230000004
1.6510743480000003

CodePudding user response：

Another approach is to split on 。

If you use split and there is a 。 at the end of the string, the last item in the list will be empty.

If 。 does not occur at all, the list has exactly one item.

str_l = ["aaa。bbb。", "aaa。", "aaa", "。", "。  ", "。。"]

for str1 in str_l:
    lst = str1.split(r"。")
    nr = len(lst)
    print(f"'{str1}' -> {nr == 1 or (nr == 2 and lst[1] == '')}")

Output

'aaa。bbb。' -> False
'aaa。' -> True
'aaa' -> True
'。' -> True
'。  ' -> False
'。。' -> False

See a Python demo.
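
The same condition can also be written without split or a regex: the string is valid when 。 does not occur anywhere before the last character. A minimal sketch, assuming a single-character marker (the function name only_at_end is hypothetical):

```python
def only_at_end(s, ch="。"):
    # Hypothetical helper: drop the last character, then require that
    # ch is absent from everything that remains.
    return ch not in s[:-1]

str_l = ["aaa。bbb。", "aaa。", "aaa", "。", "。  ", "。。"]
print([only_at_end(s) for s in str_l])
# => [False, True, True, True, False, False]
```

This gives the same results as the split version for the sample list, but it only works as written for a one-character marker.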