Find how many words in a sentence meet specific conditions

I have a recruiting task that I want to solve using only regex.

A sentence is made of a group of words. Each word is made of letters [a-zA-Z], may contain one or more dashes, and may end in a punctuation mark such as a period (.), comma (,), question mark (?) or exclamation mark (!). A word cannot start with a digit or any other non-letter character. A single letter, separated by whitespace, is also accepted.

Dashes join two or more words into one and should be accepted (but two or more consecutive dashes, such as "--", should not), while other valid punctuation marks at the end of the word should be stripped.

Valid words (examples):

"foo-foo?!.,": result = "foo-foo",
"f-foo-foo?!.,": result = "f-foo-foo",

Invalid words (examples):

"!@foo-foo{{}}}(("
"foo--foo"
"f-foo@-@foo"
"f123-foo123-foo-"
"-f-foo-foo-"

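The rules above can be checked one token at a time. The following is only a sketch and not part of the original task: the helper name clean_word and the pattern WORD_RE are hypothetical, and it assumes the allowed trailing punctuation is exactly . , ? and !.

```python
import re
from typing import Optional

# Assumed rule: runs of letters joined by single dashes,
# followed by any number of trailing punctuation marks.
WORD_RE = re.compile(r'([a-zA-Z]+(?:-[a-zA-Z]+)*)[.,?!]*')


def clean_word(token: str) -> Optional[str]:
    """Return the word without trailing punctuation, or None if the token is invalid."""
    match = WORD_RE.fullmatch(token)
    return match.group(1) if match else None


# Tokens taken from the valid and invalid examples above.
for token in ['foo-foo?!.,', 'f-foo-foo?!.,', 'foo--foo', '-f-foo-foo-']:
    print(repr(token), '->', clean_word(token))
```

Using fullmatch means a token is rejected as a whole if any part of it breaks the rules, which is what the invalid examples call for.
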
I tried to solve the problem in python with only regex:

import re

TESTSTR1 = 'there should be 9 valid  words, including: a well-behave, right?'
TESTSTR2 = 'blabla! bla121 {{blabla123bla.. bla-blablabla!! b;a-bla@!. blabla bla-bla-bla-bla **bla-bla'
TESTSTR3 = '{{)foo! ~~foo121 foo--foo?. foo-foo?!{. @foo-foo! f 23 foo2 f-ff-fff-ffff!.,?  **foo-f'

TESTSTR1_EXPECTED = ['there', 'should', 'be', 'valid', 'words', 'including', 'a', 'well-behave','right']
TESTSTR2_EXPECTED = ['blabla', 'bla-blablabla', 'blabla', 'bla-bla-bla-bla', 'bla-bla']
TESTSTR3_EXPECTED = ['f', 'f-ff-fff-ffff','foo-f']


def find_words(sentence: str) -> list:
    pattern_dash = r'\b([^\d\s]+(?:-\w+[a-zA-Z]*))\b'
    pattern = r'\b(?!\w+-\w+)(?!-\w+)[a-zA-Z]+\b'

    words = re.findall(pattern_dash, sentence)
    words += re.findall(pattern, sentence)

    return words


if __name__ == "__main__":
    print('====================== TEST1 ======================')
    print(f'Expected "TESTSTR1" = {TESTSTR1_EXPECTED}')
    print(f'Result "TESTSTR1"   = {find_words(TESTSTR1)}')
        
    print('====================== TEST2 ======================')
    print(f'Expected "TESTSTR2" = {TESTSTR2_EXPECTED}')
    print(f'Result "TESTSTR2"   = {find_words(TESTSTR2)}')

    print('====================== TEST3 ======================')
    print(f'Expected "TESTSTR3" = {TESTSTR3_EXPECTED}')
    print(f'Result "TESTSTR3"   = {find_words(TESTSTR3)}')

First I wanted to find all valid words that contain the dash symbol ("pattern_dash"), and then all other valid words (excluding those already found).

I tried many different combinations of regexes but without success. I am not sure if the task is solvable using only regex.

Does anyone know if it is possible to solve it using only regex? Do you have any idea how to do it?

Many thanks

CodePudding user response：

To get the matches in the example data, you can use a capture group.

First match either a space or *, then capture the words with only A-Za-z, optionally separated by -, and assert that the words either end with a space, the end of the string, or 1 or more punctuation characters that are followed by a right hand whitespace boundary.

(?:[ *]|^)([a-zA-Z]+(?:-[a-zA-Z]+)*)(?= |$|[.,!?:]+(?!\S))

In parts the pattern matches:

(?:[ *]|^) Non capture group, match either a space or * or assert the start of the string
( Capture group 1
- [a-zA-Z]+ Match 1 or more occurrences of A-Za-z
- (?:-[a-zA-Z]+)* Optionally repeat the same again preceded by a -
) Close group 1
(?= Positive lookahead, assert that directly to the right is
- Match a space
- | Or
- $ Assert the end of the string
- | Or
- [.,!?:]+(?!\S) Match 1 or more occurrences of a character from the character class [.,!?:] and assert a whitespace boundary to the right
) Close lookahead
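
The same breakdown can be kept next to the pattern by compiling it in verbose mode. This is a sketch under the assumption that the letter and punctuation classes carry + quantifiers, as the "1 or more occurrences" wording in the description implies. The name PATTERN and the sample sentence are made up for illustration. In verbose mode unescaped whitespace is ignored outside character classes, so the literal space in the lookahead is written as [ ].

```python
import re

PATTERN = re.compile(r"""
    (?:[ *]|^)              # a space or an asterisk, or the start of a line
    (                       # capture group 1: the word itself
      [a-zA-Z]+             #   one or more letters
      (?:-[a-zA-Z]+)*       #   optionally more letter runs, each after a single dash
    )
    (?=                     # lookahead: directly to the right there must be
      [ ]                   #   a space,
      | $                   #   or the end of the line,
      | [.,!?:]+(?!\S)      #   or punctuation followed by whitespace or the end
    )
""", re.VERBOSE | re.MULTILINE)

print(PATTERN.findall("a well-behave, right? foo--foo bla121"))
```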

For example

import re

strings = [
     "there should be 9 valid  words, including: a well-behave, right?",
     "blabla! bla121 {{blabla123bla.. bla-blablabla!! b;a-bla@!. blabla bla-bla-bla-bla **bla-bla",
     "{{)foo! ~~foo121 foo--foo?. foo-foo?!{. @foo-foo! f 23 foo2 f-ff-fff-ffff!.,?  **foo-f"
]

pattern = r"(?:[ *]|^)([a-zA-Z]+(?:-[a-zA-Z]+)*)(?= |$|[.,!?:]+(?!\S))"
for s in strings:
    print(re.findall(pattern, s, re.M))

Output

['there', 'should', 'be', 'valid', 'words', 'including', 'a', 'well-behave', 'right']
['blabla', 'bla-blablabla', 'blabla', 'bla-bla-bla-bla', 'bla-bla']
['f', 'f-ff-fff-ffff', 'foo-f']
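
Since the task asks how many words qualify, the count is just the length of each result list. Below is a small self-check, assuming the pattern with + quantifiers. The names PATTERN, CASES and find_words are chosen here for illustration, and the expected lists are the ones given in the question.

```python
import re

PATTERN = r"(?:[ *]|^)([a-zA-Z]+(?:-[a-zA-Z]+)*)(?= |$|[.,!?:]+(?!\S))"

# Test sentences mapped to the expected words from the question.
CASES = {
    'there should be 9 valid  words, including: a well-behave, right?':
        ['there', 'should', 'be', 'valid', 'words', 'including', 'a', 'well-behave', 'right'],
    'blabla! bla121 {{blabla123bla.. bla-blablabla!! b;a-bla@!. blabla bla-bla-bla-bla **bla-bla':
        ['blabla', 'bla-blablabla', 'blabla', 'bla-bla-bla-bla', 'bla-bla'],
    '{{)foo! ~~foo121 foo--foo?. foo-foo?!{. @foo-foo! f 23 foo2 f-ff-fff-ffff!.,?  **foo-f':
        ['f', 'f-ff-fff-ffff', 'foo-f'],
}


def find_words(sentence: str) -> list:
    """Return all valid words in the sentence, in order of appearance."""
    return re.findall(PATTERN, sentence, re.M)


for sentence, expected in CASES.items():
    words = find_words(sentence)
    # Print the number of valid words and whether the list matches the expectation.
    print(len(words), words == expected)
```

Note that the pattern only treats a space or * as a valid character before a word, which fits these three sentences; other inputs may need a wider set of leading characters.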