I have a series of words I am trying to capture, with the following constraints:
- The string ends with one of a fixed set of words.
- The number of words in the string is not fixed. However, it should capture all words that start with an uppercase letter (German language). Therefore, the left anchor should be the nearest word to the left that starts with a lowercase letter.
Example (the part between asterisks is what I am trying to capture):
I like *Apple Bananas And Cars*.
building houses *Might Be Salty Hard* said Jessica.
This is the regex I have tried so far. It only works if the "non-capture" part of the string does not include any uppercase words:
/(?:[a-zäöü]*)([\p{L} ().&] [Cars|Hard])/gu
CodePudding user response:
You might start the match with an uppercase character, allowing German uppercase chars as well, and then optionally repeat matching either words that start with an uppercase character or a "special" character.
Then end the match with an alternation matching either Hard or Cars.
(?<!\S)[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]*(?:\s+(?:[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]*|[+()&]))*\s+(?:Hard|Cars)\b
Explanation
(?<!\S)
  Assert a whitespace boundary to the left to prevent starting the match after a non-whitespace char
[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]*
  Match a word that starts with an uppercase char
(?:
  Non-capture group to match as a whole part
\s+
  Match 1+ whitespace chars
(?:
  Non-capture group
[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]*
  Match a word that starts with an uppercase char
|
  Or
[+()&]
  Match one of the "special" chars
)
  Close the non-capture group
)*
  Close the non-capture group and optionally repeat it
\s+
  Match 1+ whitespace chars
(?:Hard|Cars)
  Match one of the alternatives
\b
  A word boundary to prevent a partial word match
See a regex demo.
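As a rough illustration, the pattern from this answer could be run against the two example sentences like this. This is only a sketch: it assumes the whitespace quantifiers read `\s+` and that the special-character class is `[+()&]`, and the names `upperWord`, `germanRunPattern` and `findCapitalizedRuns` are made up for the example. Lookbehind (`(?<!...)`) needs a reasonably recent JavaScript engine.

```javascript
// A word that starts with an uppercase (German) letter.
const upperWord = "[A-ZÄÖÜß][a-zA-ZäöüßÄÖÜẞ]*";

// Assumed reading of the answer's pattern: `\s+` between words and
// `[+()&]` as the class of "special" characters.
const germanRunPattern = new RegExp(
  `(?<!\\S)${upperWord}(?:\\s+(?:${upperWord}|[+()&]))*\\s+(?:Hard|Cars)\\b`,
  "g"
);

// Hypothetical helper: returns every capitalized run that ends in "Hard" or "Cars".
function findCapitalizedRuns(text) {
  // With the g flag, match() returns all full matches, or null if there are none.
  return text.match(germanRunPattern) || [];
}

console.log(findCapitalizedRuns("I like Apple Bananas And Cars."));
// [ 'Apple Bananas And Cars' ]
console.log(findCapitalizedRuns("building houses Might Be Salty Hard said Jessica."));
// [ 'Might Be Salty Hard' ]
```

The leading "I" in the first sentence is not matched because the word after it ("like") starts with a lowercase letter, so the run cannot reach "Cars" from there.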
CodePudding user response:
Use \p{Lu} for uppercase letters:
(?:[\p{Lu}+()&][\p{L}+()&]* )+(?:Cars|Hard)
See live demo (showing matching umlauted letters and ß).
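A small sketch of how the `\p{Lu}` variant might be used in JavaScript, where Unicode property escapes require the `u` flag. It assumes the group is repeated with `+` and that the character classes contain `+` rather than a space; `unicodeRunPattern` and `findUnicodeRuns` are hypothetical names chosen for the example.

```javascript
// Assumed reading of the answer's pattern: one or more "word + space" groups,
// each starting with an uppercase letter (or a special char), then the end word.
const unicodeRunPattern = /(?:[\p{Lu}+()&][\p{L}+()&]* )+(?:Cars|Hard)/gu;

// Hypothetical helper: returns every run of capitalized words ending in "Cars" or "Hard".
function findUnicodeRuns(text) {
  return text.match(unicodeRunPattern) || [];
}

console.log(findUnicodeRuns("I like Apple Bananas And Cars."));
// [ 'Apple Bananas And Cars' ]
console.log(findUnicodeRuns("building houses Might Be Salty Hard said Jessica."));
// [ 'Might Be Salty Hard' ]
```

Unlike the first answer's pattern, this one has no left boundary or trailing `\b`, so it may start in the middle of a word or stop inside a longer one; whether that matters depends on the input.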