I'm trying to match all proper nouns in some given text.
So far I've got (?<![.?!]\s|^)(?<!\“)[A-Z][a-z]
which ignores capitalized words preceded by a ., ? or !
and a space, as well as words that directly follow an opening quotation mark (“). Can be seen here.
But it doesn't catch capital words at the beginning of sentences. So given the text:
Alec, Prince, so Genoa and Lucca are now just family estates of the “What”. He said no. He, being the Prince.
It successfully catches Prince, Genoa, Lucca but not Alec.
So I'd like some help modifying it, if possible, to match any capitalized word with nothing before it. (I'm not sure how to define "nothing".)
CodePudding user response:
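You can put the “
as the second alternative in the lookbehind instead of ^
which asserts the start of the string. Inside a negative lookbehind, ^ rules out a match at the very beginning of the text, which is why Alec was skipped.
Then you can also omit the separate (?<!\“)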
(?<![.?!]\s|“)[A-Z][a-z]
Explanation
(?<!    Negative lookbehind, assert that what is directly to the left of the current position is not
[.?!]\s    Match any of . ? ! followed by a whitespace char
|    Or
“    Match literally
)    Close lookbehind
[A-Z][a-z]    Match an uppercase char A-Z followed by a single char a-z
See a regex demo.
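For anyone who wants to try this outside the demo, here is a minimal sketch in Python. Two things in it are my own assumptions, not part of the pattern above: Python's built-in re module only accepts fixed-width lookbehinds, so the two alternatives are split into two separate negative lookbehinds (which should behave the same), and [a-z]+ is used instead of a single [a-z] so the whole word is returned. The names PROPER_NOUN and find_proper_nouns are made up for illustration.

```python
import re

# Assumption: the alternation (?<![.?!]\s|“) is split into two lookbehinds,
# because Python's `re` rejects alternatives of different widths.
# Assumption: `[a-z]+` instead of `[a-z]` so that full words are captured.
PROPER_NOUN = re.compile(r"(?<![.?!]\s)(?<!“)[A-Z][a-z]+")

TEXT = (
    "Alec, Prince, so Genoa and Lucca are now just family estates "
    "of the “What”. He said no. He, being the Prince."
)


def find_proper_nouns(text):
    """Return capitalized words that do not follow '. ', '? ', '! ' or “."""
    return PROPER_NOUN.findall(text)


print(find_proper_nouns(TEXT))
# ['Alec', 'Prince', 'Genoa', 'Lucca', 'Prince']
```

At the start of the string there is nothing to the left, so neither negative lookbehind can find its forbidden text and Alec is matched.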
CodePudding user response:
The thing you're looking for is called a "word boundary", which is denoted as \b
in a lot of regex languages.
Try \b[A-Z][a-z]*\b.
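A quick way to see what the word-boundary pattern picks up on the sample text from the question is the Python sketch below. The names CAPITALIZED_WORD and find_capitalized_words are hypothetical, and Python's re module is just one regex flavor that supports \b.

```python
import re

# \b matches between a word character and a non-word character
# (or the start/end of the string), so no lookbehind is involved.
CAPITALIZED_WORD = re.compile(r"\b[A-Z][a-z]*\b")

SAMPLE = (
    "Alec, Prince, so Genoa and Lucca are now just family estates "
    "of the “What”. He said no. He, being the Prince."
)


def find_capitalized_words(text):
    """Return every word that starts with A-Z followed only by a-z."""
    return CAPITALIZED_WORD.findall(text)


print(find_capitalized_words(SAMPLE))
# ['Alec', 'Prince', 'Genoa', 'Lucca', 'What', 'He', 'He', 'Prince']
```

On this input the pattern catches Alec at the start of the string, and it also returns the sentence-initial He and the quoted What, since \b on its own does not look at the preceding punctuation.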