How can I find words with three or more vowels of the same kind with a regular expression using back referencing?
I'm searching in text with a 3-column tab format "Word PoS Lemma".
This is what I have so far:
ggrep -P -i --colour=always '^\w*([aeioueöäüèéà])\w*?\1\w*?\1\w*?\t' filename
However, this gives me words with three vowels but not of the same kind.
I'm confused, because I thought the backreference would refer to the same vowel it matched in the brackets. I solved this problem by changing the .*? to \w*?.
But I still need to know how I can achieve the "or more" part.
Thanks for the help!
CodePudding user response:
Your regex looks more complicated than it needs to be. I'm not sure what you're trying to accomplish with the .*?, but the usage looks suspect. I'd use something like:
([aeioueöäüèéà])\1\1
i.e. capture a vowel in a group, then require the same vowel twice more, directly after it.
Edit: I didn't realise you wanted to allow other letters between the vowels. For that, allow zero or more "word" characters before each backreference:
([aeioueöäüèéà])(\w*\1){2}
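To try this against the three-column format from the question, here is a minimal sketch. The function name `triple_vowel_words` and the sample lines are made up for illustration. It assumes GNU grep with PCRE support (`-P`) and, for the non-ASCII vowels, a UTF-8 locale. The duplicate `e` in the class is dropped, which does not change what it matches. Anchoring with `^\w*` keeps the match inside the first column, because `\w` never matches a tab. The `{2}` already covers "three or more": the pattern is not anchored at the end, so any further occurrences are simply not inspected.

```shell
# Hypothetical helper: print lines whose first column (the word) contains
# the same vowel at least three times. Reads the named files, or stdin.
triple_vowel_words() {
  grep -P -i '^\w*([aeiouöäüèéà])(\w*\1){2}' "$@"
}

# Made-up sample data in the "Word<TAB>PoS<TAB>Lemma" format.
printf 'banana\tNN\tbanana\nhouses\tNNS\thouse\nelement\tNN\telement\naudio\tNN\taudio\n' |
  triple_vowel_words
```

On this sample it should print only the `banana` and `element` lines, since `houses` and `audio` contain three vowels but no vowel three times.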
CodePudding user response:
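Using grep. The backreference has to sit outside the bracket expression, because inside [...] a \2 is read as the literal characters \ and 2. GNU grep accepts \w and backreferences with -E:
$ grep -E '^\w*([aeiouöäüèéà])(\w*\1){2,}' input_file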
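A quick way to see how a bracket expression treats `\2` is to compare two patterns on two words. This is only a sketch: the function names are made up, the vowel class is shortened to ASCII to keep the demo independent of the locale, and it assumes GNU grep, which accepts `\w` and backreferences with `-E`.

```shell
# Hypothetical helpers; the vowel class is shortened to ASCII for the demo.
# Inside [...], \2 is just the two characters "\" and "2", not a backreference,
# so this pattern accepts any three vowels.
bracket_version() {
  grep -E '(([aeiou])[^\2]*){3,}' "$@"
}

# Here the backreference is outside the brackets, so the same vowel must repeat.
backref_version() {
  grep -E '([aeiou])(\w*\1){2,}' "$@"
}

printf 'audio\nbanana\n' | bracket_version   # prints both words
printf 'audio\nbanana\n' | backref_version   # prints only banana
```

So a pattern built on `[^\2]` reports every word with three vowels of any kind, which is the behaviour the question was trying to get away from.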