I want to grep words that have a hyphen in the middle and start with an uppercase letter, and words that start with an uppercase letter without a hyphen

I want a regex that matches words that have a hyphen in the middle and start with an uppercase letter, as well as words that start with an uppercase letter and have no hyphen.
Also, I want only the first letter to be uppercase and all the others lowercase; something like ENGLAND is not what I need, because all of its letters are uppercase.

Here is an example of each kind of word I want:

Wilkes-Barre
California

I have tried:

[A-Z][a-z-]\+[A-Z][a-z]\+

but it only matches things like Wilkes-Barre; it doesn't match California. I also tried:

[A-Z][a-z-]\+

This one matches things like California, but it matches Wilkes-Barre as if it were two words: Wilkes- and Barre.
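The split can be reproduced with a sketch like the one below. It assumes the matches were listed with grep -o (the question does not say how they were listed), and it uses the POSIX interval \{1,\} as a portable stand-in for the GNU \+. The variable name parts and the LC_ALL=C setting are illustrative choices, not part of the original attempt.

```shell
# Assumption: matches are listed with -o, one match per output line.
# \{1,\} is the POSIX BRE spelling of "one or more" (GNU grep also accepts \+).
# LC_ALL=C keeps [A-Z] and [a-z] as plain ASCII ranges.
parts=$(printf '%s\n' 'Wilkes-Barre' 'California' |
  LC_ALL=C grep -o '[A-Z][a-z-]\{1,\}')

# Wilkes-Barre is reported as two separate matches because the
# character class [a-z-] cannot consume the capital B.
printf '%s\n' "$parts"
```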

So can someone please help me find a regex that matches both types of words, so that if I grep a file that contains

Wilkes-Barre
California
ENGLAND
rome

it will match only the first two lines and give 2 matches, not 3.

CodePudding user response：

You do not specify whether a single upper-case letter should match. Let's assume the answer is yes. The following should do what you want:

$ grep -E '^((^|-)[A-Z][a-z]*)+$' data.txt
Wilkes-Barre
California

It matches entire lines (because of the leading ^ and trailing $) made of one or more tokens (one or more because of the +), where each token is a hyphen or the beginning of the line ((^|-)) followed by a single upper-case letter ([A-Z]) and zero or more lower-case letters ([a-z]*).

If there must be at least one lower-case letter after the upper-case letter, just replace the * with a +:

grep -E '^((^|-)[A-Z][a-z]+)+$' data.txt
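To make the difference between the * and + variants concrete, here is a small sketch. The sample lines (the question's four lines plus a lone A) and the variable names are made up for illustration, and LC_ALL=C is an assumption added so that [A-Z] and [a-z] are not affected by locale collation.

```shell
# Illustrative input: the question's four lines plus a single capital letter.
sample='Wilkes-Barre
California
ENGLAND
rome
A'

# "*" variant: zero or more lowercase letters, so the lone "A" also matches.
star=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^((^|-)[A-Z][a-z]*)+$')

# "+" variant: at least one lowercase letter after each capital, so "A" is dropped.
plus=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^((^|-)[A-Z][a-z]+)+$')

printf 'star variant:\n%s\n' "$star"
printf 'plus variant:\n%s\n' "$plus"
```

Neither variant accepts ENGLAND or rome: after the first capital, every further capital must be introduced by a hyphen.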

These regexes also match a line like -Foobar. If this is not wanted, the following excludes lines that start with a hyphen:

grep -E '^[A-Z][a-z]*(-[A-Z][a-z]*)*$' data.txt

or (if at least one lower-case letter is required):

grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)*$' data.txt
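A quick way to check the leading-hyphen case is to run both forms on the same input. This is only a sketch: the sample lines and the names loose and strict are illustrative, and LC_ALL=C is again an assumption to keep the ranges ASCII-only.

```shell
# Illustrative input; "-Foobar" is the problem case.
sample='Wilkes-Barre
-Foobar
California'

# (^|-) lets the very first token start with a hyphen, so "-Foobar" is accepted.
loose=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^((^|-)[A-Z][a-z]*)+$')

# Requiring a capital right after ^ rejects lines that start with a hyphen.
strict=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^[A-Z][a-z]*(-[A-Z][a-z]*)*$')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```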

Finally, if there is at most one hyphen (no Foo-Bar-Baz):

grep -E '^[A-Z][a-z]*(-[A-Z][a-z]*)?$' data.txt

or:

grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)?$' data.txt
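The effect of the final quantifier can be checked with a sketch like this one, which compares ? (at most one hyphenated part) with * (any number). The input lines and variable names are illustrative, and LC_ALL=C is an assumption.

```shell
# Illustrative input with zero, one and two hyphens.
sample='Foo
Foo-Bar
Foo-Bar-Baz'

# "?" after the group: at most one "-Xxx" part, so Foo-Bar-Baz is rejected.
at_most_one=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)?$')

# "*" after the group: any number of "-Xxx" parts, so all three lines match.
any_number=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)*$')

printf 'at most one hyphen:\n%s\n' "$at_most_one"
printf 'any number of hyphens:\n%s\n' "$any_number"
```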

CodePudding user response：

You can use

grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$'

Here is a demo:

#!/bin/bash
s='Wilkes-Barre
California'
grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$' <<< "$s"

Output:

Wilkes-Barre
California

POSIX ERE pattern details:

^ - start of string (here, of each line)
[[:upper:]] - an uppercase letter
[[:lower:]]+ - one or more lowercase letters
(-[[:upper:]][[:lower:]]*)? - an optional occurrence of a hyphen, an uppercase letter and then zero or more lowercase letters
$ - end of string (here, of each line).

NOTE: If you need to match strings with more than one hyphen, replace the last ? with *.
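Since the question asks for 2 matches rather than 3, the count can be checked with grep -c, which counts matching lines. The following is a sketch on the question's sample input; the variable names sample, pattern and count are illustrative.

```shell
# The four lines from the question.
sample='Wilkes-Barre
California
ENGLAND
rome'

pattern='^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$'

# -c prints the number of matching lines; Wilkes-Barre counts once
# because the pattern matches the whole line, not its two halves.
count=$(printf '%s\n' "$sample" | grep -E -c "$pattern")
printf '%s\n' "$count"
```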

CodePudding user response：

Normally the answer should be:

grep "^[A-Z][a-z-]+" test.txt

However, on my system the plus sign is not recognised (without -E, grep uses basic regular expressions, where an unescaped + is a literal character), so I have to go for:

grep "^[A-Z][a-z-][a-z-]*" test.txt

Explanation:

^      : start of the line
[A-Z]  : any single uppercase letter
[a-z-] : any single lowercase letter or a hyphen
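As a hedged aside on the plus sign: the sketch below shows two spellings of "one or more" that should behave the same, the POSIX interval \{1,\} in a basic regular expression and + under -E. The sample input, the variable names and the LC_ALL=C setting are assumptions added for illustration.

```shell
# The four lines from the question.
sample='Wilkes-Barre
California
ENGLAND
rome'

# Basic regular expression: \{1,\} is the POSIX interval for "one or more".
bre=$(printf '%s\n' "$sample" | LC_ALL=C grep '^[A-Z][a-z-]\{1,\}')

# Extended regular expression (-E): "+" means "one or more" with no backslash.
ere=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^[A-Z][a-z-]+')

printf 'BRE:\n%s\n' "$bre"
printf 'ERE:\n%s\n' "$ere"
```

Both forms are equivalent to the [a-z-][a-z-]* spelling; none of them is anchored at the end of the line, so they only check how a line starts.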

Edit after comment:
This, however, only matches the first part of Wilkes-Barre. If you want both parts, you might try this:

egrep "^[A-Z][a-z-]+|^[A-Z][a-z-]+[A-Z][a-z-]+" test.txt