I want a regex that matches words that start with an uppercase letter and have a hyphen in the middle, as well as words that start with an uppercase letter and have no hyphen.
Also, I want only the first letter to be uppercase and all the others lowercase. Something like ENGLAND is not what I need, because all its letters are uppercase.
Here are examples of the word structures I want:
Wilkes-Barre
California
I have tried:
[A-Z][a-z-]\+[A-Z][a-z]\+
but it only matches things like Wilkes-Barre; it doesn't match California. I also tried:
[A-Z][a-z-]\+
This one matches things like California, but it matches Wilkes-Barre as if it were 2 words: Wilkes- and Barre.
So could someone please help me find a regex that matches both types of words, so that if I grep a file that has
Wilkes-Barre
California
ENGLAND
rome
It will only match the first 2 and it will give 2 matches not 3.
CodePudding user response:
You do not specify whether a single upper-case letter should match. Let's assume the answer is yes. The following should do what you want:
$ grep -E '^((^|-)[A-Z][a-z]*)+$' data.txt
Wilkes-Barre
California
It matches entire lines (because of the leading ^ and trailing $) of one or more tokens (one or more because of the +), where each token is a hyphen or the beginning of the line ((^|-)) followed by a single upper-case letter ([A-Z]) and zero or more lower-case letters ([a-z]*).
If there must be at least one lower-case letter after the upper-case letter, just replace the * by a +:
grep -E '^((^|-)[A-Z][a-z]+)+$' data.txt
These regexes also match a line like -Foobar. If this is not wanted, the following excludes lines that start with a hyphen:
grep -E '^[A-Z][a-z]*(-[A-Z][a-z]*)*$' data.txt
or (if at least one lower case letter is required):
grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)*$' data.txt
Finally, if there is at most one hyphen (no Foo-Bar-Baz):
grep -E '^[A-Z][a-z]*(-[A-Z][a-z]*)?$' data.txt
or:
grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)?$' data.txt
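As a quick check of how these variants differ, here is a minimal sketch that runs the most permissive and the most restrictive pattern against a made-up sample. The extra lines A, -Foobar and Foo-Bar-Baz are not from the question, and LC_ALL=C is set on the assumption that [A-Z] and [a-z] should mean plain ASCII ranges.

```shell
#!/bin/sh
# Assumption: force the C locale so [A-Z] and [a-z] are plain ASCII ranges.
export LC_ALL=C

# Hypothetical sample: the question's four lines plus three edge cases.
sample='Wilkes-Barre
California
ENGLAND
rome
A
-Foobar
Foo-Bar-Baz'

# Permissive: single upper-case letter, leading hyphen and several hyphens allowed.
loose=$(printf '%s\n' "$sample" | grep -E '^((^|-)[A-Z][a-z]*)+$')

# Restrictive: no leading hyphen, at least one lower-case letter per part, at most one hyphen.
strict=$(printf '%s\n' "$sample" | grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)?$')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

With this sample, the permissive pattern should keep Wilkes-Barre, California, A, -Foobar and Foo-Bar-Baz, while the restrictive one keeps only Wilkes-Barre and California.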
CodePudding user response:
You can use
grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$'
See the online demo:
#!/bin/bash
s='Wilkes-Barre
California'
grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$' <<< "$s"
Output:
Wilkes-Barre
California
POSIX ERE pattern details:
^ - start of string
[[:upper:]] - an uppercase letter
[[:lower:]]+ - one or more lowercase letters
(-[[:upper:]][[:lower:]]*)? - an optional occurrence of a hyphen, an uppercase letter, and then zero or more lowercase letters
$ - end of string.
NOTE: If you need to match strings with more than one hyphen, replace the last ? with *.
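To illustrate the note about several hyphens, the following sketch compares the ? and * forms of the optional group. The line Foo-Bar-Baz is an invented test case, and LC_ALL=C is an assumption made so that the POSIX classes cover ASCII letters only.

```shell
#!/bin/sh
# Assumption: C locale, so [[:upper:]] and [[:lower:]] are ASCII-only.
export LC_ALL=C

# Hypothetical sample with one multi-hyphen line added.
sample='Wilkes-Barre
California
Foo-Bar-Baz
ENGLAND'

# Optional group with ?: at most one hyphenated part after the first word.
one_hyphen=$(printf '%s\n' "$sample" |
  grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$')

# Optional group with *: any number of hyphenated parts.
many_hyphens=$(printf '%s\n' "$sample" |
  grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)*$')

printf 'with ?:\n%s\n\nwith *:\n%s\n' "$one_hyphen" "$many_hyphens"
```

Here the ? form should print Wilkes-Barre and California, and the * form should additionally print Foo-Bar-Baz.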
CodePudding user response:
Normally the answer should be:
grep "^[A-Z][a-z-]+" test.txt
However on my system, the plus-sign is not recognised, so I have to go for:
grep "^[A-Z][a-z-][a-z-]*" test.txt
Explanation:
^ : start of the line
[A-Z] : all possible uppercase letters
[a-z-] : all possible lowercase letters or a hyphen
Edit after comment
This, however, only shows the first part of Wilkes-Barre. If you want both, you might try this:
egrep "^[A-Z][a-z-]+|^[A-Z][a-z-]+[A-Z][a-z-]+" test.txt
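To see which part of each line a pattern really matches, the -o option (available in GNU and BSD grep, not required by POSIX) prints only the matched text. The sketch below is one possible way to avoid the plus sign altogether: it uses a basic regular expression with \( \) grouping and a trailing $ so the whole line has to fit. The pattern is an illustration under those assumptions, not the only option.

```shell
#!/bin/sh
# Assumption: C locale, so [A-Z] and [a-z] are plain ASCII ranges.
export LC_ALL=C

sample='Wilkes-Barre
California
ENGLAND
rome'

# Without a trailing $, the match stops at the first character outside [a-z-].
partial=$(printf '%s\n' "$sample" | grep -o '^[A-Z][a-z-][a-z-]*')

# BRE without +: one capitalised word, then any number of -Capitalised parts, up to end of line.
whole=$(printf '%s\n' "$sample" | grep -o '^[A-Z][a-z]*\(-[A-Z][a-z]*\)*$')

# -c counts matching lines.
count=$(printf '%s\n' "$sample" | grep -c '^[A-Z][a-z]*\(-[A-Z][a-z]*\)*$')

printf 'partial:\n%s\n\nwhole:\n%s\n\ncount: %s\n' "$partial" "$whole" "$count"
```

With this sample, the unanchored pattern should report Wilkes- and California, while the anchored one reports Wilkes-Barre and California and a count of 2.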