Write a regex to match title case sentence but ignores numbers and special chars (Ex: Math 101 - Pro

Time:10-01

I need sentences to pass if all words are proper case, even if there are numbers and/or special characters in the sentence. I need it to fail if any words are not proper case.

This S.O. post is an almost perfect solution. I need to know how to update it to ignore numbers and special characters. I tried finding ways to combine it with other solutions, but did not have any luck.

This will be used to evaluate the output of a spreadsheet cell in Google Sheets.

Any help is greatly appreciated.

I need these to pass.

Math 101 - Prof. Smith
Mr. Smith's Math 101 Class
500 Building: Room # 200 With Mr. Smith
Mr. Smith's Math 101 Class Is In Room 500
I Am Mr. Smith

I need these to fail.

Math 101 - prof. Smith
mr. Smith's Math 101 Class
500 Building: rOom # 200 With Mr. Smith
Mr. Smith's math101 Class Is In room#500
i Am Mr. Smith

CodePudding user response:

One idea is to check that each "word" starts with either an uppercase letter or a digit.

^(?:\W*[A-Z\d][\w']*\b)+\W*$

See this demo at regex101 (the demo uses [^\w\n] in place of \W so that a match cannot run across lines).

I was not sure about cases such as FOO (all uppercase) or mixed case, and treated those as valid. This regex only checks that the first character of each sequence of word characters and single quotes, as separated by non-word characters, is either an uppercase letter [A-Z], as in your linked question, or a digit.
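As a quick check outside the spreadsheet, here is a minimal Python sketch that runs the pattern against the examples from the question. It assumes the pattern carries a + quantifier after the group, and the names TITLE_CASE and is_title_case are made up for this illustration. Python's re module is not the engine Google Sheets uses, so treat this as a sanity check rather than proof of how the sheet will behave.

```python
import re

# Assumed form of the pattern: one or more "words", each starting with
# an uppercase ASCII letter or a digit, followed by optional non-word chars.
TITLE_CASE = re.compile(r"^(?:\W*[A-Z\d][\w']*\b)+\W*$")

SHOULD_PASS = [
    "Math 101 - Prof. Smith",
    "Mr. Smith's Math 101 Class",
    "500 Building: Room # 200 With Mr. Smith",
    "Mr. Smith's Math 101 Class Is In Room 500",
    "I Am Mr. Smith",
]

SHOULD_FAIL = [
    "Math 101 - prof. Smith",
    "mr. Smith's Math 101 Class",
    "500 Building: rOom # 200 With Mr. Smith",
    "Mr. Smith's math101 Class Is In room#500",
    "i Am Mr. Smith",
]


def is_title_case(text: str) -> bool:
    """Return True when every word starts with an uppercase letter or digit."""
    return TITLE_CASE.match(text) is not None


if __name__ == "__main__":
    for sample in SHOULD_PASS + SHOULD_FAIL:
        print(f"{is_title_case(sample)!s:5}  {sample}")
```

Because only the first character of each word is inspected, an all-uppercase word such as FOO is accepted, as noted above.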

Alternatively, consider rejecting any string that contains a word starting with a lowercase letter: (?:[^\w']|^)[a-z]
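A minimal Python sketch of this inverted check, where a string is valid when the pattern finds nothing. The names LOWER_START and starts_word_in_lowercase are hypothetical, and the samples are taken from the question.

```python
import re

# Matches a lowercase ASCII letter at the start of the string or right after
# a character that is neither a word character nor a single quote.
LOWER_START = re.compile(r"(?:[^\w']|^)[a-z]")


def starts_word_in_lowercase(text: str) -> bool:
    """Return True when some word in text begins with a lowercase letter."""
    return LOWER_START.search(text) is not None


if __name__ == "__main__":
    for sample in (
        "Mr. Smith's Math 101 Class",                # no match expected: valid
        "Math 101 - prof. Smith",                    # "prof" starts lowercase
        "Mr. Smith's math101 Class Is In room#500",  # "math101" starts lowercase
    ):
        verdict = "invalid" if starts_word_in_lowercase(sample) else "valid"
        print(f"{verdict:7}  {sample}")
```

The single quote is excluded from the separator class so that the s in Smith's is not treated as the start of a new word.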

CodePudding user response:

Per comments on the question, you want to match data formatted by Excel's PROPER() function, or a work-alike, and to avoid matching data that is inconsistent with that format. This, then, is what PROPER() does:

Capitalizes the first letter in a text string and any other letters in text that follow any character other than a letter. Converts all other letters to lowercase letters.

Note that in its full generality, "letter" does not necessarily mean Latin letter, and it is inclusive of letters bearing diacritical marks. Perhaps, though, those are characteristics that you don't actually need to worry about for your particular inputs.
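To make that description concrete, here is a rough Python sketch that follows the quoted wording literally. The function names proper and is_proper are made up for this illustration, and the real spreadsheet function may differ in edge cases. It relies on str.isalpha, which treats non-Latin and accented letters as letters.

```python
def proper(text: str) -> str:
    """Uppercase each letter that starts the string or follows a non-letter;
    lowercase every other letter. Non-letters are copied unchanged."""
    out = []
    prev_is_letter = False
    for ch in text:
        if ch.isalpha():
            out.append(ch.lower() if prev_is_letter else ch.upper())
            prev_is_letter = True
        else:
            out.append(ch)
            prev_is_letter = False
    return "".join(out)


def is_proper(text: str) -> bool:
    """A string is in this format when the transformation leaves it unchanged."""
    return proper(text) == text


if __name__ == "__main__":
    print(proper("math 101 - prof. smith"))      # Math 101 - Prof. Smith
    print(proper("mr. smith's math 101 class"))  # Mr. Smith'S Math 101 Class
```

Under this literal reading the letter after an apostrophe is capitalized, so Smith's is not itself in that format.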

You did not specify what flavor of regex you are using, but this will work with many of them:

^([[:upper:]][[:lower:]]*)?([^[:alpha:]]+[[:upper:]][[:lower:]]*)*[^[:alpha:]]*$

Explanation:

  • [[:upper:]] and [[:lower:]] match uppercase and lowercase letters, respectively, as such are understood by the regex implementation. That often will include letters in each category that bear diacritical marks, and it may include non-Latin letters.

    The strings you want to match contain uppercase letters only at the beginning of the string or after a non-letter. They contain lowercase letters only immediately following another letter (upper- or lower-case). Thus, all maximal runs of letters in the string must match [[:upper:]][[:lower:]]* -- one uppercase letter followed by zero or more lowercase.

  • Those maximal runs of letters must be bounded on each side by one of three things:

    • the beginning of the string, which is matched by ^; or
    • the end of the string, which is matched by $; or
    • a run of one or more non-letters. This is matched by [^[:alpha:]]+, because [[:alpha:]] matches the union of [[:upper:]] and [[:lower:]], the ^ negates the class, and the + quantifier causes the class to match any number of characters greater than zero.
  • Altogether then, we have

    1. ^ - the beginning of the line
    2. ([[:upper:]][[:lower:]]*)? - optionally, one titlecase word
    3. ([^[:alpha:]]+[[:upper:]][[:lower:]]*)* - zero or more additional titlecase words, each separated from the preceding word by a nonempty run of non-letters
    4. [^[:alpha:]]* - zero or more trailing non-letters
    5. $ - the end of the line

That regex will require minor adaptation to work with some regex engines, but some variation on it will work with any regex engine I can think of. It will work as-is with many.
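As one example of such an adaptation, Python's re module does not support POSIX bracket classes, so the sketch below substitutes ASCII ranges. That substitution is an assumption of this illustration and gives up the non-Latin and accented letters mentioned earlier. The names PROPER_CASE and is_proper_case are hypothetical.

```python
import re

# ASCII stand-ins for the POSIX classes (an assumption of this sketch):
#   [[:upper:]]  -> [A-Z]
#   [[:lower:]]  -> [a-z]
#   [^[:alpha:]] -> [^A-Za-z]
PROPER_CASE = re.compile(
    r"^([A-Z][a-z]*)?"              # optionally, one titlecase word
    r"([^A-Za-z]+[A-Z][a-z]*)*"     # more words, each after a run of non-letters
    r"[^A-Za-z]*$"                  # trailing non-letters
)


def is_proper_case(text: str) -> bool:
    """Return True when every maximal run of letters is Upper + lower*."""
    return PROPER_CASE.match(text) is not None


if __name__ == "__main__":
    for sample in (
        "Math 101 - Prof. Smith",
        "500 Building: Room # 200 With Mr. Smith",
        "Math 101 - prof. Smith",
        "500 Building: rOom # 200 With Mr. Smith",
        "Mr. Smith's Math 101 Class",
    ):
        print(f"{is_proper_case(sample)!s:5}  {sample}")
```

Note that the last sample is rejected: the s after the apostrophe is a lowercase letter following a non-letter, which is consistent with the PROPER() description quoted above but not with the question's list of strings that should pass.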
