How can I use Regex to abbreviate words that all start with a capital letter

Time:03-07

I want to write a Python script that abbreviates runs of capitalized words in a string. For example, "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia" should become "I studied in KSU, which is in Riyadh, the capital of SA".

I tried using re.sub with a lambda to scan the whole string, but I couldn't remove the rest of each word after finding its capital letter.

import re
str = "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia"

result = re.sub(r"\b[A-Z]", lambda x: x.group(), str)

print(result)

CodePudding user response:

You need to actually consume two or more words starting with an uppercase letter.

You can use something like

result = re.sub(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)+", lambda x: "".join(c[0] for c in x.group().split()), text)

See the Python demo:

import re
text = "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia"
result = re.sub(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)+", lambda x: "".join(c[0] for c in x.group().split()), text)
print(result)
# => I studied in KSU, which is in Riyadh, the capital of SA

See the regex demo. Details:

  • \b - a word boundary
  • [A-Z] - an uppercase ASCII letter
  • \w* - zero or more word chars
  • (?:\s+[A-Z]\w*)+ - one or more occurrences of
    • \s+ - one or more whitespaces
    • [A-Z]\w* - an uppercase ASCII letter and then zero or more word chars.

The "".join(c[0] for c in x.group().split()) part takes the first character of each whitespace-separated chunk of the match value and joins them into a single string.
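To make the pieces described above easier to reuse, here is a minimal sketch that precompiles the pattern and moves the lambda into a named function. The names CAPITALIZED_RUN, initials and abbreviate are made up for illustration and are not part of the original answer.

```python
import re

# Two or more whitespace-separated words, each starting with an ASCII capital.
CAPITALIZED_RUN = re.compile(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)+")


def initials(match):
    """Return the first character of each word in the matched run."""
    return "".join(word[0] for word in match.group().split())


def abbreviate(text):
    """Replace every run of capitalized words with its initials."""
    return CAPITALIZED_RUN.sub(initials, text)


print(abbreviate("She works at the World Health Organization in Geneva"))
# => She works at the WHO in Geneva
print(abbreviate("Paris is lovely"))
# => Paris is lovely
```

A single capitalized word is left alone because the pattern requires at least two words. One caveat: a capitalized word at the start of a sentence is absorbed into a following name, so "The United Nations" becomes "TUN".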

To support all Unicode uppercase letters, I'd advise using the PyPI regex module:

import regex
#...
result = regex.sub(r"\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*)+", lambda x: "".join(c[0] for c in x.group().split()), text)

where \p{Lu} matches any Unicode uppercase letter and \p{L} matches any Unicode letter.
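Since regex is a third-party package, a script may want to keep working when it is not installed. The following is only a sketch of one way to do that, assuming it is acceptable to fall back to the ASCII-only pattern. The helper names first_letters and abbreviate are hypothetical.

```python
import re

try:
    import regex  # third-party: pip install regex
except ImportError:
    regex = None  # fall back to the standard-library re module


def first_letters(match):
    """Return the first character of each word in the matched run."""
    return "".join(word[0] for word in match.group().split())


def abbreviate(text):
    if regex is not None:
        # Unicode-aware: \p{Lu} is any uppercase letter, \p{L} any letter.
        return regex.sub(r"\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*)+", first_letters, text)
    # ASCII-only fallback: words starting with a non-ASCII capital are missed.
    return re.sub(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)+", first_letters, text)


print(abbreviate("I studied in King Saud University"))
# => I studied in KSU
```

For ASCII input like this, both branches give the same result. They differ only for words that begin with a non-ASCII capital letter or contain digits or underscores.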

CodePudding user response:

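Here is a solution that works for this particular phrase. I removed the lambda function and replaced every occurrence of three consecutive words starting with the capital letters K, S and U, each followed by lowercase letters, with "KSU". Note that it leaves "Saudi Arabia" unchanged.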

import re
str = "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia"
find = re.compile(r"[K][a-z]+\s[S][a-z]+\s[U][a-z]+")
substitute = "KSU"

print(find.sub(substitute,str))
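If hardcoding is acceptable, the same idea can be extended to several phrases with a lookup table. This is a rough sketch, not the answerer's code: the ABBREVIATIONS dictionary and the abbreviate_known function are hypothetical names, and the table would need to list every phrase you care about.

```python
import re

# Hypothetical lookup table; extend it with whatever phrases you need.
ABBREVIATIONS = {
    "King Saud University": "KSU",
    "Saudi Arabia": "SA",
}

# Try longer phrases first so a long phrase wins over a shorter one it contains.
_phrases = sorted(ABBREVIATIONS, key=len, reverse=True)
_pattern = re.compile(r"\b(?:" + "|".join(re.escape(p) for p in _phrases) + r")\b")


def abbreviate_known(text):
    """Replace each known phrase with its abbreviation from the table."""
    return _pattern.sub(lambda m: ABBREVIATIONS[m.group()], text)


text = "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia"
print(abbreviate_known(text))
# => I studied in KSU, which is in Riyadh, the capital of SA
```

Unlike the general pattern, this only touches phrases listed in the table, which avoids accidental abbreviations but has to be maintained by hand.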