I want to abbreviate words in a string with a Python script. For example, "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia" becomes "I studied in KSU, which is in Riyadh, the capital of SA".
I tried using a lambda to scan the whole string, but I couldn't remove the rest of each word after matching its first letter.
import re
str = "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia"
result = re.sub(r"\b[A-Z]", lambda x: x.group() ,str)
print(result)
CodePudding user response:
You need to actually consume two or more words starting with an uppercase letter.
You can use something like
result = re.sub(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)+", lambda x: "".join(c[0] for c in x.group().split()), text)
See the Python demo:
import re
text = "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia"
result = re.sub(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)+", lambda x: "".join(c[0] for c in x.group().split()), text)
print(result)
# => I studied in KSU, which is in Riyadh, the capital of SA
See the regex demo. Details:
\b - a word boundary
[A-Z] - an uppercase ASCII letter
\w* - zero or more word chars
(?:\s+[A-Z]\w*)+ - one or more occurrences of:
    \s+ - one or more whitespaces
    [A-Z]\w* - an uppercase ASCII letter and then zero or more word chars.
The "".join(c[0] for c in x.group().split()) part takes the first char of each non-whitespace chunk in the match value and joins them into a single string.
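To see the callback in isolation, here is a minimal sketch that wraps the same pattern in a helper. The names PATTERN, initials and abbreviate are only for illustration and are not part of the answer above.

```python
import re

# Same pattern as above: a capitalized word followed by one or more
# whitespace-separated capitalized words.
PATTERN = re.compile(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)+")


def initials(match):
    # match.group() is the whole run, e.g. "King Saud University".
    # split() breaks it on whitespace; word[0] keeps each first letter.
    return "".join(word[0] for word in match.group().split())


def abbreviate(text):
    # Hypothetical helper name, used here only for illustration.
    return PATTERN.sub(initials, text)


print(abbreviate("King Saud University"))  # KSU
print(abbreviate("I live in Riyadh"))      # unchanged: no run of two or more capitalized words
```

A single capitalized word such as Riyadh is left alone because the group (?:\s+[A-Z]\w*)+ has to match at least once.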
To support all Unicode uppercase letters, I'd advise using the PyPI regex module:
import regex
#...
result = regex.sub(r"\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*)+", lambda x: "".join(c[0] for c in x.group().split()), text)
where \p{Lu} matches any Unicode uppercase letter and \p{L} matches any Unicode letter.
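If installing a third-party module is not an option, a rough standard-library alternative is to find words with \w+ and test the first character with str.isupper(). This is a different technique from \p{Lu}, not an equivalent: \w also matches digits and underscores, and the helper name abbreviate_unicode is hypothetical.

```python
import re

WORD = re.compile(r"\w+")  # for str patterns, \w is Unicode-aware in Python 3


def abbreviate_unicode(text):
    # Hypothetical helper: collects runs of capitalized words separated only
    # by whitespace and replaces each run of two or more with its initials.
    out = []
    pos = 0   # end of the part of text already copied to out
    run = []  # match objects in the current run of capitalized words

    def flush():
        nonlocal pos
        if len(run) >= 2:
            out.append(text[pos:run[0].start()])
            out.append("".join(m.group()[0] for m in run))
            pos = run[-1].end()
        run.clear()

    for m in WORD.finditer(text):
        capitalized = m.group()[0].isupper()
        # The run continues only if this word is capitalized and nothing
        # but whitespace separates it from the previous word in the run.
        if run and not (capitalized and text[run[-1].end():m.start()].isspace()):
            flush()
        if capitalized:
            run.append(m)
    flush()
    out.append(text[pos:])
    return "".join(out)


print(abbreviate_unicode("Ärzte Ohne Grenzen helfen"))  # ÄOG helfen
```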
CodePudding user response:
Here is a solution that works for this specific phrase. I removed the lambda function and replaced every occurrence of three consecutive words starting with the capital letters K, S, and U, each followed by lowercase letters, with KSU.
import re
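text = "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia"
# K, S and U, each followed by one or more lowercase letters, separated by single whitespace chars
find = re.compile(r"K[a-z]+\sS[a-z]+\sU[a-z]+")
substitute = "KSU"
print(find.sub(substitute, text))
# => I studied in KSU, which is in Riyadh, the capital of Saudi Arabia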
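The fixed-phrase idea can be extended to several phrases with a lookup table. The sketch below assumes you know the phrases in advance; the ABBREVIATIONS table and the abbreviate_known name are made up for illustration.

```python
import re

# Hypothetical lookup table: full phrase -> abbreviation.
ABBREVIATIONS = {
    "King Saud University": "KSU",
    "Saudi Arabia": "SA",
}

# Try longer phrases first so a long phrase wins over a shorter one it contains.
# re.escape keeps any special characters in the phrases literal.
phrases = sorted(ABBREVIATIONS, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b")


def abbreviate_known(text):
    # Look up each matched phrase in the table.
    return pattern.sub(lambda m: ABBREVIATIONS[m.group()], text)


text = "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia"
print(abbreviate_known(text))
# => I studied in KSU, which is in Riyadh, the capital of SA
```

Unlike the general regex approach, this only touches phrases listed in the table, which avoids abbreviating capitalized runs you did not intend to shorten.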