I'm trying to write a function that builds an acronym using only regular expressions.
Example: Data Science = DS
I'm trying to do this in 3 steps:
- Find the first letter of each word.
- Convert each of those letters to uppercase.
- Join the letters together.
Unfortunately, I get errors. I repeat that I need to use regular expressions to create the acronym.
import re
some_words = 'Data Science'
all_words_select = r'(\b\w)'
word_upper = re.sub(all_words_select, some_words.upper(), some_words)
print(word_upper)
result: DATA SCIENCEata DATA SCIENCEcience
Why is the text duplicated? I plan to get: DATA SCIENCE
CodePudding user response:
You don't need regex for the problem you have stated. You can split the string on spaces, take the first character of each word, convert it to uppercase, and finally join the results.
>>> ''.join(w[0].upper() for w in some_words.split(' '))
'DS'
You will also need to handle special cases, such as a word that starts with a non-alphabetic character, with something like if w[0].isalpha().
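The guard mentioned above could look like the sketch below. The function name acronym_by_split is hypothetical, and calling split() with no argument (instead of split(' ')) is my own assumption, chosen so that repeated spaces do not produce empty words that would break w[0].

```python
def acronym_by_split(text):
    """Build an acronym from the first letter of each whitespace-separated word."""
    # split() with no argument collapses runs of whitespace, so no empty words appear.
    # Words that start with a non-letter (for example a digit) are skipped.
    return ''.join(w[0].upper() for w in text.split() if w[0].isalpha())

print(acronym_by_split('Data Science'))
```

Whether a word such as '3d' should be skipped or contribute a character is a design choice; here it is skipped.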
Another approach uses re.sub and a negative lookbehind:
>>> re.sub(r'(?<!\b).|\s','', some_words)
'DS'
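Note that this pattern only deletes characters and does not change case, so lowercase input stays lowercase unless .upper() is applied to the result. A small sketch of that follows; the \B spelling is offered as an equivalent way to write the lookbehind, which is worth checking against your own inputs.

```python
import re

some_words = 'data science'

# Delete every character that is not at a word boundary, plus all whitespace
kept = re.sub(r'(?<!\b).|\s', '', some_words)
print(kept)          # ds
print(kept.upper())  # DS

# \B asserts "not at a word boundary", which is what (?<!\b) expresses
kept_alt = re.sub(r'\B.|\s', '', some_words)
print(kept_alt.upper())
```

One caveat: punctuation that directly follows a word, such as the hyphen in 'data-science', sits on a word boundary and is therefore kept by this pattern.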
CodePudding user response:
Use
import re
some_words = 'Data Science'
all_words_select = r'\b(?![\d_])(\w)|.'
word_upper = re.sub(all_words_select, lambda z: z.group(1).upper() if z.group(1) else '', some_words, flags=re.DOTALL)
print(word_upper)
EXPLANATION
- Match a letter at the beginning of a word and capture it: \b(?![\d_])(\w)
- Else, match any character: |.
- Whenever the capture is not empty, replace it with its capital variant: z.group(1).upper()
- Else, remove the match: ''
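The steps above can be sketched with a named callback instead of a lambda. The names PATTERN, replace_match and make_acronym are hypothetical, chosen only for this illustration. The last lines also show, for comparison, why the attempt in the question duplicated the text: a plain string passed as the replacement is inserted as-is for every match.

```python
import re

# Same pattern as in the answer: capture a leading letter, or match any other character
PATTERN = re.compile(r'\b(?![\d_])(\w)|.', flags=re.DOTALL)

def replace_match(match):
    """Uppercase a captured first letter; delete every other match."""
    first_letter = match.group(1)  # None when the '.' alternative matched
    return first_letter.upper() if first_letter else ''

def make_acronym(text):
    return PATTERN.sub(replace_match, text)

print(make_acronym('Data Science'))               # DS
print(make_acronym('portable network graphics'))  # PNG
print(make_acronym('3d _model view'))             # V

# A plain string replacement is inserted for every match, so each first
# letter is swapped for the whole string 'DATA SCIENCE'
duplicated = re.sub(r'\b\w', 'DATA SCIENCE', 'Data Science')
print(duplicated)  # DATA SCIENCEata DATA SCIENCEcience
```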
Pattern:
--------------------------------------------------------------------------------
\b          the boundary between a word char (\w) and
            something that is not a word char
--------------------------------------------------------------------------------
(?!         look ahead to see if there is not:
--------------------------------------------------------------------------------
[\d_]       any character of: digits (0-9), '_'
--------------------------------------------------------------------------------
)           end of look-ahead
--------------------------------------------------------------------------------
(           group and capture to \1:
--------------------------------------------------------------------------------
\w          word characters (a-z, A-Z, 0-9, _; for str
            patterns in Python 3, also other Unicode
            letters and digits)
--------------------------------------------------------------------------------
)           end of \1
--------------------------------------------------------------------------------
|           OR
--------------------------------------------------------------------------------
.           any character, including \n here because
            re.DOTALL is set (otherwise any character
            except \n)
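The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch; the names ACRONYM_PATTERN and acronym are hypothetical.

```python
import re

ACRONYM_PATTERN = re.compile(
    r"""
    \b          # word boundary
    (?![\d_])   # the next character must not be a digit or an underscore
    (\w)        # capture the first word character as group 1
    |           # OR
    .           # any other character (including \n, because of re.DOTALL)
    """,
    re.VERBOSE | re.DOTALL,
)

def acronym(text):
    # Uppercase the captured letter; replace everything else with an empty string
    return ACRONYM_PATTERN.sub(lambda m: m.group(1).upper() if m.group(1) else '', text)

print(acronym('Data\nScience'))  # DS
```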