Regex for creating an acronym

I'm trying to build a function that creates an acronym using only regular expressions.

Example: Data Science = DS

I'm trying to do 3 steps:

Find the first letter of each word.
Convert each of those letters to uppercase.
Join the letters together.

Unfortunately, I get errors. To be clear, I need to solve this with regular expressions.

import re

some_words = 'Data Science'
all_words_select = r'(\b\w)'
word_upper = re.sub(all_words_select, some_words.upper(), some_words)
print(word_upper)

result: DATA SCIENCEata DATA SCIENCEcience

Why is the text duplicated? I expected to get: DATA SCIENCE

CodePudding user response:

You don't need regex for the problem as stated. You can split the string on spaces, take the first character of each word, convert it to uppercase, and join the results.

>>> ''.join(w[0].upper() for w in some_words.split(' '))
'DS'

You also need to handle special cases, such as a word that starts with a non-alphabetic character, with a check like if w[0].isalpha().
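
A minimal sketch of that guard, assuming words are separated by whitespace. The function name acronym_split is made up for illustration. It calls split() with no argument rather than split(' '), so repeated spaces do not produce empty strings (on which w[0] would raise an IndexError).

```python
def acronym_split(text):
    # split() with no argument splits on any run of whitespace
    # and never yields empty strings, so w[0] is always safe.
    words = text.split()
    # Keep only words whose first character is a letter, then upper-case it.
    return ''.join(w[0].upper() for w in words if w[0].isalpha())


print(acronym_split('Data Science'))
print(acronym_split('3D data   science'))
```

Words such as '3D' are skipped entirely here; whether that is the right behaviour depends on what the acronym is supposed to contain.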

Another approach uses re.sub with a negative lookbehind:

>>> re.sub(r'(?<!\b).|\s','', some_words)
'DS'
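
Note that this pattern only deletes characters, so each initial keeps its original case. A small sketch, assuming lowercase input should also give an uppercase acronym, chains .upper() onto the result. The function name acronym_strip is hypothetical.

```python
import re


def acronym_strip(text):
    # (?<!\b).  removes any character that is NOT at a word boundary,
    # \s        removes whitespace (which does sit at a boundary after a word).
    # What survives is the first character of each word.
    initials = re.sub(r'(?<!\b).|\s', '', text)
    return initials.upper()


print(acronym_strip('Data Science'))
print(acronym_strip('machine learning ops'))
```

Punctuation that sits at a word boundary, such as the hyphen in 'data-science', is not removed by this pattern, so it would need separate handling.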

CodePudding user response:

Use

import re

some_words = 'Data Science'
all_words_select = r'\b(?![\d_])(\w)|.'
word_upper = re.sub(all_words_select, lambda z: z.group(1).upper() if z.group(1) else '', some_words, flags=re.DOTALL)
print(word_upper)


EXPLANATION

Match a letter at the start of a word and capture it (\b(?![\d_])(\w)).
Otherwise, match any single character (|.).
When the capture group matched, replace the match with its uppercase form (z.group(1).upper()).
Otherwise, remove the match ('').
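
The steps above rely on passing a function, not a string, as the replacement argument of re.sub. That difference also seems to explain the duplicated output in the question: a string replacement is inserted unchanged for every match. A minimal sketch reusing the question's pattern (the variable names duplicated and capitalized are only for illustration):

```python
import re

some_words = 'Data Science'

# String replacement: the whole text 'DATA SCIENCE' is substituted
# for each match, i.e. once for 'D' and once for 'S'.
duplicated = re.sub(r'(\b\w)', some_words.upper(), some_words)

# Callable replacement: the function is called once per match and
# returns the replacement for that match only.
capitalized = re.sub(r'(\b\w)', lambda m: m.group(1).upper(), 'data science')

print(duplicated)
print(capitalized)
```

The callable form only upper-cases the initials; dropping the remaining characters is what the extra |. alternative in the pattern is for.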

Pattern:

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    [\d_]                    any character of: digits (0-9), '_'
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \w                       word characters (letters, digits, '_';
                             Unicode-aware for str patterns in Python 3)
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  .                        any character (including \n, because
                           re.DOTALL is set)
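
To see what the lookahead and the re.DOTALL flag contribute, here is a short sketch that wraps the same pattern in a function and tries it on a few inputs. The names acronym and keep_initial are hypothetical, and the sample strings are made up for illustration.

```python
import re

PATTERN = r'\b(?![\d_])(\w)|.'


def keep_initial(match):
    # group(1) is None when the '.' alternative matched instead.
    return match.group(1).upper() if match.group(1) else ''


def acronym(text):
    return re.sub(PATTERN, keep_initial, text, flags=re.DOTALL)


# Words starting with a digit or an underscore contribute nothing.
filtered = acronym('3D data _private science')

# With re.DOTALL the newline is consumed by '.', so it is removed.
with_dotall = acronym('data\nscience')

# Without the flag '.' cannot match '\n', so the newline is left in place.
without_dotall = re.sub(PATTERN, keep_initial, 'data\nscience')

print(filtered, with_dotall, repr(without_dotall))
```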