How to split numbers and text from a sentence?

I have a sentence that contains numbers (integers and floats), often fused with adjacent words. I want to separate the numbers from the text and rejoin everything into a sentence with single spaces.

The following gets part of the way there:

import re
str1 = "test1.25nb 5test .5NB 00.5my_test 5unit 5.6"
re.findall(r'\d*\.*\d+\.*\d*', str1)
re.split(r'\d*\.*\d+\.*\d*', str1)

However, I could not find a clean way to combine these into the result I want.

Input: str1="test1.25nb 5test .5NB 00.5my_test 5unit 5.6"

Expected output: "test 1.25 nb 5 test .5 NB 00.5 my_test 5 unit 5.6"

Thanks in advance.

CodePudding user response：

You can use

import re
str1 = "test1.25nb 5test .5NB 00.5my_test 5unit 5.6"
print( " ".join(re.split(r'\s*(\d*\.?\d+)\s*', str1)) )
# => test 1.25 nb 5 test .5 NB 00.5 my_test 5 unit 5.6

Or, directly using re.sub with strip() at the end:

print( re.sub(r'\s*(\d*\.?\d+)\s*', r' \1 ', str1).strip() )

The \s*(\d*\.?\d+)\s* regex matches:

\s* - zero or more whitespaces
(\d*\.?\d+) - captures into Group 1 zero or more digits, an optional . and one or more digits (because the group is capturing, these values are also present in the list returned by re.split)
\s* - zero or more whitespaces.
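
To make the role of Group 1 visible, here is a minimal sketch that prints the raw list returned by re.split. The helper name separate_numbers is hypothetical, not part of the answer above. Because the input ends with a number, the list ends with an empty string, so the sketch filters empty pieces before joining to avoid a trailing space.

```python
import re

# Same pattern as above: optional digits, an optional dot, one or more digits.
# Surrounding whitespace is consumed outside the capturing group.
NUMBER = re.compile(r'\s*(\d*\.?\d+)\s*')

def separate_numbers(text):
    """Hypothetical helper: put single spaces around every number in text."""
    # re.split keeps the captured numbers in the result. Empty strings show up
    # when a match sits at the very start or end of the input, so drop them.
    parts = [p for p in NUMBER.split(text) if p]
    return " ".join(parts)

str1 = "test1.25nb 5test .5NB 00.5my_test 5unit 5.6"
print(NUMBER.split(str1))       # raw pieces, including a trailing ''
print(separate_numbers(str1))   # test 1.25 nb 5 test .5 NB 00.5 my_test 5 unit 5.6
```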


CodePudding user response：

If you are not tied to regular expressions, this may be a bit easier to understand:

import string

str1 = "test1.25nb 5test .5NB 00.5my_test 5unit 5.6"
cnt = len(str1)
str2 = ""
numdigits = string.digits + "."

print(str1)
for i, c in enumerate(str1):
    str2 += c
    if i < cnt - 1:
        nextc = str1[i + 1]
        # Insert a space where a digit/dot meets a letter, in either order.
        if (c in numdigits and nextc in string.ascii_letters) or (c in string.ascii_letters and nextc in numdigits):
            str2 += " "
print(str2)

The basic logic is simple: for each character, peek at the next character and check whether the text switches between alphabetic and numeric. If it does, insert a space.
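
The same peek-ahead idea can be sketched as a reusable function. This is only one possible arrangement: the name space_at_boundaries and the use of zip to pair each character with its successor are my own choices, not part of the answer above.

```python
import string

NUMERIC = set(string.digits + ".")   # a dot counts as part of a number
LETTERS = set(string.ascii_letters)

def space_at_boundaries(text):
    """Hypothetical helper: insert a space wherever a letter meets a digit or dot."""
    out = []
    # Pair every character with the one after it. The padding space gives the
    # last character a neutral successor, so no index check is needed.
    for cur, nxt in zip(text, text[1:] + " "):
        out.append(cur)
        if (cur in NUMERIC and nxt in LETTERS) or (cur in LETTERS and nxt in NUMERIC):
            out.append(" ")
    return "".join(out)

str1 = "test1.25nb 5test .5NB 00.5my_test 5unit 5.6"
print(space_at_boundaries(str1))
# test 1.25 nb 5 test .5 NB 00.5 my_test 5 unit 5.6
```

One caveat of treating "." as numeric: a period between two letters is also split off, so "end.Next" becomes "end . Next". If the input contains ordinary punctuation, the regex approach may be the safer choice.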

Note that the built-in enumerate() function yields pairs of values: an index followed by the corresponding element of the iterable (here, each character of the string). This can simplify indexing within a loop.
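
As a quick illustration of what enumerate produces, using a short made-up string rather than the full input:

```python
word = "5NB"

# Each iteration unpacks one (index, character) pair.
for i, c in enumerate(word):
    print(i, c)

# Collected into a list, the pairs look like this:
pairs = list(enumerate(word))
print(pairs)  # [(0, '5'), (1, 'N'), (2, 'B')]
```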