How can I split strings so that only numbers are selected?

Time:02-15

Please excuse the change to this question.

I want to split a string (e.g. text1, text2), so that only numbers are output:

I tried the following:

import re

# example text1

text1 = " climb   -  95/ 85     0.18   low     -  4680"

split_text1 = re.split(" +", text1)

print(split_text1)

['', 'climb', '-', '95/', '85', '0.18', 'low', '-', '4680']

# example text2

text2 = "CD 3 TO   F TO GD   .80000E+02   .00000E+00   .00000E+00   .00000E+00 /"

split_text2 = re.split("  +", text2)

print(split_text2)

['CD 3 TO', 'F TO GD', '.80000E+02', '.00000E+00', '.00000E+00', '.00000E+00 /']

How can I get as result:

# split_text1 = ['95', '85', '0.18', '4680']

# split_text2 = ['3', '80.0', '0.0', '0.0', '0.0']

CodePudding user response:

Simply add a second space before the + in the pattern. This stops 95/ 85 from being split. If you want \n at the end of the last item, add text += "\n".

import re

text = " climb   -  95/ 85     0.18   low     -  4680"

text = "a " + text

text += "\n"

split_text = re.split("  +", text)

if split_text[0] == "a":
  split_text[0] = ""
else:
  split_text[0] = split_text[0][2:]

print(split_text)
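If the leading empty string and the trailing \n are not needed, a shorter variant is possible. This is only a sketch, not the code from the answer above: it assumes that stripping the text first is acceptable, which removes the need for the "a " prefix.

```python
import re

text = " climb   -  95/ 85     0.18   low     -  4680"

# Strip leading/trailing whitespace first, then split on runs of two or more spaces.
split_text = re.split("  +", text.strip())

print(split_text)
# ['climb', '-', '95/ 85', '0.18', 'low', '-', '4680']
```

Unlike the version above, this returns no empty first item and no newline on the last item.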

CodePudding user response:

First version of the question:

You can split on runs of at least two whitespace characters:

import re

text = " climb   -  95/ 85     0.18   low     -  4680"

split_text = re.split(r"\s{2,}", text)

print(split_text)
# [' climb', '-', '95/ 85', '0.18', 'low', '-', '4680']

This also works without regex, although it leaves stray spaces and empty strings:

text = " climb   -  95/ 85     0.18   low     -  4680"

split_text = text.split('  ')

print(split_text)
# [' climb', ' -', '95/ 85', '', ' 0.18', ' low', '', ' -', '4680']

With a little more manipulation, you can also strip the extra spaces (the empty strings remain):

text = " climb   -  95/ 85     0.18   low     -  4680"

split_text = list(map(lambda x: x.strip(), text.split('  ')))

print(split_text)
# ['climb', '-', '95/ 85', '', '0.18', 'low', '', '-', '4680']
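The empty strings left over in the last result can be filtered out as well. A minimal sketch, assuming empty pieces carry no meaning for you; the variable name parts is only illustrative:

```python
text = " climb   -  95/ 85     0.18   low     -  4680"

# Split on two spaces, strip each piece, and keep only the non-empty ones.
parts = [piece.strip() for piece in text.split("  ") if piece.strip()]

print(parts)
# ['climb', '-', '95/ 85', '0.18', 'low', '-', '4680']
```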

Revised question

You need to match numbers (\d in regex). Some are floats (so we need to match a single dot), and some use exponent notation (so we need to match E and its sign, + or -).

Something like this should be a good start:

import re

regex = r'[\d.E+-]+' # try to match `E` and negatives

text1 = " climb   -  95/ 85     0.18   low     -  4680"
text2 = "CD 3 TO   F TO GD   .80000E+02   .00000E+00   .00000E+00   .00000E+00 /"

results1 = re.findall(regex, text1)
# ['-', '95', '85', '0.18', '-', '4680']

results2 = re.findall(regex, text2)
# ['3', '.80000E+02', '.00000E+00', '.00000E+00', '.00000E+00']

This also matches a lone - with no digits after it, so we can be more specific about negative numbers:

import re

regex = r'-?\d+[\d.E+-]*'

text1 = " climb   -  95/ 85     0.18   low     -  4680"
text2 = "CD 3 TO   F TO GD   .80000E+02   .00000E+00   .00000E+00   .00000E+00 /"

results1 = re.findall(regex, text1)
# ['95', '85', '0.18', '4680']

results2 = re.findall(regex, text2)
# ['3', '80000E+02', '00000E+00', '00000E+00', '00000E+00']

You need to convert the exponent notation to a float. Again, a map should do it:

import re

regex = r'-?\d+[\d.E+-]*'

text1 = " climb   -  95/ 85     0.18   low     -  4680"
text2 = "CD 3 TO   F TO GD   .80000E+02   .00000E+00   .00000E+00   .00000E+00 /"

results1 = list(map(float, re.findall(regex, text1)))
# [95.0, 85.0, 0.18, 4680.0]

results2 = list(map(float, re.findall(regex, text2)))
# [3.0, 8000000.0, 0.0, 0.0, 0.0]

To get closer to the output you asked for, convert only the values that contain an exponent:

import re

regex = r'-?\d+[\d.E+-]*'

def transform(value):
    if 'E' in value:
        return str(float(value))

    return value

text1 = " climb   -  95/ 85     0.18   low     -  4680"
text2 = "CD 3 TO   F TO GD   .80000E+02   .00000E+00   .00000E+00   .00000E+00 /"

results1 = list(map(transform, re.findall(regex, text1)))
# ['95', '85', '0.18', '4680']

results2 = list(map(transform, re.findall(regex, text2)))
# ['3', '8000000.0', '0.0', '0.0', '0.0']

I just noticed that my regex misses the leading dot, which is why .80000E+02 came out as 8000000.0 instead of 80.0. A corrected version:

import re

regex = r'-?(?:\d*\.\d+|\d+)(?:E[+-]\d+)?'

def transform(value):
    if 'E' in value:
        return str(float(value))

    return value

text1 = " climb   -  95/ 85     0.18   low     -  4680"
text2 = "CD 3 TO   F TO GD   .80000E+02   .00000E+00   .00000E+00   .00000E+00 /"

results1 = list(map(transform, re.findall(regex, text1)))
# ['95', '85', '0.18', '4680']

results2 = list(map(transform, re.findall(regex, text2)))
# ['3', '80.0', '0.0', '0.0', '0.0']

To explain a little: -? means the number may start with a minus sign.

(?: ) is a non-capturing group; it groups the pattern without changing what findall returns.

\d*\.\d+ matches a dot followed by at least one digit, with optional digits before the dot.

| is a simple "or".

\d+ matches one or more digits.

(?:\d*\.\d+|\d+) puts these together: a non-capturing group that matches any float or any integer.

[+-] matches either + or -.

(?:E[+-]\d+)? is much the same: a non-capturing group that matches an E followed by + or - and then an integer. The trailing ? means the whole group may appear once or not at all.
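The same explanation can live inside the pattern itself by using re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch: the name NUMBER is illustrative, and it assumes the exponents in the data are written with an explicit sign, as in .80000E+02.

```python
import re

# Verbose form of the pattern: whitespace is ignored and "#" starts a comment.
NUMBER = re.compile(
    r"""
    -?                  # optional minus sign
    (?:\d*\.\d+|\d+)    # float (digits before the dot are optional) or integer
    (?:E[+-]\d+)?       # optional exponent: E, a sign, then digits
    """,
    re.VERBOSE,
)

text2 = "CD 3 TO   F TO GD   .80000E+02   .00000E+00   .00000E+00   .00000E+00 /"

matches = NUMBER.findall(text2)
print(matches)
# ['3', '.80000E+02', '.00000E+00', '.00000E+00', '.00000E+00']
```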

CodePudding user response:

You could use findall to get the numeric patterns and convert the strings to float or int:

import re

def getNums(S):
    # optional sign, digits with optional fraction (or a leading dot), optional exponent
    pattern = r"[+-]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[Ee][+-]?[0-9]+)?"
    result = []
    for part in re.findall(pattern, S):
        try:
            result.append(float(part))  # every match parses as a float
            result[-1] = int(part)      # keep an int instead when the text is a plain integer
        except ValueError:
            pass                        # int() failed, so the float stays
    return result

text = " climb   -  95/ 85     0.18   low     -  4680"
print(getNums(text))
# [95, 85, 0.18, 4680]

text2 = "CD 3 TO   F TO GD   .80000E+02   .00000E+00   .00000E+00   .00000E+00 /"
print(getNums(text2))
# [3, 80.0, 0.0, 0.0, 0.0]

I'm assuming you want the output to be all numeric values rather than a mix of reformatted strings and numerics.

Here's a breakdown of the expression:

  • [+-]? Optional leading sign
  • (?:[0-9]+\.?[0-9]*|\.[0-9]+) Mandatory central part (non-capturing group)
    • ...|... either start with a digit or with a decimal point
    • [0-9]+\.?[0-9]* start with digit(s), followed by an optional decimal point and optional fractional digits
    • \.[0-9]+ start with a decimal point followed by one or more digits (i.e. a decimal point with no digits on either side is not a number)
  • (?:[Ee][+-]?[0-9]+)? Optional exponent part (non-capturing group)
    • E or e to indicate start of exponent part
    • [+-]? optional sign of exponent
    • [0-9]+ mandatory exponent digits
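The int-or-float decision can also be written as a small helper, which some readers may find easier to follow than overwriting the last list item. This is a sketch using the same pattern as broken down above; the names to_number and get_nums are hypothetical, and it assumes the exponents in the data carry an explicit sign, as in .80000E+02.

```python
import re

PATTERN = re.compile(r"[+-]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[Ee][+-]?[0-9]+)?")

def to_number(token):
    """Return an int for plain integer text, otherwise a float."""
    try:
        return int(token)
    except ValueError:
        return float(token)

def get_nums(s):
    # Find every numeric-looking token and convert it.
    return [to_number(token) for token in PATTERN.findall(s)]

print(get_nums(" climb   -  95/ 85     0.18   low     -  4680"))
# [95, 85, 0.18, 4680]

print(get_nums("CD 3 TO   F TO GD   .80000E+02   .00000E+00   .00000E+00   .00000E+00 /"))
# [3, 80.0, 0.0, 0.0, 0.0]
```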