Extracting numbers from a string with a special structure with regular expressions

Time:05-06

I have a string with the structure

Resolution:  1200, Time: 16.255 (7.920 GFlop => 1487.23 MFlop/s, residual 0.007113, 500 iterations)

and I am trying to efficiently extract the floats from that string. The stream of strings does not contain only strings with this pattern, but others as well. Is there a way to get the numbers from this string with the re package?

I am looking for a way like this:

import re

a = "Resolution:  1200, Time: 16.255 (7.920 GFlop => 1487.23 MFlop/s, residual 0.007113, 500 iterations)"

nbrs = re.match("Resolution:  \d, Time: \d (\d GFlop => \d MFlop/s, residual \d, \d iterations)", a)

where \d is an identifier for an arbitrary float? Or is the easiest way of doing this just to strip the string multiple times and check for specific contents?

CodePudding user response:

Given what you said in your comment to me, I believe a more appropriate solution for your problem might be this:

import re

s = 'Resolution:  1200, Time: 16.255 (7.920 GFlop => 1487.23 MFlop/s, residual 0.007113, 500 iterations)'

pattern = re.compile(r"Resolution:  (?P<resolution>\d+), Time: (?P<time>\d+\.\d+) \((?P<gflops>\d+\.\d+) GFlop => (?P<mflops>\d+\.\d+) MFlop/s, residual (?P<residual>\d+\.\d+), (?P<iterations>\d+) iterations\)")

m = pattern.match(s)

Because of the named capture groups, you can get each value individually:

m = pattern.match(s)
print(m.group('resolution')) # 1200
print(m.group('time')) # 16.255
print(m.group('gflops')) # 7.920
# ...

But it won't match any string that isn't formatted exactly like the one you supplied. For example:

assert pattern.match("90234.12 °C on Core 12") is None
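
Because the stream also contains other kinds of lines, the same named-group pattern can double as a filter. The following is only a minimal sketch, assuming the lines arrive as an iterable of strings; the names `parse_line` and `INT_FIELDS` are hypothetical helpers introduced here for illustration.

```python
import re

# Same structure as the pattern above, split over several lines for readability.
PATTERN = re.compile(
    r"Resolution:  (?P<resolution>\d+), Time: (?P<time>\d+\.\d+) "
    r"\((?P<gflops>\d+\.\d+) GFlop => (?P<mflops>\d+\.\d+) MFlop/s, "
    r"residual (?P<residual>\d+\.\d+), (?P<iterations>\d+) iterations\)"
)

# Assumption: these two fields are integers, all others are floats.
INT_FIELDS = {"resolution", "iterations"}

def parse_line(line):
    """Return a dict of numeric fields, or None if the line has another format."""
    m = PATTERN.match(line)
    if m is None:
        return None
    return {
        name: int(value) if name in INT_FIELDS else float(value)
        for name, value in m.groupdict().items()
    }

lines = [
    "Resolution:  1200, Time: 16.255 (7.920 GFlop => 1487.23 MFlop/s, residual 0.007113, 500 iterations)",
    "90234.12 °C on Core 12",
]

# Keep only the lines that match the expected structure.
records = [rec for rec in map(parse_line, lines) if rec is not None]
print(records)
```

Lines in any other format simply yield `None` and are skipped, so no separate pre-check of the string is needed.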

CodePudding user response:

import re
s = 'Resolution:  1200, Time: 16.255 (7.920 GFlop => 1487.23 MFlop/s, residual 0.007113, 500 iterations)'

p = re.findall(r'\d+[\.]\d+|\d+', s)
print(p)

OUTPUT:

['1200', '16.255', '7.920', '1487.23', '0.007113', '500']
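
Note that `findall` returns strings and picks up numbers from every line, whatever its format. A small sketch of converting the tokens to numeric types follows; the helper name `extract_numbers` is hypothetical, and the pattern deliberately ignores signs and exponents, as the one above does.

```python
import re

# Decimal numbers first, then plain integers (no signs or exponents handled).
NUMBER = re.compile(r"\d+\.\d+|\d+")

def extract_numbers(text):
    """Return all numbers in text as int or float, in order of appearance."""
    return [float(tok) if "." in tok else int(tok) for tok in NUMBER.findall(text)]

s = 'Resolution:  1200, Time: 16.255 (7.920 GFlop => 1487.23 MFlop/s, residual 0.007113, 500 iterations)'
print(extract_numbers(s))  # [1200, 16.255, 7.92, 1487.23, 0.007113, 500]

# Lines with a different structure are not rejected; their numbers are returned too.
print(extract_numbers("90234.12 °C on Core 12"))  # [90234.12, 12]
```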

CodePudding user response:

The following splits a string into parts according to the numbers found in it. The parts always alternate between non-numerical and numerical ones. Many float formats are accepted, following the Stack Overflow answer cited in the code comment.

In the result:

  • elements [0::2] are non-numerical;
  • elements [1::2] are either int or float.

import re

# adapted from: https://stackoverflow.com/a/55592455/758174
float_pat = re.compile(r'([+-]?(?:\d+(?:[.]\d*)?(?:[eE][+-]?\d+)?|[.]\d+(?:[eE][+-]?\d+)?))')

def int_or_float(s):
    try:
        return int(s)
    except ValueError:
        return float(s)

def split_numerical(s):
    a = re.split(float_pat, s)
    a[1::2] = map(int_or_float, a[1::2])
    return a

On your string:

>>> s = 'Resolution:  1200, Time: 16.255 (7.920 GFlop => 1487.23 MFlop/s, residual 0.007113, 500 iterations)'
>>> split_numerical(s)
['Resolution:  ',
 1200,
 ', Time: ',
 16.255,
 ' (',
 7.92,
 ' GFlop => ',
 1487.23,
 ' MFlop/s, residual ',
 0.007113,
 ', ',
 500,
 ' iterations)']
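
If only the numbers are needed, the odd-indexed elements of the split can be taken directly. This is a sketch under the assumption that every value may be converted to `float`; the helper name `numbers_only` is hypothetical. It also shows two consequences of the broader pattern: exponent notation is recognised, and a hyphen directly before a digit is read as a minus sign.

```python
import re

# Same float pattern as above: optional sign, decimals, optional exponent.
float_pat = re.compile(r'([+-]?(?:\d+(?:[.]\d*)?(?:[eE][+-]?\d+)?|[.]\d+(?:[eE][+-]?\d+)?))')

def numbers_only(s):
    """Return just the numerical parts of s, converted to float."""
    parts = float_pat.split(s)  # the capturing group keeps the numbers in the result
    return [float(tok) for tok in parts[1::2]]

print(numbers_only("residual 7.1e-3, 500 iterations"))  # [0.0071, 500.0]

# Caveat: the hyphen in "Core-12" is treated as a sign.
print(numbers_only("Core-12"))  # [-12.0]
```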