REGEX Match number in a line with a keyword

Time:10-07

I tried many patterns, but cannot get the correct result.

I want to match only the floats when the line has the keyword range at the beginning. My trouble is that range can be followed by a colon with varying whitespace around it (":", ": ", " :", " : ", etc.), or by whitespace alone.

My best try is to use two patterns:

#1. (?i)(?<=range[: ])[:a-zA-Z0-9.$ -]+

#2. [0-9.]+

First I run the regex with pattern #1, then take the output of pattern #1 and run the regex one more time with pattern #2.

How can I do that in one single pattern? Thanks so much

One more thing: my code is Python

Input: range: $0.82 --> Expected output: 0.82

Input: range:0.82 --> Expected output: 0.82

Input: range:  0.82 - 0.85 --> Expected output: 0.82, 0.85

Input: range : 0.82 - 0.85 --> Expected output: 0.82, 0.85

Input: range   :  0.82 - 0.85 --> Expected output: 0.82, 0.85

Input: range 0.82   0.85 --> Expected output: 0.82, 0.85

CodePudding user response:

If you can use the Python regex module from PyPI, which supports variable-length lookbehind, you can get multiple occurrences:

(?<=^range\b[\s:$-\d.]*)\d+(?:\.\d+)?

Explanation

  • (?<= Positive lookbehind, assert that to the left is
    • ^range\b Match range at the start of the string, followed by a word boundary
    • [\s:$-\d.]* Optionally match all allowed chars that could be in between
  • ) Close the lookbehind assertion
  • \d+(?:\.\d+)? Match 1 or more digits with an optional decimal part


Example

import regex

strings = [
"range: $0.82",
"range:0.82",
"range:  0.82 - 0.85",
"range : 0.82 - 0.85",
"range   :  0.82 - 0.85",
"range 0.82   0.85"
]
pattern = r"(?<=^range\b[\s:$-\d.]*)\d+(?:\.\d+)?"

for s in strings:
    print(regex.findall(pattern, s))

Output

['0.82']
['0.82']
['0.82', '0.85']
['0.82', '0.85']
['0.82', '0.85']
['0.82', '0.85']
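
If installing the third-party regex module is not an option, a similar result can be sketched with the standard library re module, which does not accept the variable-length lookbehind above. This is an alternative under stated assumptions, not the answer's method: it first checks for the keyword, then collects every number on the line. The names KEYWORD, NUMBER and extract_range_numbers are hypothetical, and unlike the lookbehind it does not restrict which characters may sit between the keyword and the numbers.

```python
import re

# Hypothetical names; case-insensitive because the question used (?i).
KEYWORD = re.compile(r'range\b', re.IGNORECASE)  # keyword as a whole word
NUMBER = re.compile(r'\d+(?:\.\d+)?')            # integer or decimal number


def extract_range_numbers(line):
    """Return all numbers on the line, but only if it starts with 'range'."""
    if not KEYWORD.match(line):  # match() only succeeds at the start of the string
        return []
    return NUMBER.findall(line)


print(extract_range_numbers('range: $0.82'))         # ['0.82']
print(extract_range_numbers('range 0.82   0.85'))    # ['0.82', '0.85']
print(extract_range_numbers('total: 0.82'))          # []
```

Splitting the check and the extraction keeps each pattern simple, at the cost of two passes over the line.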

CodePudding user response:

You could avoid regex completely. Those lines are not difficult to parse.

def parse(line):
    if not line.startswith('range'):
        return
    line = line.replace(':',' ').replace('$','')
    for token in line.split():
        try:
            yield float(token)
        except ValueError:
            continue
            

input_data = ['range: $0.82',
              'range:0.82',
              'range:  0.82 - 0.85',
              'range : 0.82 - 0.85',
              'range   :  0.82 - 0.85',
              'range 0.82   0.85']

r = [list(i) for i in map(parse, input_data)]
print(r)

Output:

[[0.82], [0.82], [0.82, 0.85], [0.82, 0.85], [0.82, 0.85], [0.82, 0.85]]
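
One possible tightening of this approach, sketched here as an assumption rather than as part of the answer: startswith('range') also accepts lines such as 'ranges 3', and it is case-sensitive, whereas the question used (?i). The hypothetical helper parse_strict below requires the first word to be exactly range (in any case) and returns a list instead of a generator.

```python
def parse_strict(line):
    """Return the floats on a line whose first word is 'range' (any case)."""
    # Treat ':' as a separator and drop '$' so tokens split cleanly.
    tokens = line.replace(':', ' ').replace('$', '').split()
    if not tokens or tokens[0].lower() != 'range':
        return []
    values = []
    for token in tokens[1:]:
        try:
            values.append(float(token))
        except ValueError:
            pass  # skip separators such as '-'
    return values


print(parse_strict('Range : $0.82'))        # [0.82]
print(parse_strict('range 0.82 - 0.85'))    # [0.82, 0.85]
print(parse_strict('ranges 0.82'))          # []
```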

CodePudding user response:

This seems to work for me, though there are probably more efficient ways of doing it:

import re

input_data = ['range: $0.82',
              'range:0.82',
              'range:  0.82 - 0.85',
              'range : 0.82 - 0.85',
              'range   :  0.82 - 0.85',
              'range 0.82   0.85']

for line in input_data:
    # \. matches a literal decimal point; a bare . would match any character
    output = re.findall(r'(range)(\s*:?\s*[$]*)([0-9]*\.[0-9]*)(\s*-?\s*)([0-9]*\.[0-9]*)?', line)
    a = output[0][2]  # first number (third group)
    b = output[0][4]  # second number (fifth group), '' if absent
    print(f'Input: {line} --> Expected output: {a} , {b}')

OUTPUT:

Input: range: $0.82 --> Expected output: 0.82 , 
Input: range:0.82 --> Expected output: 0.82 , 
Input: range:  0.82 - 0.85 --> Expected output: 0.82 , 0.85
Input: range : 0.82 - 0.85 --> Expected output: 0.82 , 0.85
Input: range   :  0.82 - 0.85 --> Expected output: 0.82 , 0.85
Input: range 0.82   0.85 --> Expected output: 0.82 , 0.85

You could also add an if statement to check whether 'b' is empty, and control the output as required. However, I think the main thing you wanted was a single regex that extracts the two numbers in question (if available).
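
The check for an empty second number could look something like the sketch below. It is only an illustration: the names PATTERN and describe are hypothetical, the decimal point is escaped as \. so that it matches a literal dot only, and re.search is used so that a missing group comes back as None.

```python
import re

# Same five-group layout as above, with the decimal point escaped (assumption).
PATTERN = re.compile(
    r'(range)(\s*:?\s*[$]*)([0-9]*\.[0-9]*)(\s*-?\s*)([0-9]*\.[0-9]*)?'
)


def describe(line):
    """Format the numbers found on the line, omitting a missing second one."""
    m = PATTERN.search(line)
    if m is None:
        return f'Input: {line} --> no match'
    # Groups 3 and 5 hold the numbers; drop the second if it is None or empty.
    values = [v for v in m.group(3, 5) if v]
    return f"Input: {line} --> Output: {', '.join(values)}"


print(describe('range: $0.82'))        # Input: range: $0.82 --> Output: 0.82
print(describe('range 0.82   0.85'))   # Input: range 0.82   0.85 --> Output: 0.82, 0.85
```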

Regex statement explanation:

r'(range)(\s*:?\s*[$]*)([0-9]*\.[0-9]*)(\s*-?\s*)([0-9]*\.[0-9]*)?'

The decimal point is written as \. here because an unescaped . matches any character, not just a literal dot.

First Group: (range)

This puts 'range' into the first group.

Second Group: (\s*:?\s*[$]*)

  • \s* matches zero or more whitespace characters (before and after the colon)
  • :? matches an optional colon (:)
  • [$]* matches zero or more dollar signs ($)

Third Group: ([0-9]*\.[0-9]*)

  • [0-9]* matches zero or more digits
  • \. matches a literal decimal point
  • This is the group that captures the first number (0.82)

Fourth Group: (\s*-?\s*)

  • \s* matches zero or more whitespace characters
  • -? matches an optional hyphen

Fifth Group: ([0-9]*\.[0-9]*)?

  • [0-9]* matches zero or more digits
  • \. matches a literal decimal point
  • The ? at the end makes the group optional.
  • This is the group that holds the second number (0.85)

CodePudding user response:

You could use this regex to extract your data:

^\s*range\D*(\d+(?:\.\d+)?)(?:\D*(\d+(?:\.\d+)?))?

Regex explanation:

  • ^ : beginning of string
  • \s*range : asserts the string starts with range (possibly preceded by whitespace; if you don't want that, remove the \s*)
  • \D* : some number of non-digit characters
  • (\d+(?:\.\d+)?) : a number, captured in group 1
  • (?:\D*(\d+(?:\.\d+)?))? : an optional group of some non-digits followed by a number, captured in group 2

In Python:

import re

input_data = ['range: $0.82',
              'range:0.82',
              'range:  0.82 - 0.85',
              'range : 0.82 - 0.85',
              'range   :  0.82 - 0.85',
              'range 0.82   0.85']
results = [re.findall(r'^\s*range\D*(\d+(?:\.\d+)?)(?:\D*(\d+(?:\.\d+)?))?', d)[0] for d in input_data]
print(results)

Output:

[
 ('0.82', ''),
 ('0.82', ''),
 ('0.82', '0.85'),
 ('0.82', '0.85'),
 ('0.82', '0.85'),
 ('0.82', '0.85')
]
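
If floats are wanted rather than tuples padded with empty strings, a small wrapper along these lines might help. It is a sketch, not part of the answer: the names PATTERN and range_floats are hypothetical, and re.match is used so that an unmatched optional group is reported as None and lines without the keyword yield an empty list instead of an IndexError.

```python
import re

PATTERN = re.compile(r'^\s*range\D*(\d+(?:\.\d+)?)(?:\D*(\d+(?:\.\d+)?))?')


def range_floats(line):
    """Return the one or two numbers after 'range' as floats, or [] if none."""
    m = PATTERN.match(line)
    if m is None:
        return []
    # groups() yields None for the optional second number when it is absent.
    return [float(g) for g in m.groups() if g is not None]


print(range_floats('range: $0.82'))            # [0.82]
print(range_floats('range   :  0.82 - 0.85'))  # [0.82, 0.85]
print(range_floats('total: 0.82'))             # []
```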