Regular expression to extract number with hyphen

Time:09-29

The text is like "1-2years. 3years. 10years."

I want to get the result [(1,2),(3),(10)].

I am using Python.

I first tried r"([0-9]?)[-]?([0-9])years". It works well except for the case of 10. I also tried r"([0-9]?)[-]?([0-9]|10)years" but the result is still [(1,2),(3),(1,0)].

CodePudding user response:

This should work:

import re

st = '1-2years. 3years. 10years.'
result = [tuple(e for e in tup if e)
          for tup in re.findall(r'(?:(\d+)-(\d+)|(\d+))years', st)]
# [('1', '2'), ('3',), ('10',)]

The regex will look for either one number, or two separated by a hyphen, immediately prior to the word years. If we give this to re.findall(), it will give us the output [('1', '2', ''), ('', '', '3'), ('', '', '10')], so we also use a quick list comprehension to filter out the empty strings.

Alternatively, we could use r'(\d+)(?:-(\d+))?years' to much the same effect, which is closer to what you've already tried.

CodePudding user response:

You can use this pattern: (?:(\d+)-)?(\d+)years

See Regex Demo

Code:

import re

pattern = r"(?:(\d+)-)?(\d+)years"
text = "1-2years. 3years. 10years."
print([tuple(int(z) for z in x if z) for x in re.findall(pattern, text)])

Output:

[(1, 2), (3,), (10,)]

CodePudding user response:

Your attempt r"([0-9]?)[-]?([0-9])years" doesn't work for the case of 10 because you ask it to match one (or zero) digit per group.

You also don't need the hyphen in brackets.
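A quick way to see both points is to run the original pattern as-is. This is only a small sketch using the text from the question; the variable names are just for illustration.

```python
import re

text = "1-2years. 3years. 10years."

# The asker's original pattern: each group can hold at most one digit,
# so "10" is split across the two groups.
original = re.findall(r"([0-9]?)[-]?([0-9])years", text)
print(original)  # [('1', '2'), ('', '3'), ('1', '0')]

# A character class containing only a hyphen, [-], behaves like a bare -.
without_brackets = re.findall(r"([0-9]?)-?([0-9])years", text)
print(original == without_brackets)  # True
```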

This should work: Regex101

(\d+)(?:-(\d+))?years

Explanation:

  • (\d+): Capturing group for one or more digits
  • (?:...): Non-capturing group
  • - : hyphen
  • (\d+): Capturing group for one or more digits
  • (?:...)?: Make the previous non-capturing group optional

In python:

import re

result = re.findall(r"(\d+)(?:-(\d+))?years", "1-2years. 3years. 10years.")

# Gives: [('1', '2'), ('3', ''), ('10', '')]

Each tuple in the list contains two elements: The number on the left side of the hyphen, and the number on the right side of the hyphen. Removing the blank elements is quite easy: you loop over each item in result, then you loop over each match in this item and only select it (and convert it to int) if it is not empty.

final_result = [tuple(int(match) for match in item if match) for item in result]

# gives: [(1, 2), (3,), (10,)]

CodePudding user response:

You only match a single digit as the character class [0-9] is not repeated.

Another option is to match the first digits with an optional part for - and digits.

\b(\d+(?:-\d+)?)years\.
  • \b A word boundary
  • ( Capture group 1 (which will be returned by re.findall)
    • \d+(?:-\d+)? Match 1+ digits and optionally match - and again 1+ digits
  • ) Close group 1
  • years\. Match literally with the escaped .
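To illustrate what the \b and the trailing \. add, here is a small sketch on a made-up sample string (not from the question): a number glued to a preceding letter is skipped by the word boundary, and an entry without a closing period is skipped by the escaped dot.

```python
import re

# Hypothetical sample text, chosen only to exercise \b and \.
sample = "x12years. 3years. 4-5years"

# With \b, "12" is rejected because it directly follows the letter "x".
with_boundary = re.findall(r"\b(\d+(?:-\d+)?)years\.", sample)

# Without \b, "12" is accepted.
no_boundary = re.findall(r"(\d+(?:-\d+)?)years\.", sample)

print(with_boundary)  # ['3']
print(no_boundary)    # ['12', '3']
# "4-5years" is missing in both results because it has no trailing period.
```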

Regex demo

Then you can split the matches on -

import re

pattern = r"\b(\d+(?:-\d+)?)years\."
s = "1-2years. 3years. 10years."

res = [tuple(v.split('-')) for v in re.findall(pattern, s)]
print(res)

Output

[('1', '2'), ('3',), ('10',)]

Or if a list of lists is also ok instead of tuples

res = [v.split('-') for v in re.findall(pattern, s)]

Output

[['1', '2'], ['3'], ['10']]
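If integers are wanted, as in the question, the split parts can be converted as they are collected. This is a minimal sketch that assumes every match consists only of digits and hyphens, which the pattern guarantees.

```python
import re

pattern = r"\b(\d+(?:-\d+)?)years\."
s = "1-2years. 3years. 10years."

# Split each match on "-" and convert every part to int.
res = [tuple(int(n) for n in v.split("-")) for v in re.findall(pattern, s)]
print(res)  # [(1, 2), (3,), (10,)]
```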