regex unicode multiline problem in Python

Time:11-30

I have some strings containing Unicode characters, like the ones below:

رده سنی مجاز : 
 10.2-15.3
 8.71-9.13
 25.08 - 31.2

زده های سنی غیرمجاز:
 16.5-18.4
 9.15 - 10.02
 20.02-21.30

I want to match only the number ranges in the first block, like below:

10.2-15.3
8.71-9.13
25.08-31.2

and I'm using the following code:

print(re.findall('رده سنی مجاز :.*(.*\d+.\d+-\d+.\d+.*)', string, re.DOTALL))

but it returns:

['25.08-31.2']

CodePudding user response:

I suggest extracting all the lines that follow the fixed text, up to the first blank line, and then splitting the extracted part into separate lines:

import re
 
p = r"رده سنی مجاز :\s*\n(.+(?:\n.+)*)"
text = "رده سنی مجاز : \n 10.2-15.3\n 8.71-9.13\n 25.08 - 31.2\n\nزده های سنی غیرمجاز:\n 16.5-18.4\n 9.15 - 10.02\n 20.02-21.30"
m = re.search(p, text)
if m:
    print([x.strip() for x in m.group(1).splitlines()])

# => ['10.2-15.3', '8.71-9.13', '25.08 - 31.2']


Details:

  • رده سنی مجاز : - a fixed string
  • \s* - zero or more whitespace characters
  • \n - a newline
  • (.+(?:\n.+)*) - one or more non-empty lines captured into Group 1.
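The question asks for the ranges without spaces around the hyphen (25.08-31.2 rather than 25.08 - 31.2), so a second pass over Group 1 may help. The following is only a sketch: the helper name first_ranges and the range_re pattern are assumptions made for illustration, and it presumes every range is written as two plain decimal numbers separated by a hyphen.

```python
import re

text = (
    "رده سنی مجاز : \n 10.2-15.3\n 8.71-9.13\n 25.08 - 31.2\n\n"
    "زده های سنی غیرمجاز:\n 16.5-18.4\n 9.15 - 10.02\n 20.02-21.30"
)

# Step 1: capture the run of non-empty lines that follows the fixed header.
block_re = re.compile(r"رده سنی مجاز :\s*\n(.+(?:\n.+)*)")

# Step 2 (assumption): a range is "number-number", optionally with
# whitespace around the hyphen; each number may have a decimal part.
range_re = re.compile(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)")


def first_ranges(s):
    """Return the ranges of the first block, normalized to 'low-high'."""
    m = block_re.search(s)
    if m is None:
        return []
    # findall returns (low, high) tuples because range_re has two groups.
    return [f"{low}-{high}" for low, high in range_re.findall(m.group(1))]


print(first_ranges(text))
# => ['10.2-15.3', '8.71-9.13', '25.08-31.2']
```

Because the block is isolated first, the second header and its ranges never reach range_re, so no re.DOTALL flag is needed.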