I am trying to extract the comma delimited numbers inside () brackets from a string. I can get the numbers if that are alone in a line. But i cant seem to find a solution to get the numbers when other surrounding text is involved. Any help will be appreciated. Below is the code that I current use in python.
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
line = each.strip()
regex_criteria = r'"^([1-9][0-9]*|\([1-9][0-9]*\}|\(([1-9][0-9]*,?) [1-9][0-9]*\))$"gm'
if (line.__contains__('(') and line.__contains__(')') and not re.search('[a-zA-Z]', refline)):
refline = line[line.find('(') 1:line.find(')')]
if not re.search('[a-zA-Z]', refline):
CodePudding user response:
Remove the ^
, $
is whats preventing you from getting all the numbers. And gm
flags wont work in python re
.
You can change your regex to :([1-9][0-9]*|\([1-9][0-9]*\}|\(?:([1-9][0-9]*,?) [1-9][0-9]*\))
if you want to get each number separately.
Or you can simplify your pattern to (?<=[(,])[1-9][0-9] (?=[,)])
Test regex here: https://regex101.com/r/RlGwve/1
Python code:
import re
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
print(re.findall(r'(?<=[(,])[1-9][0-9] (?=[,)])', line))
# ['101065', '101066', '101067', '101065']
(?<=[(,])[1-9][0-9] (?=[,)])
The above pattern tells to match numbers which begin with 1-9 followed by one or more digits, only if the numbers begin with or end with either comma or brackets.
CodePudding user response:
Here's another option:
pattern = re.compile(r"(?<=\()[1-9] \d*(?:,[1-9]\d*)*(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?<=\()
: Lookbehind for(
[1-9] \d*
: At least one number (would\d
work too?)(?:,[1-9]\d*)*
: Zero or multiple numbers after a,
(?=\))
: Lookahead for)
Result for your line
:
[['101065', '101066', '101067'], ['101065']]
If you only want the comma separated numbers:
pattern = re.compile(r"(?<=\()[1-9] \d*(?:,[1-9]\d*) (?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?:,[1-9]\d*)
: One or more numbers after a,
Result:
[['101065', '101066', '101067']]
Now, if your line could also look like
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
then you have to sprinkle the pattern
with \s*
and remove the whitespace afterwards (here with str.translate
and str.maketrans
):
pattern = re.compile(r"(?<=\()\s*[1-9] \d*(?:\s*,\s*[1-9]\d*\s*)*(?=\))")
table = str.maketrans("", "", " ")
results = [match[0].translate(table).split(",") for match in pattern.finditer(line)]
Result:
[['101065', '101066', '101067'], ['101065']]
CodePudding user response:
Using the pypi regex module you could also use capture groups:
\((?P<num>\d )(?:,(?P<num>\d ))*\)
The pattern matches:
\(
Match(
(?P<num>\d )
Capture group, match 1 digits(?:,(?P<num>\d ))*
Optionally repeat matching,
and 1 digits in a capture group\)
Match)
Example code
import regex
pattern = r"\((?P<num>\d )(?:,(?P<num>\d ))*\)"
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
matches = regex.finditer(pattern, line)
for _, m in enumerate(matches, start=1):
print(m.capturesdict())
Output
{'num': ['101065', '101066', '101067']}
{'num': ['101065']}