I have a long document in which the line of interest starts with "Categories :". I want to find all the words separated by "," after "Categories :".
Here's an example line:
Categories : Turbo Prop , Very Light , Light , Mid Size
I want to find the start index and end index of each of Turbo Prop, Very Light, Light, and Mid Size.
I am using the following code:
import regex
regex_pattern = r"(?<=Categories : )([A-Za-z ]+(?:,)?)+"
matched_text = regex.search(regex_pattern, doc_tex)
But matched_text.groups() only gives Mid Size. In short, I want to find all occurrences of group 1 after Categories.
CodePudding user response:
Do it in two steps. First split the line using ":", then split the second part using ",".
category_string = line.split(':')[1]
categories = category_string.split(',')
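Since the question also asks for start and end indices, here is a minimal sketch of how the two-step split could be extended to recover them. The sample line is taken from the question; the names prefix, offset, and spans are made up for illustration, and the sketch assumes the line contains a colon and that categories never contain commas.

```python
line = "Categories : Turbo Prop , Very Light , Light , Mid Size"

# Step 1: split once on the first colon, keeping the prefix so we know its length.
prefix, category_string = line.split(":", 1)

# Step 2: split on commas, tracking where each piece begins in the original line.
offset = len(prefix) + 1  # index of the first character after the colon
spans = []
for piece in category_string.split(","):
    name = piece.strip()
    start = offset + piece.index(name)  # skip the leading whitespace of the piece
    spans.append((name, start, start + len(name)))
    offset += len(piece) + 1  # +1 for the comma that split() consumed

for name, start, end in spans:
    print(name, start, end)
```

The end index here is exclusive, so line[start:end] gives back the category name.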
CodePudding user response:
It looks like the comments answered the OP's question, but for completeness I thought I'd post the answer they discuss. It looks like Python's re module does not store all instances of a repeated capture group; see issue 7132. The regex package, however, adds additional methods to handle repeated capture groups, including:
- captures - Returns a list of the strings matched in a group or groups.
- starts - Returns a list of the start positions.
- ends - Returns a list of the end positions.
- spans - Returns a list of the spans. Compare with matchobject.span([group]).
Hence, using the regex package with the matchedobject.starts and matchedobject.ends methods should work.
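As a rough sketch of what that could look like (this assumes the third-party regex package is installed, e.g. via pip install regex; the pattern below is my own variation on the OP's, adjusted so that group 1 excludes the surrounding spaces and commas):

```python
import regex  # third-party package, not the standard-library re module

doc_text = "Categories : Turbo Prop , Very Light , Light , Mid Size"

# Group 1 is repeated once per category. The optional whitespace and comma
# sit outside the group, so each capture is just the category name.
pattern = r"(?<=Categories : )(?:\s*([A-Za-z]+(?: [A-Za-z]+)*)\s*,?)+"

m = regex.search(pattern, doc_text)
if m:
    names = m.captures(1)  # every string matched by group 1
    starts = m.starts(1)   # start index of each capture
    ends = m.ends(1)       # end index (exclusive) of each capture
    for name, start, end in zip(names, starts, ends):
        print(name, start, end)
```

By contrast, m.group(1) would still return only the last capture, which is the behaviour the OP ran into.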