Given a data stream, how do I extract the information (between \x15 and \x15\n) that comes right after a *PAR?
Here is the data stream.
'%gra:bla bla bla\n',
'*PAR:\tthe cat wants\n',
'\tcookies . \x159400_14000\x15\n',
'%mor:\tdet:art|the adj|cat\n',
'\tpro:rel|which aux|be&3S part|tip-PRESP coord|and pro:sub|he\n',
'\tv|want-3S n|cookie-PL .\n',
'%gra:\t1|3|DET 2|3|MOD 3|4|SUBJ 4|0|ROOT 5|4|JCT 6|7|DET 7|5|POBJ 8|10|LINK\n',
'\t9|10|AUX 10|7|CMOD 11|13|LINK 12|13|SUBJ 13|10|CJCT 14|13|OBJ 15|16|INF\n',
'\t16|14|XMOD 17|16|JCT 18|19|DET 19|17|POBJ 20|4|PUNCT\n',
'*PAR:\tcookies biscuit\n',
'\tis eating a cookie . \x1514000_18647\x15\n',
My output should be:
"9400_14000"
"14000_18647"
...
CodePudding user response:
Go over the data line by line and look for the desired pattern only on lines that follow a *PAR line:
import re
# i > 0 keeps data[i - 1] from wrapping around to the last line when i == 0
[re.search('\x15(.*)\x15', line).groups()[0]
 for i, line in enumerate(data) if i > 0 and '*PAR' in data[i - 1]]
This code raises an AttributeError if the pattern cannot be matched on a line that follows *PAR, because re.search returns None in that case. To get all valid matches use:
[match
 for i, line in enumerate(data) if i > 0 and '*PAR' in data[i - 1]
 for match in re.findall('\x15(.*?)\x15', line)]
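For reference, here is a minimal self-contained sketch of the same idea that can be run as-is. It pairs each line with the one before it using zip instead of indexing. The function name markers_after_par is made up for illustration, the sample is a shortened copy of the data from the question, and the tab after the second *PAR: is an assumption about the original data.

```python
import re

# Shortened copy of the sample from the question.
# Assumption: the second *PAR line uses a tab after the colon, like the first.
data = [
    '%gra:bla bla bla\n',
    '*PAR:\tthe cat wants\n',
    '\tcookies . \x159400_14000\x15\n',
    '%mor:\tdet:art|the adj|cat\n',
    '*PAR:\tcookies biscuit\n',
    '\tis eating a cookie . \x1514000_18647\x15\n',
]

def markers_after_par(lines):
    # Hypothetical helper: collects the text between two 0x15 characters
    # on every line that directly follows a line starting with *PAR.
    found = []
    # zip pairs each line with its successor, so no index arithmetic is needed
    for prev, line in zip(lines, lines[1:]):
        if prev.startswith('*PAR'):
            found.extend(re.findall('\x15(.*?)\x15', line))
    return found

print(markers_after_par(data))  # ['9400_14000', '14000_18647']
```

Because findall returns an empty list when nothing matches, a *PAR line followed by a line without a marker is simply skipped instead of raising.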
CodePudding user response:
I like to use this function:
import re

def get_between(string: str, start: str, end: str, with_brackets=True):
    # re.escape escapes all characters that have a special meaning in a regex
    new_start = re.escape(start)
    new_end = re.escape(end)
    # .*? is non-greedy, so each match stops at the first occurrence of end
    pattern = f"{new_start}.*?{new_end}"
    res = re.findall(pattern, string)
    if with_brackets:
        return res  # matches including the start and end delimiters
    else:
        return [x[len(start):-len(end)] for x in res]  # delimiters stripped
To use it in your example do this:
result = []
for i, string in enumerate(data_stream):
    if i > 0 and "*PAR" in data_stream[i - 1]:
        # += appends to the list; plain assignment would keep only the last match
        result += get_between(string, "\x15", "\x15\n", False)
print(result)
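Both answers assume the marker sits on the line immediately after *PAR. If an utterance can run over several continuation lines, a small state flag may be more robust. The following is only a sketch under two assumptions that the question does not confirm: continuation lines always start with a tab, and every other line starts a new tier. The name par_markers and the sample lines are invented for illustration. Unlike the code above, it also checks the *PAR line itself.

```python
import re

def par_markers(lines):
    # Hypothetical helper: returns the text between two 0x15 characters
    # for every *PAR utterance, including its tab-indented continuation lines.
    results = []
    in_par = False  # True while the current tier was opened by *PAR
    for line in lines:
        if not line.startswith('\t'):
            # Assumption: a line without a leading tab starts a new tier
            in_par = line.startswith('*PAR')
        if in_par:
            results.extend(re.findall('\x15(.*?)\x15', line))
    return results

# Made-up sample: the first utterance spans three lines, and the %mor tier
# carries a marker that should be ignored.
sample = [
    '*PAR:\tthe cat\n',
    '\twants many\n',
    '\tcookies . \x159400_14000\x15\n',
    '%mor:\tdet:art|the n|cat\n',
    '\tv|want-3S . \x15999_999\x15\n',
    '*PAR:\tis eating . \x1514000_18647\x15\n',
]
print(par_markers(sample))  # ['9400_14000', '14000_18647']
```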