Extract values at specific location in Python

Given a data stream, how do I extract the information (between \x15 and \x15\n) on the line that comes right after a *PAR line?

Here is the data stream.

'%gra:bla bla bla\n',
'*PAR:\tthe cat wants\n',
'\tcookies . \x159400_14000\x15\n',
'%mor:\tdet:art|the adj|cat\n',
'\tpro:rel|which aux|be&3S part|tip-PRESP coord|and pro:sub|he\n',
'\tv|want-3S n|cookie-PL .\n',
'%gra:\t1|3|DET 2|3|MOD 3|4|SUBJ 4|0|ROOT 5|4|JCT 6|7|DET 7|5|POBJ 8|10|LINK\n',
'\t9|10|AUX 10|7|CMOD 11|13|LINK 12|13|SUBJ 13|10|CJCT 14|13|OBJ 15|16|INF\n',
'\t16|14|XMOD 17|16|JCT 18|19|DET 19|17|POBJ 20|4|PUNCT\n',
'*PAR:\tcookies biscuit\n',
'\tis eating a cookie . \x1514000_18647\x15\n',

My output should be:

"9400_14000"
"14000_18647"
...

CodePudding user response：

Go over the data line by line and try to look for the desired pattern just on lines that follow *PAR:

import re
# i > 0 keeps data[i - 1] from wrapping around to the last line on the first iteration
[re.search('\x15(.*)\x15', line).group(1)
 for i, line in enumerate(data) if i > 0 and '*PAR' in data[i - 1]]

This code raises an AttributeError if the pattern does not match on a line that follows *PAR, because re.search returns None in that case. To collect only the valid matches, use:

# non-greedy .*? keeps two markers on one line from being merged into a single match
[match
 for i, line in enumerate(data) if i > 0 and '*PAR' in data[i - 1]
 for match in re.findall('\x15(.*?)\x15', line)]
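
For reference, here is a self-contained sketch of the same idea run on an abbreviated copy of the sample lines from the question. The function name extract_after_par and the variable name data are assumptions made for illustration, and the second *PAR line is written with \t on the assumption that it is formatted like the first one.

```python
import re

# Abbreviated sample lines from the question (assumed to be a list of strings).
data = [
    '%gra:bla bla bla\n',
    '*PAR:\tthe cat wants\n',
    '\tcookies . \x159400_14000\x15\n',
    '%mor:\tdet:art|the adj|cat\n',
    '*PAR:\tcookies biscuit\n',
    '\tis eating a cookie . \x1514000_18647\x15\n',
]

def extract_after_par(lines):
    """Return every \\x15...\\x15 payload on a line that directly follows a *PAR line."""
    return [
        match
        for prev, line in zip(lines, lines[1:])  # (previous line, current line) pairs
        if '*PAR' in prev
        for match in re.findall('\x15(.*?)\x15', line)
    ]

print(extract_after_par(data))  # ['9400_14000', '14000_18647']
```

Pairing each line with its predecessor through zip avoids index arithmetic, so the first line is never compared against the last one.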

CodePudding user response：

I like to use this function:

import re

def get_between(string: str, start: str, end: str, with_brackets=True):
    # re.escape escapes every character that has a special meaning in a regex
    new_start = re.escape(start)
    new_end = re.escape(end)
    pattern = f"{new_start}.*?{new_end}"
    res = re.findall(pattern, string)
    if with_brackets:
        return res  # matches including the start and end delimiters
    # strip the delimiters; len(x) - len(end) also works when end is empty
    return [x[len(start):len(x) - len(end)] for x in res]

To use it in your example, do this:

result = []
for i, string in enumerate(data_stream):
    if i > 0 and "*PAR" in data_stream[i-1]:
        # += accumulates the matches; plain assignment would keep only the last line's result
        result += get_between(string, "\x15", "\x15\n", False)
print(result)
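
If the marker is not guaranteed to sit on the line immediately after *PAR (for example, when an utterance wraps over several tab-indented continuation lines), one option is to track which tier the current line belongs to. The following is only a sketch under that assumption about the format: extract_par_times is a hypothetical name, the sample lines are made up for illustration, and unlike the loop above it also picks up a marker that appears on the *PAR line itself.

```python
import re

def extract_par_times(lines):
    """Collect \\x15...\\x15 payloads from *PAR lines and their tab-indented continuations."""
    results = []
    in_par = False
    for line in lines:
        if not line.startswith('\t'):
            # Assumption: a line without a leading tab opens a new tier (*PAR:, %mor:, %gra:, ...).
            in_par = line.startswith('*PAR')
        if in_par:
            results.extend(re.findall('\x15(.*?)\x15', line))
    return results

# Made-up lines: the utterance wraps twice, and a marker under %mor must be ignored.
sample = [
    '*PAR:\tthe cat wants\n',
    '\tmore\n',
    '\tcookies . \x159400_14000\x15\n',
    '%mor:\tv|want-3S n|cookie-PL .\n',
    '\t\x15not_par\x15\n',
]
print(extract_par_times(sample))  # ['9400_14000']
```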