Input:
ID aa
AA Homo sapiens
DR ac
BB ad
FT ae
//
ID ba
AA mouse
DR bc
BB bd
FT be
//
ID ca
AA Homo sapiens
DR cc
BB cd
FT ce
//
Expected output:
DR ac
FT ae
//
DR cc
FT ce
//
Code:
word = 'Homo sapiens'
with open(input_file, 'r') as txtin, open(output_file, 'w') as txtout:
    for block in txtin.read().split('//\n'): # reading a file in blocks
        if word in block: # extracted block containing the word, 'Homo sapiens'
            extracted_block = block + '//\n'
            for line in extracted_block.strip().split('\n'): # divide each block into lines
                if line.startswith('DR '):
                    dr = line
                elif line.startswith('FT '):
                    ft = line
I read input_file in blocks, splitting on '//'. If a block contains the word 'Homo sapiens', I extract it. Within each extracted block, the line starting with 'DR ' is stored as dr and the line starting with 'FT ' is stored as ft. How should I write the output using dr and ft to get the expected output?
CodePudding user response:
You can write a simple parser with a flag. In summary: when you reach an AA line that contains the word, set the flag to True so that the following fields of interest are kept; when you reach the end of a block, reset the flag.
word = 'Homo sapiens'
with open(input_file, 'r') as txtin, open(output_file, 'w') as txtout:
    keep = False
    for line in txtin:
        if keep and line.startswith(('DR', 'FT', '//')):
            txtout.write(line)
        if line.startswith('//'):
            keep = False # reset the flag on record end
        elif line.startswith('AA') and word in line:
            keep = True
Output:
DR ac
FT ae
//
DR cc
FT ce
//
NB. This requires the AA line to come before the fields you want to save. If it does not, you have to parse block by block (keeping each block's data in memory) with similar logic.
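A minimal sketch of that block-by-block variant, in case the AA line can appear after DR or FT. The function name `filter_records`, its parameters, and the `sample` string are hypothetical names made up for illustration; the sketch works on a string already read from the file and assumes records end with '//' on its own line.

```python
def filter_records(text, word='Homo sapiens', keep_prefixes=('DR', 'FT')):
    """Return the DR/FT lines of every record whose AA line contains `word`."""
    out = []
    for block in text.split('//\n'):
        lines = block.splitlines()
        if not lines:
            continue  # skip the empty chunk after the final '//'
        # Decide only after seeing the whole record, so AA may be anywhere in it.
        if any(l.startswith('AA') and word in l for l in lines):
            out.extend(l + '\n' for l in lines if l.startswith(keep_prefixes))
            out.append('//\n')  # re-add the record terminator
    return ''.join(out)


# Hypothetical sample in which AA comes *after* DR in the first record.
sample = (
    "ID aa\nDR ac\nAA Homo sapiens\nBB ad\nFT ae\n//\n"
    "ID ba\nAA mouse\nDR bc\nBB bd\nFT be\n//\n"
)
result = filter_records(sample)
print(result, end='')
```

To use it with files, read the input with txtin.read(), pass the string to the function, and write the returned string with txtout.write(). The trade-off is that the whole file is held in memory, unlike the line-by-line flag approach.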
CodePudding user response:
If you are open to a regex-based solution, one option is to read the entire file into a string and then use re.findall:
import re

with open(input_file, 'r') as file:
    inp = file.read()
matches = re.findall(r'(?<=//\n)ID.*?\nAA\s+Homo sapiens\n(DR\s+\w+\n)BB\s+\w+\n(FT\s+\w+\n//\n?)', '//\n' + inp)
for match in matches:
    for line in match:
        print(line, end='')
This prints:
DR ac
FT ae
//
DR cc
FT ce
//
Here is a demo showing that the pattern can correctly identify the blocks and DR/FT lines within each matching block.
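For a check that runs without any files, here is a small sketch that applies the same kind of pattern to the sample input from the question held in a string. The names `sample`, `pattern`, and `extracted` are only illustrative, and the pattern assumes every record has exactly the field order ID, AA, DR, BB, FT.

```python
import re

# Sample records from the question, as one string.
sample = (
    "ID aa\nAA Homo sapiens\nDR ac\nBB ad\nFT ae\n//\n"
    "ID ba\nAA mouse\nDR bc\nBB bd\nFT be\n//\n"
    "ID ca\nAA Homo sapiens\nDR cc\nBB cd\nFT ce\n//\n"
)

# Group 1 captures the DR line, group 2 the FT line plus the '//' terminator.
pattern = re.compile(
    r'(?<=//\n)ID.*?\nAA\s+Homo sapiens\n(DR\s+\w+\n)BB\s+\w+\n(FT\s+\w+\n//\n?)'
)

# Prepend '//\n' so the lookbehind also succeeds for the first record.
matches = pattern.findall('//\n' + sample)
extracted = ''.join(part for match in matches for part in match)
print(extracted, end='')
```

Because the field order is hard-coded in the pattern, a record with extra or reordered lines would be skipped silently; the flag-based parser is more forgiving in that respect.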