Sample of text file in a directory:
text1 = "
SomethingAAA SomethingAAA SomethingAAA SomethingAAA
SomethingBBB SomethingBBB SomethingBBB SomethingBBB
SomethingCCC SomethingCCC SomethingCCC SomethingCCC
SomethingDDD SomethingDDD SomethingDDD SomethingDDD
BlahBlah SomethingXXX BlahBlah BlahBlah BlahBlah BlahBlah
SomethingEEE SomethingEEE SomethingEEE SomethingEEE
SomethingFFF SomethingFFF SomethingFFF SomethingFFF
SomethingGGG SomethingGGG SomethingGGG SomethingGGG
BlahBlah BlahBlah BlahBlah SomethingYYY BlahBlah
SomethingGGG SomethingGGG SomethingGGG SomethingGGG
"
I have two regex patterns I use to identify the strings in the texts:
pattern1 = re.compile(r'\w+XXX')
pattern2 = re.compile(r'\w+YYY')
The goal is to save the lines containing the patterns plus the preceding line and following line in a new text file.
So the desired output would be:
newtext = "
SomethingDDD SomethingDDD SomethingDDD SomethingDDD
BlahBlah SomethingXXX BlahBlah BlahBlah BlahBlah BlahBlah
SomethingEEE SomethingEEE SomethingEEE SomethingEEE
SomethingGGG SomethingGGG SomethingGGG SomethingGGG
BlahBlah BlahBlah BlahBlah SomethingYYY BlahBlah
SomethingGGG SomethingGGG SomethingGGG SomethingGGG
"
Here is the relevant piece of my current code:
previous_line = deque()
for text_doc in text_docs:
    with open(text_doc, 'r') as f:
        for line in f:
            nextline = next(f).strip()
            prev_line.appendleft(line)
            with open(output, "a") as Results:
                if re.search(pattern1, line):
                    previous_line = "".join(previous_line.popleft())
                    found_pattern1 = previous_line + line + nextline
                    Results.write(f"\n\nInstance of pattern1: \n{found_pattern1}\n\n")
                elif re.search(pattern2, line):
                    previous_line = "".join(previous_line.popleft())
                    found_pattern2 = previous_line + line + nextline
                    Results.write(f"\n\nInstance of pattern2: \n{found_pattern2}\n\n")
            prev_line.clear()
What I'm getting, however, is:
newtext = "
BlahBlah SomethingXXX BlahBlah BlahBlah BlahBlah BlahBlah
BlahBlah SomethingXXX BlahBlah BlahBlah BlahBlah BlahBlah
SomethingEEE SomethingEEE SomethingEEE SomethingEEE
BlahBlah BlahBlah BlahBlah SomethingYYY BlahBlah
BlahBlah BlahBlah BlahBlah SomethingYYY BlahBlah
SomethingGGG SomethingGGG SomethingGGG SomethingGGG"
What is it that I'm doing wrong and what do I have to change to achieve my goal?
CodePudding user response:
You can join pattern1 and pattern2 with an alternation pattern, include the preceding and following lines with (?:.*\n)?, and use re.findall to find all matches:
import re

patterns = [r'\w+XXX', r'\w+YYY']
new_text = '\n'.join(re.findall(rf"(?:.*\n)?(?:.*(?:{'|'.join(patterns)}).*\n)+(?:.*\n)?", text1))
Demo: https://replit.com/@blhsing/RosybrownGargantuanAnalysts
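For readers who cannot open the demo, here is a minimal self-contained sketch of the same regex approach applied to the sample text from the question. The names sample_text and find_with_context are made up for this illustration and are not part of the original answer.

```python
import re

# Sample text from the question (the variable name is an assumption).
sample_text = """\
SomethingAAA SomethingAAA SomethingAAA SomethingAAA
SomethingBBB SomethingBBB SomethingBBB SomethingBBB
SomethingCCC SomethingCCC SomethingCCC SomethingCCC
SomethingDDD SomethingDDD SomethingDDD SomethingDDD
BlahBlah SomethingXXX BlahBlah BlahBlah BlahBlah BlahBlah
SomethingEEE SomethingEEE SomethingEEE SomethingEEE
SomethingFFF SomethingFFF SomethingFFF SomethingFFF
SomethingGGG SomethingGGG SomethingGGG SomethingGGG
BlahBlah BlahBlah BlahBlah SomethingYYY BlahBlah
SomethingGGG SomethingGGG SomethingGGG SomethingGGG
"""

patterns = [r'\w+XXX', r'\w+YYY']


def find_with_context(text, patterns):
    """Return each run of matching lines plus one line of context on each side."""
    alternation = '|'.join(patterns)
    block = re.compile(
        rf"(?:.*\n)?"                  # optional preceding line
        rf"(?:.*(?:{alternation}).*\n)+"  # one or more lines containing a pattern
        rf"(?:.*\n)?"                  # optional following line
    )
    # No capturing groups are used, so findall returns the full matched blocks.
    return block.findall(text)


blocks = find_with_context(sample_text, patterns)
print('\n'.join(blocks))
```

Two caveats with this sketch: every line, including the last one, must end with a newline to be matched, and because re.findall returns non-overlapping matches, a context line already consumed by one block is not repeated in the next block.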
CodePudding user response:
I managed to find a solution. Here it is in case it can help somebody:
import re
from collections import deque

output_lines = deque(maxlen=3)  # holds the previous, matching, and next line
for text_doc in text_docs:
    print(text_doc)
    with open(text_doc, 'r') as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if re.search(pattern1, line) or re.search(pattern2, line):
            try:
                # Caveat: lines[i-1] wraps around to the last line when i == 0.
                output_lines.extend([lines[i-1], line, lines[i+1]])
            except IndexError:
                pass  # EOF: the match is on the last line, so there is no next line
            with open("output.txt", "a") as Results:
                complete_output = ''.join(output_lines)
                Results.write('============================================\n')
                Results.write(f"\nLines with pattern: \n{complete_output}\n\n")
                Results.write('============================================\n')
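Indexing the neighbouring lines directly can misbehave at the edges of a file: index -1 refers to the last line, and an index past the end raises IndexError. As a possible alternative, here is a minimal sketch that uses list slicing, which clamps at both ends. The function name lines_with_context and its before/after parameters are hypothetical, chosen only for this illustration.

```python
import re


def lines_with_context(lines, patterns, before=1, after=1):
    """Return one list of lines per match: the match plus surrounding context."""
    compiled = [re.compile(p) for p in patterns]
    blocks = []
    for i, line in enumerate(lines):
        if any(p.search(line) for p in compiled):
            start = max(i - before, 0)  # clamp so a negative index cannot wrap around
            # A slice end beyond the list length is truncated, so no IndexError.
            blocks.append(lines[start:i + after + 1])
    return blocks


# Shortened stand-in for the lines returned by f.readlines().
sample_lines = [
    "SomethingDDD SomethingDDD\n",
    "BlahBlah SomethingXXX BlahBlah\n",
    "SomethingEEE SomethingEEE\n",
    "SomethingGGG SomethingGGG\n",
    "BlahBlah SomethingYYY BlahBlah\n",
]

blocks = lines_with_context(sample_lines, [r'\w+XXX', r'\w+YYY'])
for block in blocks:
    print(''.join(block))
```

In this shortened sample the second match sits on the last line, so its block contains only the preceding line and the match itself. Each block can be passed to ''.join(...) and written to the output file in the same way as above.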