I am using regex to extract certain strings from a PDF file and write them to an Excel file. The content of my PDF file is as follows:
Task 1: Question 1? answer1
Task 2: Question 2? (Format:****) answer2
Task 3: Question 3? answer3
Task 4: Question 4? (Format:*****) answer4
What I want to do is ignore the parts that say (Format:****). For the other lines my regex works fine. How can I do that, so that the Excel file contains only the questions and their answers?
Here is my code:
import re
import pandas as pd
from pdfminer.high_level import extract_pages, extract_text
text = extract_text("file.pdf")
pattern1 = re.compile(r":\s*(.*\?)")
pattern2 = re.compile(r".*\?\s*(.*)")
matches1 = pattern1.findall(text)
matches2 = pattern2.findall(text)
df = pd.DataFrame({'Soru-TR': matches1})
df['Cevap'] = matches2
writer = pd.ExcelWriter('Questions.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
writer.save()
CodePudding user response:
You can use a single pattern with 2 capture groups, and optionally match a part between parentheses after matching the question mark.
^[^:]*:\s*([^?]+\?)\s+(?:\([^()]*\)?\s)?(.*)
Explanation
^
Start of the string (or of each line when re.MULTILINE is used)
[^:]*:
Match any characters except : and then match :
\s*
Match optional whitespace characters
([^?]+\?)
Capture group 1: match 1+ characters other than ? and then match ?
\s+
Match 1+ whitespace characters
(?:\([^()]*\)?\s)?
Optionally match a part from an opening parenthesis up to the closing one, (...), followed by a whitespace character
(.*)
Capture group 2: match the rest of the line
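The breakdown above can also be kept next to the pattern itself. The following is a minimal sketch using re.VERBOSE, which ignores whitespace inside the pattern and allows comments; the name QA_PATTERN is only an illustrative choice and not part of the original answer.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments,
# so every part of the regex can be annotated in place.
QA_PATTERN = re.compile(
    r"""
    ^[^:]*:             # everything up to and including the first colon
    \s*                 # optional whitespace
    ([^?]+\?)           # group 1: the question, ending in a question mark
    \s+                 # 1+ whitespace characters
    (?:\([^()]*\)?\s)?  # optional parenthesised part plus one whitespace char
    (.*)                # group 2: the rest of the line (the answer)
    """,
    re.VERBOSE | re.MULTILINE,
)

line = "Task 2: Question 2? (Format:****) answer2"
m = QA_PATTERN.search(line)
print(m.group(1), "|", m.group(2))
```

Because the parenthesised part sits in a non-capturing group, it is consumed by the match but never appears in group 1 or group 2.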
Example code
import re

# Group 1 captures the question, group 2 the answer; the optional
# non-capturing group skips a "(Format:...)" part between them.
pattern = r"^[^:]*:\s*([^?]+\?)\s+(?:\([^()]*\)?\s)?(.*)"

s = ("Task 1: Question 1? answer1\n"
     "Task 2: Question 2? (Format:****) answer2\n"
     "Task 3: Question 3? answer3\n"
     "Task 4: Question 4? (Format:*****) answer4")

matches1 = []  # questions
matches2 = []  # answers
# re.MULTILINE lets ^ match at the start of every line
for match in re.finditer(pattern, s, re.MULTILINE):
    matches1.append(match.group(1))
    matches2.append(match.group(2))

print(matches1)
print(matches2)
Output
['Question 1?', 'Question 2?', 'Question 3?', 'Question 4?']
['answer1', 'answer2', 'answer3', 'answer4']
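To connect this back to the code in the question, the matching could be wrapped in a small helper that takes the extracted PDF text and returns the two columns. This is only a sketch: extract_qa is a hypothetical name, and the sample string stands in for the result of extract_text("file.pdf"), assuming the real text keeps one task per line as shown in the question.

```python
import re

# Same single pattern as in the example above, compiled once.
QA_PATTERN = re.compile(
    r"^[^:]*:\s*([^?]+\?)\s+(?:\([^()]*\)?\s)?(.*)", re.MULTILINE
)


def extract_qa(text):
    """Return (questions, answers) as two parallel lists (hypothetical helper)."""
    questions, answers = [], []
    for match in QA_PATTERN.finditer(text):
        questions.append(match.group(1))
        answers.append(match.group(2).strip())
    return questions, answers


# Stand-in for the string returned by extract_text("file.pdf")
sample = (
    "Task 1: Question 1? answer1\n"
    "Task 2: Question 2? (Format:****) answer2\n"
)
questions, answers = extract_qa(sample)
print(questions)
print(answers)
```

Since each question and its answer come from the same match, the two lists always have the same length, so they can be passed together to pd.DataFrame({'Soru-TR': questions, 'Cevap': answers}) and written with df.to_excel('Questions.xlsx', index=False), as in the question.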