Using regex to export data from a PDF file to Excel

I am using regex to extract certain strings from a PDF file and write them to an Excel file. The content of my PDF file is as follows:

Task 1: Question 1? answer1
Task 2: Question 2? (Format:****) answer2
Task 3: Question 3? answer3
Task 4: Question 4? (Format:*****) answer4

I want to ignore the parts that say (Format:****). For the other lines the regex works fine. How can I do that? The Excel file should look like this:

[Image: expected Excel output]

Here is my code:

import re
import pandas as pd
from pdfminer.high_level import extract_text

text = extract_text("file.pdf")

pattern1 = re.compile(r":\s*(.*\?)")
pattern2 = re.compile(r".*\?\s*(.*)")
matches1 = pattern1.findall(text)
matches2 = pattern2.findall(text)
df = pd.DataFrame({'Soru-TR': matches1})
df['Cevap'] = matches2
# ExcelWriter.save() was removed in pandas 2.0; the context manager saves the file on exit
with pd.ExcelWriter('Questions.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)

User response:

You can use a single pattern with 2 capture groups, and optionally match a part between parentheses after matching the question mark.

^[^:]*:\s*([^?]+\?)\s+(?:\([^()]*\)?\s)?(.*)

Explanation

^ Start of the string (or of each line when re.MULTILINE is set)
[^:]*: Match any chars except : and then match :
\s* Match optional whitespace chars
([^?]+\?) Capture group 1, match 1+ chars other than ? and then match ?
\s+ Match 1+ whitespace chars
(?:\([^()]*\)?\s)? Optionally match a part from an opening ( up to an optional closing ), followed by a whitespace char
(.*) Capture group 2, match the rest of the line
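
The breakdown above can also be kept next to the pattern itself. A minimal sketch using re.VERBOSE, assuming the quantifiers after [^?] and after the first \s following the question are both + (one or more); the sample line is taken from the question:

```python
import re

# The same pattern, spread over several lines with re.VERBOSE so that
# each part carries its own comment. Whitespace outside character
# classes is ignored in this mode.
pattern = re.compile(
    r"""
    ^[^:]*:              # everything up to and including the first colon
    \s*                  # optional whitespace
    ([^?]+\?)            # group 1: the question, ending at the first ?
    \s+                  # one or more whitespace chars
    (?:\([^()]*\)?\s)?   # optional parenthesised part plus one whitespace char
    (.*)                 # group 2: the rest of the line (the answer)
    """,
    re.VERBOSE | re.MULTILINE,
)

m = pattern.search("Task 2: Question 2? (Format:****) answer2")
print(m.group(1), "|", m.group(2))  # Question 2? | answer2
```

The non-capturing group (?:...) consumes the (Format:****) part without creating a third group, so group 2 always holds the answer.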

Example code

import re

pattern = r"^[^:]*:\s*([^?]+\?)\s+(?:\([^()]*\)?\s)?(.*)"

s = ("Task 1: Question 1? answer1\n"
            "Task 2: Question 2? (Format:****) answer2\n"
            "Task 3: Question 3? answer3\n"
            "Task 4: Question 4? (Format:*****) answer4")

# re.MULTILINE lets ^ match at the start of every line
matches1 = []  # questions (group 1)
matches2 = []  # answers (group 2)
for match in re.finditer(pattern, s, re.MULTILINE):
    matches1.append(match.group(1))
    matches2.append(match.group(2))

print(matches1)
print(matches2)

Output

['Question 1?', 'Question 2?', 'Question 3?', 'Question 4?']
['answer1', 'answer2', 'answer3', 'answer4']
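
To get from these lists back to the Excel file in the question, the two steps could be wrapped roughly as follows. This is only a sketch: extract_pairs and write_pairs are hypothetical helper names, the column names and the xlsxwriter engine are taken from the question's code, and pandas is imported inside write_pairs so the extraction part runs without it.

```python
import re

# One compiled pattern with two capture groups: question and answer.
PATTERN = re.compile(
    r"^[^:]*:\s*([^?]+\?)\s+(?:\([^()]*\)?\s)?(.*)", re.MULTILINE
)


def extract_pairs(text):
    """Return a list of (question, answer) tuples, one per matching line."""
    return [(m.group(1), m.group(2).strip()) for m in PATTERN.finditer(text)]


def write_pairs(pairs, path="Questions.xlsx"):
    """Write the pairs to an Excel sheet (needs pandas and xlsxwriter)."""
    import pandas as pd  # imported here so extract_pairs works without pandas

    df = pd.DataFrame(pairs, columns=["Soru-TR", "Cevap"])
    # The context manager saves and closes the file on exit.
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        df.to_excel(writer, sheet_name="Sheet1", index=False)


sample = ("Task 1: Question 1? answer1\n"
          "Task 2: Question 2? (Format:****) answer2\n")
pairs = extract_pairs(sample)
print(pairs)
```

With the PDF from the question, the text returned by extract_text("file.pdf") would take the place of sample, and write_pairs(pairs) would produce the spreadsheet.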