How to regex to match sentences in a pdf?

I have a PDF file and use PyPDF2 to extract text from it.

I'm looking for a regex to capture the numbered sentences in the PDF file.

I tried a couple of regexes in the code below, but the output is not what I need. I need to capture each numbered point as one sentence, like this:

Expected output:

['1. Please admit that Plaintiff, JOSHUA PINK, received benefits from a collateral
source, as defined by §768.76, Florida Statutes, for medical bills alleged to have been incurred as
a result of the incident described in the Complaint.',2. please.....]

Instead, the two regexes I tried either don't capture the full sentence or capture it across multiple lines, treating every \n as the start of a new sentence.

Extracted TEXT

" \n IN THE CIRCUIT COURT, OF THE \nEIGHTEENTH JUDICIAL CIRCUIT, IN \nAND FOR SEMINOLE COUNTY, \nFLORIDA  \n \nCASE NO: 2022 -CA-002235  \n \nJOSHUA PINK,  \n \n Plaintiff,  \nvs. \n \nMATHEW ZUMBRUM , \n \n Defendant.  \n                                                                      / \n \nDEFENDANT'S REQUEST FOR ADMISSIONS TO PLAINTIFF, JOSHUA PINK  \n \n \nCOME NOW the Defendant , MATHEW ZUMBRUM , by and through the undersigned \nattorneys, and pursuant to Rule 1.370, Florida Rul es of Civil Procedure, requests the Plaintiff, \nJOSHUA PINK, admit in this action that each of the following statements are true:  \n1. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral \nsource, as defined by §768.76, Florida Statute s, for medical bills alleged to have been incurred as \na result of the incident described in the Complaint.  \n2. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral \nsource, as defined by §768.76, Florida Statutes, for loss of wages o r income alleged to have been \nsustained as a result of the incident described in the Complaint.  \n3. Please admit that Plaintiff, JOSHUA PINK , received benefits under the Personal \nInjury Protection portion of an automobile policy for medical bills alleged to  have been incurred \nas a result of the incident described in the Complaint.  \n Filing # 162442429 E-Filed 12/06/2022 09:46:49 AM\n \n2 4. Please admit that Plaintiff, JOSHUA PINK , received benefits under the Personal \nInjury Protection portion of an automobile insurance policy for loss of wages or income alleged \nto have been sustained as a result of the incident described in the Complaint.  \n5. Please admit that Plaintiff, JOSHUA PINK , received benefits under the medical \npayments provisions of an automobile insurance policy for medical bills alleged to have been \nincurred as a result of the incident described in the Complaint.  \n6. 
Please admit that Plaintiff, JOSHUA PINK , is subject to a deductible under the \nPersonal Injury Protection portion of an automobile insurance policy.  \n7. Please admit that Plaintiff, JOSHUA PINK  received benefits pursuant to personal \nor group health insurance policy, for medical bills alleged to have been incurred as a result of the \nincident described in the Complaint.  \n8. Please admit that Plaintiff, JOSHUA  PINK , received benefits pursuant to a \npersonal or group wage continuation plan or policy, for loss of wages or income alleged to have \nbeen sustained as a result of the incident described in the Complaint.  \n 9. Please admit that on the date of the accident alleged in your Complaint, Defendant, \nMATHEW ZUMBRUM , complied with and met the security requirements under Chapter \n627.730 - 627.7405, Florida Statutes.  \n10. Please admit that Plaintiff, JOSHUA PINK , was partially responsible for the \nsubject accident.  \n11. Please admit that Plaintiff, JOSHUA PINK , did NOT  suffer a permanent injury as \na result of the subject accident.  \nI HEREBY CERTIFY that on the 6th day of December, 2022 a true and correct copy of \nthe foregoing was electronically filed with the Florida Court s E-Filing Portal system which will \n \n3 send a notice of electronic filing to Michael R. Vaughn, Esq., Morgan & Morgan, P.A., 20 N. \nOrange Ave, 16th Floor, Orlando, FL 32801 at [email protected]; \[email protected]; [email protected].  \nAND REW J. GORMAN & ASSOCIATES  \n \nBY: \n \n(Original signed electronically by Attorney.)  \nLOURDES CALVO -PAQUETTE, ESQ.  \nAttorney for Defendant, Zumbrum  \n390 N. Orange Avenue, Suite 1700  \nOrlando, FL 32801  \nTelephone:  (407) 872 -2498  \nFacsímile:  (855) 369 -8989  \nFlorida Bar No.  0817295  \nE-mail for service (FL R. Jud. Admin. 2.516) : \nflor.law [email protected]  \n \nAttorneys and Staff of Andrew J. 
Gorman & \nAssociates are Employees of the Law Department \nof State Farm Mutual Automobile Insurance \nCompany.  \n \n \n\n"

sample output of regex2 (sentence is captured in 2 lines)

[('2022', 'CA-002235 '),
 ('1', 'Florida Rul es of Civil Procedure, requests the Plaintiff,'),
 ('1',
  'Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral'),
 ('768',
  'Florida Statute s, for medical bills alleged to have been incurred as'),...]

sample output of regex1 (not capturing full sentence) 

['1. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral ',
 '2. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral ',
 '3. Please admit that Plaintiff, JOSHUA PINK , received benefits under the Personal ',
 '2 4. Please admit that Plaintiff, JOSHUA PINK , received benefits under the Personal ',
 '5. Please admit that Plaintiff, JOSHUA PINK , received benefits under the medical ',....]

code:

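import re
from PyPDF2 import PdfReader

def read_pdf(name):
    # PdfReader takes a path or a binary stream; it has no "rb" mode argument
    reader = PdfReader(name)
    text = ""
    for page in reader.pages:
        # append each page's text instead of overwriting the previous page
        text += page.extract_text() + "\n"

    #regex1 = r'(^[0-9].*)'
    regex2 = r'([\d]+).+?([a-zA-Z].+).'
    pat = re.compile(regex2, re.M)

    extracted_text = pat.findall(text)

    return text,extracted_text

# names is a list of PDF paths defined elsewhere
text,pdf1 = read_pdf(names[0])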

CodePudding user response：

If you want to match sentences followed by a dot, you might use:

\b\d+\.[^\S\n][^.]*(?:\.(?=\S)[^.]*)*\.

Explanation

\b A word boundary to prevent a partial word match
\d+\.[^\S\n] Match 1+ digits, a dot and a whitespace character other than a newline
[^.]*(?:\.(?=\S)[^.]*)* Match zero or more characters other than dots, then repeatedly match a dot only when a non-whitespace character follows it, followed by more non-dot characters
\. Match a dot

See a regex demo.

A pattern with more punctuation characters:

\b\d+\.[^\S\n][^.!?]*(?:[.!?](?=\S)[^.!?]*)*[.!?]

See another regex demo.
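
As a rough sketch of how the first pattern might be used from Python: the `sample` string below is a shortened stand-in for the extracted text in the question, and the whitespace-collapsing step is an added assumption (it is not part of the pattern) to turn each multi-line match into a single-line sentence.

```python
import re

# Shortened stand-in for the text PyPDF2 returned in the question.
sample = (
    "JOSHUA PINK, admit in this action that each of the following statements are true:  \n"
    "1. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral \n"
    "source, as defined by §768.76, Florida Statute s, for medical bills alleged to have been incurred as \n"
    "a result of the incident described in the Complaint.  \n"
    "2. Please admit that Plaintiff, JOSHUA PINK , was partially responsible for the \n"
    "subject accident.  \n"
)

pattern = re.compile(r"\b\d+\.[^\S\n][^.]*(?:\.(?=\S)[^.]*)*\.")

# The pattern has no capturing groups, so findall() returns the full matches.
# Collapse every run of whitespace (including newlines) into one space.
sentences = [" ".join(match.split()) for match in pattern.findall(sample)]

for sentence in sentences:
    print(sentence)
```

The dot in §768.76 does not end the match because it is followed by a non-whitespace character, while the dot after "Complaint" does.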

CodePudding user response：

I'll provide an answer to go over a couple of different patterns you can use to approach text items like that. Let's say you have a text that is structured like this:

test_str = """
Some preamble.
    1. Very
long
sentence.
    2. One-line sentence.
    3. Another
longer sentence.
A new paragraph.
"""

First scenario: you want to match items that begin with a number followed by a period at the beginning of a line (with optional leading space) and end with a period at the end of a line, irrespective of how many characters that takes, but as few as possible. That's what your question reads like. One pattern that describes this is ^[ \t]*\d+\.[\s\S]*?\.$. The heavy lifting here is done by [\s\S]*?, a lazily quantified class that matches any character (by including all whitespace and all non-whitespace) as few times as possible.

regex1 = re.compile(r"^[ \t]*\d+\.[\s\S]*?\.$", re.MULTILINE)
print(re.findall(regex1, test_str))

Which returns:

['    1. Very\nlong\nsentence.', '    2. One-line sentence.', '    3. Another\nlonger sentence.']

If you want to exclude the leading space, you could add a capturing group, ^[ \t]*(\d+\.[\s\S]*?\.)$, in which case findall() will only return the captured part. In Python:

regex2 = re.compile(r"^[ \t]*(\d+\.[\s\S]*?\.)$", re.MULTILINE)
print(re.findall(regex2, test_str))

Which returns:

['1. Very\nlong\nsentence.', '2. One-line sentence.', '3. Another\nlonger sentence.']

First scenario, alternative expression: after the leading number, express the match in terms of lines; always get the first line and add every following line as long as the preceding line does not end in a period: ^[ \t]*(\d+\..*(?:[^.]$\r?\n.*)*\.)$. This will be faster than the lazy class in the first example and returns the same as with regex2.

regex3 = re.compile(r"^[ \t]*(\d+\..*(?:[^.]$\r?\n.*)*\.)$", re.MULTILINE)
print(re.findall(regex3, test_str))

Second scenario: we don't care what the sentence(s) end in. Just get complete items, which we'll interpret as the leading number followed by all lines that do not start with another leading number or an entirely new paragraph: ^[ \t]*(\d+\..+$(?:\r?\n(?![ \t]*\d+\.|A new).*)*).

This makes use of a negative lookahead, (?![ \t]*\d+\.|A new), to prevent matching lines that start either with a new item number or with some non-item text, which allows more control over what kind of lines may constitute an item. Return values are the same.

regex4 = re.compile(r"^[ \t]*(\d+\..+$(?:\r?\n(?![ \t]*\d+\.|A new).*)*)", re.MULTILINE)
print(re.findall(regex4, test_str))
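
The question asks for each item as a single sentence, whereas the matches above still contain the original newlines. One possible follow-up step, sketched here as an assumption rather than as part of the patterns themselves, is to collapse whitespace in each match after matching. The name `item_re` is only illustrative.

```python
import re

test_str = """
Some preamble.
    1. Very
long
sentence.
    2. One-line sentence.
    3. Another
longer sentence.
A new paragraph.
"""

# Same expression as regex4: an item is the numbered line plus every
# following line that starts neither a new item nor the "A new" paragraph.
item_re = re.compile(
    r"^[ \t]*(\d+\..+$(?:\r?\n(?![ \t]*\d+\.|A new).*)*)", re.MULTILINE
)

# Replace each run of whitespace (including newlines) with one space.
items = [re.sub(r"\s+", " ", item).strip() for item in item_re.findall(test_str)]
print(items)
# ['1. Very long sentence.', '2. One-line sentence.', '3. Another longer sentence.']
```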

CodePudding user response：

Try this:

(\d+\.\s)(.|\n)*?(?=\d+\.\s|\z|\.\s)

This will match from any number followed by a period and a space to the end of the sentence (period followed by a space) or until the next number followed by a period and a space or the end of the string. See example here
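
A minimal sketch of trying this pattern in Python, under two assumptions that are not in the answer itself: Python's re module has traditionally spelled the end-of-string anchor \Z rather than \z, so \Z is swapped in here; and because the pattern contains capturing groups, finditer() with group(0) is used to get the whole match instead of findall(). The `text` value is a made-up sample.

```python
import re

# Made-up sample text with one multi-line item and one single-line item.
text = "Intro text.\n1. Very\nlong\nsentence.\n2. One-line sentence.\n"

# \Z replaces \z from the pattern above (assumption, see the note before this block).
pattern = re.compile(r"(\d+\.\s)(.|\n)*?(?=\d+\.\s|\Z|\.\s)")

# group(0) is the entire match; findall() would return tuples of the two groups.
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
# ['1. Very\nlong\nsentence', '2. One-line sentence']
```

Because the terminating period sits inside the lookahead, it is not part of the match, so it would need to be appended afterwards if the full sentence is wanted.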