I have a string pdf_text (below):
pdf_text = """ Account History Report
IMAGE All Notes
Date Created:18/04/2022
Number of Pages: 4
Client Code - 110203 Client Name - AWS PTE. LTD.
Our Ref :2118881115 Name: Sky Blue Ref 1 :12-34-56789-2021/2 Ref 2:F2021004444
Amount: $100.11 Total Paid:$0.00 Balance: $100.11 Date of A/C: 01/08/2021 Date Received: 10/12/2021
Last Paid: Amt Last Paid: A/C Status: CLOSED Collector : Sunny Jane
Date Notes
04/03/2022 Letter Dated 04 Mar 2022.
Our Ref :2112221119 Name: Green Field Ref 1 :98-76-54321-2021/1 Ref 2:F2021001111
Amount: $233.88 Total Paid:$0.00 Balance: $233.88 Date of A/C: 01/08/2021 Date Received: 10/12/2021
Last Paid: Amt Last Paid: A/C Status: CURRENT Collector : Sam Jason
Date Notes
11/03/2022 Email for payment
11/03/2022 Case Status
08/03/2022 to send a Letter
08/03/2022 845***Ringing, No reply
21/02/2022 Letter printed - LET: LETTER 2
18/02/2022 Letter sent - LET: LETTER 2
18/02/2022 845***Line busy
"""
I need to split the string on every line of the form Our Ref :Value Name: Value Ref 1 :Value Ref 2:Value, which marks the start of each data entity, so that I get each data entity in a separate string.
I used the regex pattern
data_entity_sep_pattern = r'(Our Ref.*?Name.*?Ref 1.*?Ref 2.*?)'
But the separators are not kept together with the lines that follow them when I split:
split_on_data_entity = re.split(data_entity_sep_pattern, pdf_text.strip())
This gives a result I did not expect. I expected split_on_data_entity[1] and split_on_data_entity[2] to be in one string, and split_on_data_entity[3] and split_on_data_entity[4] to be in one string.
I was referring to this answer, https://stackoverflow.com/a/2136580/10216112, which explains that parentheses retain the separator.
CodePudding user response:
"Expected was split_on_data_entity[1] and split_on_data_entity[2] be in one string"
The parentheses retain the string, but in a separate chunk.
If you want to keep the string, but have it as part of the next chunk, use a look-ahead, (?=...), instead of a capturing group.
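To make the difference concrete, here is a minimal sketch on a shortened, made-up string (not the real pdf_text; the sample text and variable names are assumptions for illustration only). It compares what re.split returns with a capturing group and with a look-ahead:

```python
import re

# Shortened stand-in for pdf_text (hypothetical sample data).
text = "header\nOur Ref :1 Name: A\nnote a\nOur Ref :2 Name: B\nnote b"

# Capturing group: the separator is kept, but as its own list element.
with_group = re.split(r'(Our Ref.*?Name)', text)
# ['header\n', 'Our Ref :1 Name', ': A\nnote a\n', 'Our Ref :2 Name', ': B\nnote b']

# Look-ahead: only the newline is consumed, so the "Our Ref" line
# stays at the start of the chunk that follows.
with_lookahead = re.split(r'\n(?=Our Ref.*?Name)', text)
# ['header', 'Our Ref :1 Name: A\nnote a', 'Our Ref :2 Name: B\nnote b']

print(with_group)
print(with_lookahead)
```

With the look-ahead, the first element is whatever precedes the first separator, and every later element starts with its own "Our Ref" line.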
Some other remarks:
You may also want to require that "Our Ref" occurs as the first set of letters on a line. While you are at it, you can let the split remove that newline character together with any white space that follows it.
There is no need to match .*? at the very end of your pattern.
As the text comes from a PDF, you may not want to be too strict about the number of spaces between words. You could use \s+ instead of a literal space.
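import re

# Split at each newline (plus any following white space) that precedes an "Our Ref ..." line.
# The look-ahead keeps that line as the start of the next chunk.
data_entity_sep_pattern = r'\n\s*(?=Our\s+Ref.*?Name.*?Ref\s+1.*?Ref\s+2)'
split_on_data_entity = re.split(data_entity_sep_pattern, pdf_text)
for section in split_on_data_entity:
    print(section)
    print("--------------------------")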
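If you would rather keep the capturing-group pattern from the question, another option is to glue each separator back onto the chunk that follows it. This is only a sketch under assumptions: split_keep_separator is a hypothetical helper name, the pattern must contain exactly one capturing group, and the sample string is a shortened stand-in for pdf_text.

```python
import re

def split_keep_separator(pattern, text):
    """Split on a pattern with exactly one capturing group and
    prepend each captured separator to the chunk that follows it."""
    parts = re.split(pattern, text)
    head, rest = parts[0], parts[1:]
    # rest alternates: separator, body, separator, body, ...
    entities = [sep + body for sep, body in zip(rest[0::2], rest[1::2])]
    return head, entities

# Hypothetical, shortened sample in the same layout as pdf_text.
sample = (
    "Report\n"
    "Our Ref :1 Name: A Ref 1 :x Ref 2:y\n"
    "01/01/2022 note\n"
    "Our Ref :2 Name: B Ref 1 :p Ref 2:q\n"
    "02/01/2022 note\n"
)

head, entities = split_keep_separator(r'(Our Ref.*?Name.*?Ref 1.*?Ref 2)', sample)
for entity in entities:
    print(entity)
    print("--------------------------")
```

Here head holds the report header and each item in entities starts with its own "Our Ref" line. The look-ahead version above is shorter; this variant is mainly useful if the separator pattern has to stay a capturing group for other reasons.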