re.split retain the separator with the next result string-CodePudding

I have a string pdf_text(below)

pdf_text = """                                                      Account History Report 
           IMAGE                                                                                                                                 All Notes 
                                                                                                                                                                          Date Created:18/04/2022 
                                                                                                                                                                          Number of Pages: 4 
Client Code - 110203     Client Name - AWS PTE. LTD. 
Our Ref :2118881115       Name: Sky Blue                                              Ref 1 :12-34-56789-2021/2             Ref 2:F2021004444 
  
Amount: $100.11             Total Paid:$0.00          Balance: $100.11      Date of A/C: 01/08/2021                 Date Received: 10/12/2021  
 
 
Last Paid:                         Amt Last Paid: A/C     Status: CLOSED                                                               Collector : Sunny Jane 
Date                                  Notes  
04/03/2022                              Letter Dated 04 Mar 2022. 
Our Ref :2112221119      Name: Green Field                                            Ref 1 :98-76-54321-2021/1           Ref 2:F2021001111  
 
Amount: $233.88            Total Paid:$0.00            Balance: $233.88       Date of A/C: 01/08/2021               Date Received: 10/12/2021  
 
Last Paid:                        Amt Last Paid:             A/C Status: CURRENT                                                     Collector : Sam Jason 
Date                                Notes  
11/03/2022                      Email for payment  
11/03/2022                     Case Status  
08/03/2022                     to send a Letter  
08/03/2022                     845***Ringing, No reply  
21/02/2022                     Letter printed - LET: LETTER 2  
18/02/2022                     Letter sent - LET: LETTER 2  
18/02/2022                     845***Line busy 
 """

I need to split the string on the line Our Ref :Value Name: Value Ref 1 :Value Ref 2:Value . Which is the start of every data entity below(in rectangles)

so that I get the squared entities(in above picture) in a different string.

I used the regex pattern

data_entity_sep_pattern = r'(Our Ref.*?Name.*?Ref 1.*?Ref 2.*?)'

But I don't see the separators being retained with the splitted lines.

split_on_data_entity = re.split(data_entity_sep_pattern, pdf_text.strip())

which gives me

which obviously was not expected. Expected was split_on_data_entity[1] and split_on_data_entity[2] be in one string and split_on_data_entity[3] and split_on_data_entity[4] to be in one string.

I was referring this answer https://stackoverflow.com/a/2136580/10216112 which explains parenthesis retains the string

CodePudding user response：

Expected was split_on_data_entity[1] and split_on_data_entity[2] be in one string

The parentheses retain the string, but in a separate chunk.

If you want to keep the string, but have it as part of the next chunk, use a look-ahead (?= )

Some other remarks:

You may also want to require that "Our ref" occurs as the first set of letters on a line. And when you are at it, you can remove such newline character, followed by optional white space.
There is no need to match .*? at the very end of your pattern
As the text comes from PDF, you maybe don't want to be too strict about the number of spaces between words. You could use \s .

data_entity_sep_pattern = r'\n\s*(?=Our\s Ref.*?Name.*?Ref\s 1.*?Ref\s 2)'
split_on_data_entity = re.split(data_entity_sep_pattern, pdf_text)

for section in split_on_data_entity:
    print(section)
    print("--------------------------")