Home > Net >  re.split retain the separator with the next result string
re.split retain the separator with the next result string

Time:06-18

I have a string pdf_text(below)

pdf_text = """                                                      Account History Report 
           IMAGE                                                                                                                                 All Notes 
                                                                                                                                                                          Date Created:18/04/2022 
                                                                                                                                                                          Number of Pages: 4 
Client Code - 110203     Client Name - AWS PTE. LTD. 
Our Ref :2118881115       Name: Sky Blue                                              Ref 1 :12-34-56789-2021/2             Ref 2:F2021004444 
  
Amount: $100.11             Total Paid:$0.00          Balance: $100.11      Date of A/C: 01/08/2021                 Date Received: 10/12/2021  
 
 
Last Paid:                         Amt Last Paid: A/C     Status: CLOSED                                                               Collector : Sunny Jane 
Date                                  Notes  
04/03/2022                              Letter Dated 04 Mar 2022. 
Our Ref :2112221119      Name: Green Field                                            Ref 1 :98-76-54321-2021/1           Ref 2:F2021001111  
 
Amount: $233.88            Total Paid:$0.00            Balance: $233.88       Date of A/C: 01/08/2021               Date Received: 10/12/2021  
 
Last Paid:                        Amt Last Paid:             A/C Status: CURRENT                                                     Collector : Sam Jason 
Date                                Notes  
11/03/2022                      Email for payment  
11/03/2022                     Case Status  
08/03/2022                     to send a Letter  
08/03/2022                     845***Ringing, No reply  
21/02/2022                     Letter printed - LET: LETTER 2  
18/02/2022                     Letter sent - LET: LETTER 2  
18/02/2022                     845***Line busy 
 """

I need to split the string on the line Our Ref :Value Name: Value Ref 1 :Value Ref 2:Value . Which is the start of every data entity below(in rectangles)

![enter image description here

so that I get the squared entities(in above picture) in a different string.

I used the regex pattern

data_entity_sep_pattern = r'(Our Ref.*?Name.*?Ref 1.*?Ref 2.*?)'

But I don't see the separators being retained with the splitted lines.

split_on_data_entity = re.split(data_entity_sep_pattern, pdf_text.strip())

which gives me

enter image description here

which obviously was not expected. Expected was split_on_data_entity[1] and split_on_data_entity[2] be in one string and split_on_data_entity[3] and split_on_data_entity[4] to be in one string.

I was referring this answer https://stackoverflow.com/a/2136580/10216112 which explains parenthesis retains the string

CodePudding user response:

Expected was split_on_data_entity[1] and split_on_data_entity[2] be in one string

The parentheses retain the string, but in a separate chunk.

If you want to keep the string, but have it as part of the next chunk, use a look-ahead (?= )

Some other remarks:

  • You may also want to require that "Our ref" occurs as the first set of letters on a line. And when you are at it, you can remove such newline character, followed by optional white space.

  • There is no need to match .*? at the very end of your pattern

  • As the text comes from PDF, you maybe don't want to be too strict about the number of spaces between words. You could use \s .

data_entity_sep_pattern = r'\n\s*(?=Our\s Ref.*?Name.*?Ref\s 1.*?Ref\s 2)'
split_on_data_entity = re.split(data_entity_sep_pattern, pdf_text)

for section in split_on_data_entity:
    print(section)
    print("--------------------------")
  • Related