Split the text between the start and end words from a long String using Regex Python

Time:01-31

I extracted a threaded email body using pywin32, and I need to extract only the signature part from the body. I tried to split the text using the start of the signature as the starting word and the header of the next email as the ending word. For example, 'With Regards' is the starting word and 'From' or 'Von' is the ending word.

Body of the email:

Dear Sir,
Your Order is ready and tested. It will be shipped shortly.
Let me know once you receive it.

With Regards,
dcabv,
vce technologies
vce.com
cont: 00440044


From: [email protected]
To:[email protected]
Sub: product

Dear Sir,
Can I get the update of my order?

With Regards,
abc
cont: 46346466

Von: [email protected]
Gesendet:[email protected]
Sub: Order Placed

Dear Sir,
your order has placed. 
you will receive it shortly.

With Regards,
dcabv,
vce technologies
vce.com
cont: 00440044

Another Text:

a = """
Best regards,

 

i.V. Cap. Mars Wel

Chief Superintendent

========================== 

P E N T   H O L E

Sahrts-HC

Elba 379

D- 259 Ham

Tel:     58 58 58585-584    

Mobile:  91 758 858 5875 

Fax:     47 47 85885-855

Sitz: Ham, HA 5772


 

 

 

Von: Korayae Vinay <[email protected]> 
Gesendet: Donnes, 19. Januar 2014 12:16
An: Wel, Mars <[email protected]>
Betreff: RE: Prod Order

 

Dear Donnes;

Good day

 
A few minutes before ı placed the order.

 
If you need any assistance we are happy to help with that.

 

 

Best Regards

 

Korayae Vinay

Managing Director
"""

Any suggestions?

The code I tried is below:

re.findall(r'(?:(?:With best regards,|Best Regards,)\s*.*(?:Von:|From:))', body, flags=re.IGNORECASE | re.DOTALL | re.MULTILINE)
Note: body is the body of the extracted email.

The Output I received:

With Regards,
dcabv,
vce technologies
vce.com
cont: 00440044


From: [email protected]
To:[email protected]
Sub: product

Dear Sir,
Can I get the update of my order?

With Regards,
abc
cont: 46346466

Von:

But I want the output split into two parts. The 1st output should run from 'With Regards' of the 1st email up to 'From'.

The 2nd output should run from 'With Regards' of the 2nd email up to 'Von'.

CodePudding user response:

Assuming that there is no empty line between the lines of a signature, you may use:

(?mi)^with(?:\s+best)?\s+regards.*(?:\n.+)+

RegEx Breakdown:

  • (?mi): Enable ignore-case and multiline modes
  • ^: Start of a line
  • with: Match the text with
  • (?:\s+best)?: Optionally match 1+ whitespace characters followed by best
  • \s+regards: Match 1+ whitespace characters followed by regards
  • .*: Match everything else until the end of the line
  • (?:\n.+)+: Match a line break followed by 1+ of any character. Repeat this group 1+ times
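
As a rough illustration of the breakdown above, the pattern can be passed to re.findall. This is only a sketch: it assumes the quantifiers in the pattern are "+", and the helper name extract_signatures, the variable names, and the example.com addresses are made up for the example.

```python
import re

# Pattern from the answer, with "+" quantifiers assumed.
SIGNATURE_RE = re.compile(r"(?mi)^with(?:\s+best)?\s+regards.*(?:\n.+)+")


def extract_signatures(body):
    """Return every signature block found in body (hypothetical helper)."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return SIGNATURE_RE.findall(body)


# Shortened sample thread; the addresses are placeholders.
sample_body = (
    "Dear Sir,\n"
    "Your Order is ready and tested. It will be shipped shortly.\n"
    "Let me know once you receive it.\n"
    "\n"
    "With Regards,\n"
    "dcabv,\n"
    "vce technologies\n"
    "vce.com\n"
    "cont: 00440044\n"
    "\n"
    "\n"
    "From: sender@example.com\n"
    "To: receiver@example.com\n"
    "Sub: product\n"
    "\n"
    "Dear Sir,\n"
    "Can I get the update of my order?\n"
    "\n"
    "With Regards,\n"
    "abc\n"
    "cont: 46346466\n"
    "\n"
    "Von: other@example.com\n"
)

signatures = extract_signatures(sample_body)
for signature in signatures:
    print(signature)
    print("---")
```

Each match stops at the first empty line, because .+ needs at least one character on every following line. A signature that contains empty lines would therefore be cut short by this pattern.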

Update:

Based on your edited question, you may use this regex:

(?mi)^(?:with\s+)?(?:best\s+)?regards.*(?:\n(?!von:|from:).*)+

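Here (?!von:|from:) is a negative lookahead that stops the match when von: or from: appears at the start of the next line. Because the following lines are matched with .* rather than .+, empty lines inside the signature are allowed.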
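
A minimal sketch of how the lookahead version might be used on a signature that contains empty lines. It assumes the quantifiers in the pattern are "+"; the helper name extract_signature_blocks and the shortened sample text are made up for the example.

```python
import re

# Updated pattern from the answer, with "+" quantifiers assumed.
SIGNATURE_BLOCK_RE = re.compile(
    r"(?mi)^(?:with\s+)?(?:best\s+)?regards.*(?:\n(?!von:|from:).*)+"
)


def extract_signature_blocks(body):
    """Return signature blocks, each ending before a Von:/From: line."""
    # strip() drops the trailing empty lines that .* is allowed to match.
    return [match.strip() for match in SIGNATURE_BLOCK_RE.findall(body)]


# Shortened version of the second sample text from the question.
sample_thread = (
    "Best regards,\n"
    "\n"
    "i.V. Cap. Mars Wel\n"
    "\n"
    "Chief Superintendent\n"
    "\n"
    "Von: Korayae Vinay\n"
    "Betreff: RE: Prod Order\n"
    "\n"
    "Dear Donnes;\n"
    "\n"
    "Best Regards\n"
    "\n"
    "Korayae Vinay\n"
    "\n"
    "Managing Director\n"
)

blocks = extract_signature_blocks(sample_thread)
for block in blocks:
    print(block)
    print("---")
```

The last signature in a thread has no Von: or From: line after it, so with this pattern it runs to the end of the text.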