Using RegEx in Python to extract contents

Time:02-21

Good evening,

I am very new to Python and RegEx. I have the following sentence:

-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16  100.00 3257 UpAmex Top PM 9:55  300.00 3257 UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57 (SGD) Amount Details Time here. appear will transactions cashless your All 2022 Feb 15 on made transactions GrabPay points 52 earned points Rewards 475.76 SGD spent Amount 0.24 SGD balance Wallet 2022 Feb 15 Summary statement daily your here

I would like to search for just '-' and the amount after that.

After that, I would like to skip two words and capture all the remaining words just before 'Paid' in a single group. (I will read more about groups, but for now I need them in a single group, which I can later split to get the individual words.)

For instance, I would get

-75.76 ASIA Direct to
-400 PTE AXS to

What would the regex be? Also, is there a good regex tutorial I can read?

CodePudding user response:

So far I have created one match with two groups: group 1 for the amount and group 2 for all the words (which also includes the "to " string).

Regex:

(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid

You can check the details here: https://regex101.com/r/eUMgdW/1

Python code:

import re
# your_input_string holds the sentence from the question
output = re.findall(r"(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid", your_input_string)

for found in output:
    print(found)

#('-75.76', 'ASIA DIRECT to ')
#('-400.00', 'PTE AXS to ')
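
Since the question mentions splitting the second group into words later, here is a minimal sketch of that follow-up step. It reuses the pattern from this answer; the shortened sample text and the names `text`, `pattern`, and `rows` are only illustrative assumptions.

```python
import re

# Shortened copy of the sentence from the question (illustrative only).
text = ("-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16  100.00 3257 "
        "UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57")

# Group 1: the negative amount. Group 2: the words just before "Paid".
pattern = re.compile(r"(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid")

rows = []
for amount, words in pattern.findall(text):
    # str.split() with no arguments also drops the trailing space in group 2.
    rows.append((float(amount), words.split()))

for row in rows:
    print(row)
```

Converting the amount with float() is optional; keep the string if you need the exact text, such as the trailing zeros in '-400.00'.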

CodePudding user response:

Rather than give you the actual regex, I'll gently nudge you in the right direction. It's more satisfying that way.

"Words" here are separated by spaces. So what you're searching for is a group of characters (captured), a space, characters again, a space, characters, a space, then capture everything and end with "Paid". Try to create a regex to do that.
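
If you get stuck, one possible reading of that description is sketched below. It is only one way to write it: the named groups `amount` and `words`, the use of re.VERBOSE, and the shortened sample text are assumptions made for illustration, not something taken from the question.

```python
import re

# Shortened copy of the sentence from the question (illustrative only).
text = ("-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16  100.00 3257 "
        "UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57")

# In VERBOSE mode plain whitespace is ignored, so a literal space is written as [ ].
words_before_paid = re.compile(r"""
    (?P<amount>-\d+(?:\.\d+)?)  # a minus sign and the amount (captured)
    [ ]\S+                      # a space, then the first word to skip
    [ ]\S+                      # a space, then the second word to skip
    [ ](?P<words>.*?)           # a space, then as little as possible (captured)
    [ ]Paid                     # a space, then the literal word Paid
""", re.VERBOSE)

results = [m.group("amount", "words") for m in words_before_paid.finditer(text)]
for amount, words in results:
    print(amount, words)
```

The lazy `.*?` stops at the first " Paid" after the two skipped words, so this sketch assumes every negative amount is eventually followed by "Paid", as in the sample sentence.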

If you'd like to brush up on regex, check out Regex101. It's a web tool to test out regex, along with a debugger and a cheat sheet.

CodePudding user response:

I agree with the first commenter, but I'm providing my take on it since this is actually a bit of a more complicated instance where you're trying to avoid words and then capture a second group. This is my on-the-fly solution that works:

import re

your_string = "-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16  100.00 3257 UpAmex Top PM 9:55  300.00 3257 UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57 (SGD) Amount Details Time here. appear will transactions cashless your All 2022 Feb 15 on made transactions GrabPay points 52 earned points Rewards 475.76 SGD spent Amount 0.24 SGD balance Wallet 2022 Feb 15 Summary statement daily your here"

# split the string on "Paid" and then strip whitespace
paid_split_string = [substring.strip() for substring in your_string.split("Paid")]

# find each payment amount plus the words that follow the two skipped words
captures = [re.findall(r'-\d.*', substring)[0] for substring in paid_split_string if re.findall(r'-\d.*', substring)]
for capture in captures:
    words = capture.split()
    print(words[0] + " " + " ".join(words[3:]))

RegExr is an amazing interactive website where you can test out patterns and also see what each special character does. I cannot recommend that website enough; please bookmark it for reference. If you have LinkedIn Learning, I highly recommend this course as well.
