Home > Software engineering >  How to select and store string values in pandas using regex
How to select and store string values in pandas using regex

Time:07-20

I want to select and store string values(regex) into variable. Here I want to extract USD, 1000.00, 21/01/2022 from Name and store into separate variables for amount = 1000.0, date = 21/01/2022 and currency = USD respectively.

import pandas as pd

data_dict = {
            'Name': ['I paid USD 1000.00 on 21/01/2022 for car service']
          }
df = pd.DataFrame(data_dict)

CodePudding user response:

If the order is consistent and the currency always right before the amount, a simple regex with str.extract should work:

df['Name'].str.extract(r'(?P<currency>[A-Z]{3})\s*(?P<amount>\d (?:\.\d )).*(?P<date>\d\d/\d\d/\d{4})')

output:

  currency   amount        date
0      USD  1000.00  21/01/2022
Notes
  • This won't validate the date (99/00/9876 would match)
  • This assumes the currency is 3 uppercase letters (you can easily adapt)
  • The type of all columns is string, if you plan to use the amount or date for operations you must convert them

variant: numbers with decimal separator

out = df['Name'].str.extract(r'(?P<currency>[A-Z]{3})\s*(?P<amount>\d (?:,\d{3})*(?:\.\d )).*(?P<date>\d\d/\d\d/\d{4})')

example:

  currency      amount        date
0      USD     1000.00  21/01/2022
1      USD  100,000.00  21/01/2022
  • Related