Extract a substring of all capital letters at the end of a string

I have text that looks like this:

Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD

Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD

Transfer #1234 received IBAN 00000 JOHN SMITH

I would like to extract the company name from each string. It is always written in capital letters at the end of the string and usually ends in LTD or CO, but sometimes it is a person's name instead, also in capital letters. The company name may contain '-'.

CodePudding user response：

You could try as follows:

import re

transfers = ['Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD',
 'Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD',
 'Transfer #1234 received IBAN 00000 JOHN SMITH']

pattern = r'[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30}\s(.*$)'

# [A-Z]{2}[0-9]{2}[A-Z0-9]{1,30} matches any IBAN-like string;
# it does not check that the string is a valid IBAN.

company_list = []

for t in transfers:
    m = re.search(pattern, t)
    if m is not None:
        company = m.group(1)
        company_list.append(company)
        
        # note that m.group(0).split(maxsplit=1) will get you the IBAN as well
        # e.g.: iban, company = m.group(0).split(maxsplit=1)
        # print(iban, company): NL10FRGS000000 FAKE COMPANY LTD
        
print(company_list)
['FAKE COMPANY LTD', 'FAKE-COMPANY 22 LTD']

Note that the last entry doesn't return a match, since 00000 does not match the IBAN pattern.
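
If the person's name in the last entry should be captured as well, one option is to anchor on the literal word IBAN instead of on the shape of the account number. The sketch below assumes every line has the form "... IBAN <account token> <NAME>", where the account token contains no spaces; the pattern and the helper name extract_name are illustrative and not part of the answer above.

```python
import re

# Assumption: each line looks like "... IBAN <account token> <NAME>".
# \S+ accepts any whitespace-free account token, including "00000".
name_pattern = re.compile(r'\bIBAN\s+\S+\s+(.+)$')

def extract_name(text):
    """Return everything after the account token, or None if nothing matches."""
    m = name_pattern.search(text)
    return m.group(1).strip() if m is not None else None

transfers = ['Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD',
             'Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD',
             'Transfer #1234 received IBAN 00000 JOHN SMITH']

names = [extract_name(t) for t in transfers]
print(names)
# ['FAKE COMPANY LTD', 'FAKE-COMPANY 22 LTD', 'JOHN SMITH']
```

The trade-off is that this version no longer checks what the account number looks like, so it relies entirely on the word IBAN being present.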

Update: "Since these transfers are in a pandas column, is it possible to do this without a for loop?" Yes, it can be done with Series.str.extract. There is no need to import re in this case.

import pandas as pd

transfers = ['Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD',
 'Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD',
 'Transfer #1234 received IBAN 00000 JOHN SMITH']

df = pd.DataFrame(transfers, columns=['Transfers'])

pattern = r'[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30}\s(.*$)'

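# expand=False returns a Series (one capturing group) rather than a one-column DataFrame
df['Company'] = df.Transfers.str.extract(pattern, expand=False)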

print(df['Company'])

0       FAKE COMPANY LTD
1    FAKE-COMPANY 22 LTD
2                    NaN
Name: Company, dtype: object

Or together with the IBAN:

df = pd.DataFrame(transfers, columns=['Transfers'])

# N.B. the pattern now has two capturing groups, one per output column
pattern = r'([A-Z]{2}[0-9]{2}[A-Z0-9]{1,30})\s(.*$)'

df[['IBAN', 'Company']] = df.Transfers.str.extract(pattern)

print(df[['IBAN', 'Company']])

             IBAN              Company
0  NL10FRGS000000     FAKE COMPANY LTD
1  NL10FRGS000000  FAKE-COMPANY 22 LTD
2             NaN                  NaN

CodePudding user response：

transfers = ['Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD',
             'Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD',
             'Transfer #1234 received IBAN 00000 JOHN SMITH']
companies = []
ibans = []
for transfer in transfers:
    # Only lines containing 'IBAN NL' are processed, so the third entry is skipped.
    if 'IBAN NL' in transfer:
        # The IBAN is the first space-delimited token after the word 'IBAN'.
        iban_data = transfer.partition('IBAN')[2].strip().partition(' ')[0]
        # Everything after the IBAN is taken as the company name.
        company_data = transfer.split(iban_data)[1].strip()
        ibans.append(iban_data)
        companies.append(company_data)
print(ibans)
print(companies)
>> ['NL10FRGS000000', 'NL10FRGS000000']
>> ['FAKE COMPANY LTD', 'FAKE-COMPANY 22 LTD']
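
The 'IBAN NL' check above skips any line whose account number does not start with NL, including the third one. A minimal sketch of the same partition-based idea without that restriction follows; split_transfer is a hypothetical helper name, and it assumes the account number is the first whitespace-delimited token after the word IBAN.

```python
def split_transfer(text):
    """Split a transfer line into (account, name).

    Returns None when the line does not contain 'IBAN '.
    """
    # partition returns an empty separator when 'IBAN ' is not found
    _, keyword, rest = text.partition('IBAN ')
    if not keyword:
        return None
    # The first token after the keyword is the account, the remainder is the name
    account, _, name = rest.strip().partition(' ')
    return account, name.strip()

transfers = ['Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD',
             'Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD',
             'Transfer #1234 received IBAN 00000 JOHN SMITH']

results = [split_transfer(t) for t in transfers]
print(results)
# [('NL10FRGS000000', 'FAKE COMPANY LTD'),
#  ('NL10FRGS000000', 'FAKE-COMPANY 22 LTD'),
#  ('00000', 'JOHN SMITH')]
```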