I have text that looks like this:
Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD
or
Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD
or
Transfer #1234 received IBAN 00000 JOHN SMITH
I would like to extract the company name from the string. It is always written in capital letters and ends with either LTD or CO, but sometimes it is a person's name instead, again written in capital letters at the end of the string. The company name may contain '-'.
CodePudding user response:
You could try as follows:
import re
transfers = ['Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD',
'Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD',
'Transfer #1234 received IBAN 00000 JOHN SMITH']
pattern = r'[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30}\s(.*$)'
# [A-Z]{2}[0-9]{2}[A-Z0-9]{1,30} will get any IBAN-like string,
# it's not necessarily a valid IBAN.
company_list = []
for t in transfers:
    m = re.search(pattern, t)
    if m is not None:
        company = m.group(1)
        company_list.append(company)
        # note that m.group(0).split(maxsplit=1) will get you the IBAN as well
        # e.g.: iban, company = m.group(0).split(maxsplit=1)
        # print(iban, company): NL10FRGS000000 FAKE COMPANY LTD
company_list
['FAKE COMPANY LTD', 'FAKE-COMPANY 22 LTD']
Note that the last entry doesn't return a match, since 00000 does not start with two capital letters followed by two digits, as the IBAN pattern requires.
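To also cover rows like the third one, where the account token is not IBAN-shaped, one option is to key on the literal word IBAN rather than on the shape of the account number. The following is only a minimal sketch: it assumes every line contains the keyword IBAN, followed by exactly one account token without spaces, followed by the name. The helper name `extract_name` is made up for illustration.

```python
import re

# Assumption: the literal keyword "IBAN" is followed by one whitespace-free
# account token, and everything after that token is the company/person name.
NAME_AFTER_IBAN = re.compile(r'IBAN\s+(\S+)\s+(.+)$')

def extract_name(text):
    """Return (account, name), or None if the line does not fit the assumed layout."""
    m = NAME_AFTER_IBAN.search(text)
    if m is None:
        return None
    return m.group(1), m.group(2).strip()

transfers = ['Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD',
             'Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD',
             'Transfer #1234 received IBAN 00000 JOHN SMITH']

for t in transfers:
    print(extract_name(t))
```

The trade-off is that this no longer checks that the account token looks like an IBAN at all, so it would accept any token that follows the keyword.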
Update: "Since these transfers are in a pandas column is it possible to be done without for loop?" Yes, this can be done with the pandas str.extract method. There is no need to import re in this case.
import pandas as pd
transfers = ['Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD',
'Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD',
'Transfer #1234 received IBAN 00000 JOHN SMITH']
df = pd.DataFrame(transfers, columns=['Transfers'])
pattern = r'[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30}\s(.*$)'
df['Company'] = df.Transfers.str.extract(pattern)
print(df['Company'])
0       FAKE COMPANY LTD
1    FAKE-COMPANY 22 LTD
2                    NaN
Name: Company, dtype: object
Or together with the IBAN:
df = pd.DataFrame(transfers, columns=['Transfers'])
# N.B. two capturing groups here in pattern
pattern = r'([A-Z]{2}[0-9]{2}[A-Z0-9]{1,30})\s(.*$)'
df[['IBAN', 'Company']] = df.Transfers.str.extract(pattern)
print(df[['IBAN', 'Company']])
             IBAN              Company
0  NL10FRGS000000     FAKE COMPANY LTD
1  NL10FRGS000000  FAKE-COMPANY 22 LTD
2             NaN                  NaN
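If the third row should yield JOHN SMITH instead of NaN, the pattern can be anchored on the keyword IBAN rather than on an IBAN-shaped token. This is a sketch under the assumption that every row has the form "... IBAN <account token> <name>". The group names `Account` and `Name` are chosen here for illustration; `str.extract` uses named groups as the column names of its result.

```python
import pandas as pd

transfers = ['Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD',
             'Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD',
             'Transfer #1234 received IBAN 00000 JOHN SMITH']
df = pd.DataFrame(transfers, columns=['Transfers'])

# Assumption: the keyword "IBAN" is followed by one account token without
# spaces, and the rest of the string is the name.
pattern = r'IBAN\s+(?P<Account>\S+)\s+(?P<Name>.+)$'

# Named groups become the columns of the extracted DataFrame.
extracted = df['Transfers'].str.extract(pattern)
df = df.join(extracted)
print(df[['Account', 'Name']])
```

Rows that do not contain the keyword at all would still come back as NaN in both columns.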
CodePudding user response:
transfers = ['Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD',
'Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD',
'Transfer #1234 received IBAN 00000 JOHN SMITH']
company = []
Iban = []
for i in transfers:
    if 'IBAN NL' in i:
        # assuming the IBAN and the company name are separated by a space
        iban_data = i.partition('IBAN')[2].strip().partition(' ')[0]
        company_data = i.split(iban_data)[1].strip()
        Iban.append(iban_data)
        company.append(company_data)
print(Iban)
print(company)
>> ['NL10FRGS000000', 'NL10FRGS000000']
>> ['FAKE COMPANY LTD', 'FAKE-COMPANY 22 LTD']
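The loop above skips the JOHN SMITH line because it only accepts accounts starting with NL. A possible variation, sketched here with the same `str.partition` idea, drops that check and splits on the keyword alone. It assumes the keyword IBAN appears once per line and that the account token contains no spaces; the function name `split_transfer` is hypothetical.

```python
def split_transfer(text, keyword='IBAN'):
    """Split a transfer line into (account, name) using plain string methods."""
    # partition returns an empty separator when the keyword is absent
    _, sep, tail = text.partition(keyword)
    if not sep:
        return None
    # first token after the keyword is the account, the remainder is the name
    account, _, name = tail.strip().partition(' ')
    return account, name.strip()

transfers = ['Transfer to account IBAN NL10FRGS000000 FAKE COMPANY LTD',
             'Transfer to account IBAN NL10FRGS000000 FAKE-COMPANY 22 LTD',
             'Transfer #1234 received IBAN 00000 JOHN SMITH']

pairs = [split_transfer(t) for t in transfers]
print(pairs)
```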