Match company names and symbols with regex

I have the following sentences and need to extract the name of the company along with its symbol.

So far I have tried ([A-Z][a-z]*)(\s)([A-Z]{1,5}), but it does not match correctly when the name consists of several capitalized words (British Defence Industry Directory, Goldman Sachs) or when the first word of the company name is all capital letters (BDEC Limited).

Company British Defence Industry Directory BDEC sells stuff.
Company BDEC Limited BDEC sells stuff.
The company BDEC Limited BDEC sells stuff.
The company BDEC BDEC sells stuff.
The tech company Apple AAPL sells stuff.
The payments company Visa V sells stuff.
Customers are not happy with Goldman Sachs GS.
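
To make the problem concrete, here is a minimal sketch that runs the attempted pattern against three of the sentences above. The names `attempt`, `samples` and `found` are only illustrative. The pattern allows exactly one capitalized word before the symbol, so it either stops after the first letter of the next word or treats the wrong word as the name.

```python
import re

# The pattern from the question, unchanged.
attempt = re.compile(r"([A-Z][a-z]*)(\s)([A-Z]{1,5})")

samples = [
    "The tech company Apple AAPL sells stuff.",
    "Customers are not happy with Goldman Sachs GS.",
    "Company BDEC Limited BDEC sells stuff.",
]

# Collect the first match in each sentence (None if nothing matches).
found = []
for sentence in samples:
    m = attempt.search(sentence)
    found.append(m.group(0) if m else None)

print(found)
# ['Apple AAPL', 'Goldman S', 'Company BDEC']
```

Only the single-word name (Apple) comes out right; the other two are cut short or anchored on the wrong word.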

CodePudding user response：

I assume the name and symbol always appear after the word "company" and before "sells", or after "with" up to the end of the sentence. Under that assumption, this will do the trick.

import re
s='''
    Company British Defence Industry Directory BDEC sells stuff.
    Company BDEC Limited BDEC sells stuff.
    The company BDEC Limited BDEC sells stuff.
    The company BDEC BDEC sells stuff.
    The tech company Apple AAPL sells stuff.
    The payments company Visa V sells stuff.
    Customers are not happy with Goldman Sachs GS.
'''
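# Capture the text between "company" and "sells", or between "with" and the next period.
pattern = r'(?i)company(.*?)sells|with(.*?)\.'
# findall returns a tuple of both groups per match (one is empty), so join them.
print(["".join(x) for x in re.findall(pattern, s)])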

Output:

[' British Defence Industry Directory BDEC ', ' BDEC Limited BDEC ', ' BDEC Limited BDEC ', ' BDEC BDEC ', ' Apple AAPL ', ' Visa V ', ' Goldman Sachs GS']
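
The captured strings still carry surrounding spaces and hold the name and symbol together. A possible follow-up, sketched here under the assumption that the symbol is always the last space-separated token, is to strip each capture and split it once from the right. The `pairs` list is a hypothetical name used for illustration.

```python
import re

s = """
Company British Defence Industry Directory BDEC sells stuff.
The payments company Visa V sells stuff.
Customers are not happy with Goldman Sachs GS.
"""

pattern = r"(?i)company(.*?)sells|with(.*?)\."

pairs = []
for groups in re.findall(pattern, s):
    # Only one of the two groups is non-empty for any given match.
    captured = "".join(groups).strip()
    # Assumption: the symbol is the last token, the name is everything before it.
    name, symbol = captured.rsplit(" ", 1)
    pairs.append((name, symbol))

print(pairs)
# [('British Defence Industry Directory', 'BDEC'), ('Visa', 'V'), ('Goldman Sachs', 'GS')]
```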

CodePudding user response：

text = '''Company British Defence Industry Directory BDEC sells stuff.
Company BDEC Limited BDEC sells stuff.
The company BDEC Limited BDEC sells stuff.
The company BDEC BDEC sells stuff.
The tech company Apple AAPL sells stuff.
The payments company Visa V sells stuff.
Customers are not happy with Goldman Sachs GS.
'''

First remove the first word of each sentence (it is capitalized only because of its position), then keep only the words that start with an uppercase letter.

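import re

for l in text.splitlines():
    # Drop the first word of the line, then keep the words that start with an uppercase letter.
    print([w for w in re.sub(r'^\w+ ', r'', l).split() if w[0].isupper()])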

['British', 'Defence', 'Industry', 'Directory', 'BDEC']
['BDEC', 'Limited', 'BDEC']
['BDEC', 'Limited', 'BDEC']
['BDEC', 'BDEC']
['Apple', 'AAPL']
['Visa', 'V']
['Goldman', 'Sachs', 'GS.']
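
Note that the last result keeps the sentence's final period ('GS.'). One way to avoid that, as a rough sketch rather than part of the answer above, is to strip punctuation from each word before filtering. The `cleaned` list is an illustrative name.

```python
import re
import string

text = """The tech company Apple AAPL sells stuff.
Customers are not happy with Goldman Sachs GS.
"""

cleaned = []
for line in text.splitlines():
    # Drop the first word of the line, as in the loop above.
    rest = re.sub(r"^\w+ ", "", line)
    # Strip leading/trailing punctuation so 'GS.' becomes 'GS'.
    words = [w.strip(string.punctuation) for w in rest.split()]
    # Keep non-empty words that start with an uppercase letter.
    cleaned.append([w for w in words if w and w[0].isupper()])

print(cleaned)
# [['Apple', 'AAPL'], ['Goldman', 'Sachs', 'GS']]
```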

CodePudding user response：

It may be enough to capture the first uppercase character of the name, optionally match further words that start with an uppercase letter, and then require that the final part, which consists only of uppercase characters, begins with that same captured character. This assumes the symbol always starts with the same letter as the company name.

\b([A-Z])\w*(?:\s[A-Z]\w*)*\s\1[A-Z]*\b

import re

pattern = r"\b([A-Z])\w*(?:\s[A-Z]\w*)*\s\1[A-Z]*\b"

s = ("Company British Defence Industry Directory BDEC sells stuff.\n"
     "Company BDEC Limited BDEC sells stuff.\n"
     "The company BDEC Limited BDEC sells stuff.\n"
     "The company BDEC BDEC sells stuff.\n"
     "The tech company Apple AAPL sells stuff.\n"
     "The payments company Visa V sells stuff.\n"
     "Customers are not happy with Goldman Sachs GS.\n\n")

# Print each full match: the company name followed by its symbol.
for m in re.finditer(pattern, s):
    print(m.group(0))

Output

British Defence Industry Directory BDEC
BDEC Limited BDEC
BDEC Limited BDEC
BDEC BDEC
Apple AAPL
Visa V
Goldman Sachs GS
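
If the name and the symbol are needed separately, the same idea can be written with named groups. This is a sketch of a variant, not the pattern from the answer itself: the group names `name`, `initial` and `symbol` are chosen here for illustration, and it still assumes the symbol begins with the first letter of the name.

```python
import re

pattern = re.compile(
    # Name: capitalized words; remember the first letter as "initial".
    r"\b(?P<name>(?P<initial>[A-Z])\w*(?:\s[A-Z]\w*)*)"
    # Symbol: all-uppercase token that starts with that same letter.
    r"\s(?P<symbol>(?P=initial)[A-Z]*)\b"
)

s = ("Company British Defence Industry Directory BDEC sells stuff.\n"
     "The company BDEC Limited BDEC sells stuff.\n"
     "The payments company Visa V sells stuff.\n"
     "Customers are not happy with Goldman Sachs GS.\n")

results = [(m.group("name"), m.group("symbol")) for m in pattern.finditer(s)]

for name, symbol in results:
    print(f"{name} -> {symbol}")
```

Each entry in `results` is a (name, symbol) tuple, for example ('Goldman Sachs', 'GS').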