How to extract ticker name from a pandas DataFrame?-CodePudding

I have following data which is converted in a pandas DataFrame (below lines are a direct copy paste since I do not know how to import it otherwise).

{17: '200 shares ExD 2022-09-21 PD 2022-09-30 dividend GAIN.NASDAQ 15.00 USD (0.075 per share) tax -2.25 USD (-15.000%) DivCntry US USIncmCode 06',
18: '101 shares ExD 2022-09-21 PD 2022-09-30 dividend LTC.NYSE 19.19 USD (0.19 per share) tax -2.88 USD (-15.000%) DivCntry US USIncmCode 06',    
19: '302 shares ExD 2022-09-29 PD 2022-10-12 dividend AGNC.NASDAQ 36.24 USD (0.12 per share) tax -5.44 USD (-15.000%) DivCntry US USIncmCode 06',     
20: '92 shares ExD 2022-07-07 PD 2022-08-22 dividend BTI.NYSE 60.31 USD (0.655523 per share) tax -0.00 USD (-0.0%) DivCntry GB fee amount -0.46 USD (0.005 per share)',     
21: '75 shares ExD 2022-09-14 PD 2022-10-11 dividend MO.NYSE 70.50 USD (0.94 per share) tax -10.58 USD (-15.000%) DivCntry US USIncmCode 06'}

I need a code to extract just the ticker name from this. My lines are below, but it collects the entire description again. Is there a way to code this so the result would only contain tickers (e.g. GAIN.NASDAQ, LTC.NYSE, AGNC.NASDAQ, BTI.NYSE, MO.NYSE)?

import pandas as pd
....
description = dividends[["Description"]]      # a frame dubbed "Description" with lines such as above                                    
ticker = description[description['Description'].str.contains('.NYSE')]
print(ticker)

CodePudding user response：

Simply use the pattern of description, if you do not have a specific list of tickers and split the string:

df['ticker'] = df['description'].str.split('dividend ').str[-1].str.split().str[0]

or use a regex instead

df['ticker'] = df['description'].str.extract(r'(\b[A-Z]\w \.[A-Z]\w )')

To extract a list of tickers:

df['description'].str.extract(r'(\b[A-Z]\w \.[A-Z]\w )')[0].tolist()
-> ['AGNC.NASDAQ', 'BTI.NYSE', 'GAIN.NASDAQ', 'LTC.NYSE', 'MO.NYSE']

To avoid duplicates use set()

set(df['description'].str.extract(r'(\b[A-Z]\w \.[A-Z]\w )')[0].tolist())
-> {'AGNC.NASDAQ', 'BTI.NYSE', 'GAIN.NASDAQ', 'LTC.NYSE', 'MO.NYSE'}

Example

This will create an additional column with ticker in your dataframe:

import pandas as pd

d = {17: '200 shares ExD 2022-09-21 PD 2022-09-30 dividend GAIN.NASDAQ 15.00 USD (0.075 per share) tax -2.25 USD (-15.000%) DivCntry US USIncmCode 06',
18: '101 shares ExD 2022-09-21 PD 2022-09-30 dividend LTC.NYSE 19.19 USD (0.19 per share) tax -2.88 USD (-15.000%) DivCntry US USIncmCode 06', 
19: '302 shares ExD 2022-09-29 PD 2022-10-12 dividend AGNC.NASDAQ 36.24 USD (0.12 per share) tax -5.44 USD (-15.000%) DivCntry US USIncmCode 06',
20: '92 shares ExD 2022-07-07 PD 2022-08-22 dividend BTI.NYSE 60.31 USD (0.655523 per share) tax -0.00 USD (-0.0%) DivCntry GB fee amount -0.46 USD (0.005 per share)', 
21: '75 shares ExD 2022-09-14 PD 2022-10-11 dividend MO.NYSE 70.50 USD (0.94 per share) tax -10.58 USD (-15.000%) DivCntry US USIncmCode 06'}

df = pd.DataFrame(d.values(), columns=['description'])
df['ticker'] = df['description'].str.extract(r'(\b[A-Z]\w \.[A-Z]\w )')

df[['ticker','description']]

Output

	ticker	description
0	GAIN.NASDAQ	200 shares ExD 2022-09-21 PD 2022-09-30 dividend GAIN.NASDAQ 15.00 USD (0.075 per share) tax -2.25 USD (-15.000%) DivCntry US USIncmCode 06
1	LTC.NYSE	101 shares ExD 2022-09-21 PD 2022-09-30 dividend LTC.NYSE 19.19 USD (0.19 per share) tax -2.88 USD (-15.000%) DivCntry US USIncmCode 06
2	AGNC.NASDAQ	302 shares ExD 2022-09-29 PD 2022-10-12 dividend AGNC.NASDAQ 36.24 USD (0.12 per share) tax -5.44 USD (-15.000%) DivCntry US USIncmCode 06
3	BTI.NYSE	92 shares ExD 2022-07-07 PD 2022-08-22 dividend BTI.NYSE 60.31 USD (0.655523 per share) tax -0.00 USD (-0.0%) DivCntry GB fee amount -0.46 USD (0.005 per share)
4	MO.NYSE	75 shares ExD 2022-09-14 PD 2022-10-11 dividend MO.NYSE 70.50 USD (0.94 per share) tax -10.58 USD (-15.000%) DivCntry US USIncmCode 06