I have following data which is converted in a pandas DataFrame (below lines are a direct copy paste since I do not know how to import it otherwise).
{17: '200 shares ExD 2022-09-21 PD 2022-09-30 dividend GAIN.NASDAQ 15.00 USD (0.075 per share) tax -2.25 USD (-15.000%) DivCntry US USIncmCode 06',
18: '101 shares ExD 2022-09-21 PD 2022-09-30 dividend LTC.NYSE 19.19 USD (0.19 per share) tax -2.88 USD (-15.000%) DivCntry US USIncmCode 06',
19: '302 shares ExD 2022-09-29 PD 2022-10-12 dividend AGNC.NASDAQ 36.24 USD (0.12 per share) tax -5.44 USD (-15.000%) DivCntry US USIncmCode 06',
20: '92 shares ExD 2022-07-07 PD 2022-08-22 dividend BTI.NYSE 60.31 USD (0.655523 per share) tax -0.00 USD (-0.0%) DivCntry GB fee amount -0.46 USD (0.005 per share)',
21: '75 shares ExD 2022-09-14 PD 2022-10-11 dividend MO.NYSE 70.50 USD (0.94 per share) tax -10.58 USD (-15.000%) DivCntry US USIncmCode 06'}
I need a code to extract just the ticker name from this. My lines are below, but it collects the entire description again. Is there a way to code this so the result would only contain tickers (e.g. GAIN.NASDAQ, LTC.NYSE, AGNC.NASDAQ, BTI.NYSE, MO.NYSE
)?
import pandas as pd
....
description = dividends[["Description"]] # a frame dubbed "Description" with lines such as above
ticker = description[description['Description'].str.contains('.NYSE')]
print(ticker)
CodePudding user response:
Simply use the pattern of description, if you do not have a specific list of tickers and split
the string:
df['ticker'] = df['description'].str.split('dividend ').str[-1].str.split().str[0]
or use a regex
instead
df['ticker'] = df['description'].str.extract(r'(\b[A-Z]\w \.[A-Z]\w )')
To extract a list
of tickers:
df['description'].str.extract(r'(\b[A-Z]\w \.[A-Z]\w )')[0].tolist()
-> ['AGNC.NASDAQ', 'BTI.NYSE', 'GAIN.NASDAQ', 'LTC.NYSE', 'MO.NYSE']
To avoid duplicates use set()
set(df['description'].str.extract(r'(\b[A-Z]\w \.[A-Z]\w )')[0].tolist())
-> {'AGNC.NASDAQ', 'BTI.NYSE', 'GAIN.NASDAQ', 'LTC.NYSE', 'MO.NYSE'}
Example
This will create an additional column with ticker in your dataframe:
import pandas as pd
d = {17: '200 shares ExD 2022-09-21 PD 2022-09-30 dividend GAIN.NASDAQ 15.00 USD (0.075 per share) tax -2.25 USD (-15.000%) DivCntry US USIncmCode 06',
18: '101 shares ExD 2022-09-21 PD 2022-09-30 dividend LTC.NYSE 19.19 USD (0.19 per share) tax -2.88 USD (-15.000%) DivCntry US USIncmCode 06',
19: '302 shares ExD 2022-09-29 PD 2022-10-12 dividend AGNC.NASDAQ 36.24 USD (0.12 per share) tax -5.44 USD (-15.000%) DivCntry US USIncmCode 06',
20: '92 shares ExD 2022-07-07 PD 2022-08-22 dividend BTI.NYSE 60.31 USD (0.655523 per share) tax -0.00 USD (-0.0%) DivCntry GB fee amount -0.46 USD (0.005 per share)',
21: '75 shares ExD 2022-09-14 PD 2022-10-11 dividend MO.NYSE 70.50 USD (0.94 per share) tax -10.58 USD (-15.000%) DivCntry US USIncmCode 06'}
df = pd.DataFrame(d.values(), columns=['description'])
df['ticker'] = df['description'].str.extract(r'(\b[A-Z]\w \.[A-Z]\w )')
df[['ticker','description']]
Output
ticker | description | |
---|---|---|
0 | GAIN.NASDAQ | 200 shares ExD 2022-09-21 PD 2022-09-30 dividend GAIN.NASDAQ 15.00 USD (0.075 per share) tax -2.25 USD (-15.000%) DivCntry US USIncmCode 06 |
1 | LTC.NYSE | 101 shares ExD 2022-09-21 PD 2022-09-30 dividend LTC.NYSE 19.19 USD (0.19 per share) tax -2.88 USD (-15.000%) DivCntry US USIncmCode 06 |
2 | AGNC.NASDAQ | 302 shares ExD 2022-09-29 PD 2022-10-12 dividend AGNC.NASDAQ 36.24 USD (0.12 per share) tax -5.44 USD (-15.000%) DivCntry US USIncmCode 06 |
3 | BTI.NYSE | 92 shares ExD 2022-07-07 PD 2022-08-22 dividend BTI.NYSE 60.31 USD (0.655523 per share) tax -0.00 USD (-0.0%) DivCntry GB fee amount -0.46 USD (0.005 per share) |
4 | MO.NYSE | 75 shares ExD 2022-09-14 PD 2022-10-11 dividend MO.NYSE 70.50 USD (0.94 per share) tax -10.58 USD (-15.000%) DivCntry US USIncmCode 06 |