Replace span tags with whitespace or parse contents as new column with pandas.read

I want to scrape Congressional stock trades from

Here is where I found the span tag:

Company tickers are 1 to 5 characters in length, and I have failed to regex tickers because there are many varieties of company suffixes (e.g., "INC", "CORP", "PLC", "SE", etc.), and not all company names have suffixes.

How can I either replace span tags with whitespace to separate company names and tickers or parse the span as another column?

Here is my code:


import pandas as pd
import yfinance as yf
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import datetime

def get_url(page=1, pageSize=50, assetType='stock'):
    if page == 1:
        return f'https://www.capitoltrades.com/trades?assetType={assetType}&pageSize={pageSize}'
    elif page > 1:
        return f'https://www.capitoltrades.com/trades?assetType={assetType}&page={page}&pageSize={pageSize}'
    else:
        return None

driver = webdriver.Firefox()
driver.get(get_url(page=1))
driver.implicitly_wait(10)
time.sleep(1)
soup = BeautifulSoup(driver.page_source, 'lxml')
tables = soup.find_all('table')
table = pd.read_html(str(tables))[0]
driver.close()

CodePudding user response：

To separate company names and tickers or parse the span as another column aka to get overall neat and clean ResultSet, you can change your tool selection strategy a bit. In this case, It would be better to apply bs4 with pandas DataFrame instead of pd.read_html() method.

Full working code as an example:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
#The base url: https://www.capitoltrades.com/trades?assetType=stock&pageSize=50
data = []
for page in range(1, 5):
    driver.get(f'https://www.capitoltrades.com/trades?assetType=stock&pageSize=50&page={page}')
    driver.maximize_window()
    time.sleep(3)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for row in soup.select('table.q-table.trades-table > tbody tr'):
        Politician = row.select_one('[] > a').text.strip()
        Politician_info = row.select_one('[]').get_text(' ',strip=True)
        Traded_Issuer = row.select_one('[] > a').text.strip()
        Issuer_ticker =row.select_one('span[]').text.strip()
        Published = row.select_one('[] .q-value').text.strip()
        Traded = row.select_one('[] .q-value')
        Traded = Traded.text.strip() if Traded else None
        Filed_after = row.select_one('[] .q-value').text.strip()
        Owner =row.select_one('[] span').text.strip()
        _type = row.select_one('[]').get_text(strip=True)
        Size = row.select_one('[] > div').get_text(strip=True)
        #Size = Size.text.strip() if Size else None
        Price =row.select_one('[]').text.strip()

        data.append({
            'Politician':Politician,
            'Politician_info':Politician_info,
            'Traded_Issuer':Traded_Issuer,
            'Issuer_ticker':Issuer_ticker,
            'Published':Published,
            'Traded':Traded,
            'Filed_after':Filed_after,
            'Owner':Owner,
            'Type':_type,
            'Size':Size,
            'Price':Price

        })


df = pd.DataFrame(data)
print(df)

Output:

                   Politician      Politician_info                 Traded_Issuer Issuer_ticker  ...        Owner  Type      Size   Price
0    Debbie Wasserman Schultz    Democrat House FL  ADAMS RESOURCES & ENERGY INC         AE:US  ...        Child  sell  1K - 15K   32.27     
1               Kathy Manning    Democrat House NC                         3M Co        MMM:US  ...       Spouse   buy  1K - 15K  108.95     
2               Kathy Manning    Democrat House NC                 Accenture PLC        ACN:US  ...       Spouse   buy  1K - 15K  250.07     
3               Kathy Manning    Democrat House NC                     Adobe Inc       ADBE:US  ...       Spouse  sell  1K - 15K  286.15     
4               Kathy Manning    Democrat House NC                  Alphabet Inc      GOOGL:US  ...       Spouse   buy  1K - 15K   97.56     
..                        ...                  ...                           ...           ...  ...          ...   ...       ...     ...       
195         Diana Harshbarger  Republican House TN                 CME Group Inc        CME:US  ...        Joint  sell  1K - 15K  176.26       
196         Diana Harshbarger  Republican House TN                 CME Group Inc        CME:US  ...       Spouse  sell  1K - 15K  176.26       
197         Diana Harshbarger  Republican House TN            The Home Depot Inc         HD:US  ...  Undisclosed  sell  1K - 15K  268.69       
198         Diana Harshbarger  Republican House TN            The Home Depot Inc         HD:US  ...  Undisclosed  sell  1K - 15K  268.69       
199         Diana Harshbarger  Republican House TN            The Home Depot Inc         HD:US  ...        Joint  sell  1K - 15K  268.69       

[200 rows x 11 columns]

CodePudding user response：

Btw, _{you don't need to go through those steps with soup and find_all - read_html already uses the same [or at least very similar] parsers as bs4 so you could just have}

table = pd.read_html(driver.page_source)[0] # , flavor='lxml')[0] # default

_{or even directly pd.read_html(get_url(page=1)) - it doesn't always work for all pages, but it seems to work for your page; it won't allow for the solutions I'm about to suggest, but I thought it was worth mentioning this convenience in case you didn't already know.}

Anyway, my first solution is going to seem a bit hacky, but it has worked for me so far - since, read_html doesn't seem to have any string-separator argument like bs4's get_text does, we can try to create the same effect by just adding spaces to the end of every tag in the html using .replace.

table = pd.read_html(driver.page_source.replace('</', '&nbsp;</'))

If you are wary of messing with text-replace on html strings, or are worried about adding unwanted spaces (even though it doesn't, as far I have seen), or just want more control of how the text is extracted from table cells, you can also convert the bs4 table tag to a python dictionary using get_text [with space (' ') separator] and list comprehension.

tableSoup = soup.find('table')
tableHeaders = [
    h.get_text(' ', strip=True)
    for h in tableSoup.select('thead tr th')
]
tableData = [{h: c.get_text(' ', strip=True) for h, c in zip(
    tableHeaders, [cell for cell in r.select('td')]
)} for r in tableSoup.select('tbody tr')]
table = pd.DataFrame(tableData)

You could even replace get_text with your own function if you want to customize how cell values are extracted and/or processed, but as it is, it [and the first method with .replace] creates a DataFrame like:

Politician	Traded Issuer	Published	Traded	Filed after	Owner	Type	Size	Price
Pete Sessions Republican House TX	US TREASURY NOTE N/A	yesterday 14:05	2022 9 Nov	days 1	Undisclosed	buy	15K - 50K	N/A
Pete Sessions Republican House TX	US TREASURY NOTE N/A	yesterday 14:05	2022 7 Nov	days 3	Undisclosed	buy	50K - 100K	N/A
Pete Sessions Republican House TX	US TREASURY NOTE N/A	yesterday 14:05	2022 7 Nov	days 3	Undisclosed	buy	1K - 15K	N/A
Brad Sherman Democrat House CA	US TREASURY NOTE N/A	yesterday 14:05	2022 3 Nov	days 7	Undisclosed	sell	250K - 500K	N/A
Brad Sherman Democrat House CA	US TREASURY NOTE N/A	yesterday 14:05	2022 3 Nov	days 7	Undisclosed	sell	500K - 1M	N/A
Brad Sherman Democrat House CA	US TREASURY NOTE N/A	yesterday 14:05	2022 3 Nov	days 7	Undisclosed	sell	250K - 500K	N/A
Brad Sherman Democrat House CA	US TREASURY NOTE N/A	yesterday 14:05	2022 7 Nov	days 3	Undisclosed	sell	250K - 500K	N/A
Brad Sherman Democrat House CA	US TREASURY NOTE N/A	yesterday 14:05	2022 3 Nov	days 7	Undisclosed	sell	250K - 500K	N/A
Brad Sherman Democrat House CA	US TREASURY NOTE N/A	yesterday 14:05	2022 3 Nov	days 7	Undisclosed	sell	15K - 50K	N/A
Debbie Wasserman Schultz Democrat House FL	ADAMS RESOURCES & ENERGY INC AE:US	yesterday 14:05	2022 27 Oct	days 14	Child	sell	1K - 15K	32.27
Kathy Manning Democrat House NC	3M Co MMM:US	yesterday 14:05	2022 12 Oct	days 29	Spouse	buy	1K - 15K	108.95
Kathy Manning Democrat House NC	Accenture PLC ACN:US	yesterday 14:05	2022 12 Oct	days 29	Spouse	buy	1K - 15K	250.07
Kathy Manning Democrat House NC	Adobe Inc ADBE:US	yesterday 14:05	2022 12 Oct	days 29	Spouse	sell	1K - 15K	286.15
Kathy Manning Democrat House NC	Alphabet Inc GOOGL:US	yesterday 14:05	2022 12 Oct	days 29	Spouse	buy	1K - 15K	97.56
Kathy Manning Democrat House NC	Amazon.com Inc AMZN:US	yesterday 14:05	2022 12 Oct	days 29	Spouse	buy	1K - 15K	112.9

[Markdown for the table above was copied from the output of print(table.drop([''], 'columns').to_markdown(index=False)).]