How to scrape table data that doesn't have distinct classes?

I'm trying to write some code that will scrape data from a table on a stock screener website and save it to Excel. The problem is that some of the values I want to pull from the table don't have a distinct class. For the first column I wanted only the ticker, so I tried the code below, but it pulls all of the tab-links on the page. Any help would be appreciated.

from bs4 import BeautifulSoup
import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'}
df_headers = ['Ticker', 'Owner', 'Relationship', 'Date', 'Transaction', 'Total Shares', 'SEC Form']
url = "https://finviz.com/insidertrading.ashx"
r = requests.get(url, headers=headers)

soup = BeautifulSoup(r.content, 'lxml')

Ticker = [item.text for item in soup.select('.tab-link:nth-of-type(1):not([id])')]
print(Ticker)

I also tried Ticker = [item.text for item in soup.select('.insider-buy-row-2 .tab-link')]. It did pull the ticker I wanted, but it also included the person's name and other rows.

CodePudding user response：

Use a combination of pandas and BeautifulSoup:

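from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'}
url = "https://finviz.com/insidertrading.ashx"
r = requests.get(url, headers=headers)

soup = BeautifulSoup(r.content, 'lxml')

# Collect every <table> on the page and let pandas parse each one into a DataFrame.
# StringIO avoids the deprecation warning newer pandas raises for literal HTML strings.
tbl = soup.find_all("table")
tbls = pd.read_html(StringIO(str(tbl)))

# Index 4 was the insider-trading table on this page; re-check it if the layout changes.
df = tbls[4]
# The first row holds the column names: promote it to the header, then drop it from the data.
df.columns = df.iloc[0]
df = df[1:]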

The important part here is that pd.read_html parses every <table> tag it finds and returns a list of DataFrames, one per table. You just have to pick the right table from that list and set its header row properly.
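
If the hard-coded table index feels fragile, one possible variation is to let pd.read_html pick the table by its text and take the header from the first row. The sketch below runs on a small made-up HTML snippet rather than the live page, so SAMPLE_HTML, the helper name read_table_by_text, and the sample rows are all assumptions for illustration; match and header are real pd.read_html parameters.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a page: a layout table followed by the data table.
SAMPLE_HTML = """
<table><tr><td>Home</td><td>Screener</td></tr></table>
<table>
  <tr><td>Ticker</td><td>Owner</td><td>Transaction</td></tr>
  <tr><td>ABC</td><td>Jane Doe</td><td>Buy</td></tr>
  <tr><td>XYZ</td><td>John Roe</td><td>Sale</td></tr>
</table>
"""


def read_table_by_text(html: str, text: str) -> pd.DataFrame:
    """Return the first table whose content matches `text` (hypothetical helper)."""
    # match= keeps only tables containing text that matches the given string/regex;
    # header=0 uses the first row of each table as the column names.
    tables = pd.read_html(StringIO(html), match=text, header=0)
    return tables[0]


df = read_table_by_text(SAMPLE_HTML, "Ticker")
print(df["Ticker"].tolist())
```

On a real page with nested tables, an outer wrapper table can also match the text, so it is still worth inspecting the returned list before trusting the first entry. Once the right DataFrame is selected, df.to_excel can write it to an Excel file.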