I'm practising Python scraping and I'm stuck on the following exercise. The aim is to scrape the tickers that the screener returns after applying some filters. Code below:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

tickers = []
counter = 1
while True:
    url = ("https://finviz.com/screener.ashx?v=111&f=cap_large&r=" + str(counter))
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    html = soup(webpage, "html.parser")
    rows = html.select('table[bgcolor="#d3d3d3"] tr')
    for i in rows[1:]:
        a1, a2, a3, a4 = (x.text for x in i.find_all('td')[1:5])
        i = a1
        tickers.append(i)
    counter += 20
    if tickers[-1]==tickers[-2]:
        break
I'm not sure how to extract only one column, so I'm unpacking all of them (a1, a2, a3, a4 = (x.text for x in i.find_all('td')[1:5])). Is there a way to get just the first column?
Is there a way to avoid hardcoding '20' in the script?
When I run the code, it adds a duplicate of the last ticker. Is there another way to make the loop stop once it has gone through all the entries?
CodePudding user response:
You can use an nth-child range to skip the first (header) row of the table, then nth-child(2) to select the ticker column within the remaining rows:
tickers = [td.text for td in html.select('table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)')]
With an existing list, use extend instead:
tickers.extend([td.text for td in html.select('table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)')])
Read about nth-child here:
https://developer.mozilla.org/en-US/docs/Web/CSS/:nth-child
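To see what the two nth-child selectors pick out without hitting the site, here is a minimal sketch run against a small inline table. The sample markup and its values are made up for illustration and only imitate the layout assumed above (a header row, with the ticker in the second cell); the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the screener table: a header row, then data rows.
sample_html = """
<table bgcolor="#d3d3d3">
  <tr><td>No.</td><td>Ticker</td><td>Company</td></tr>
  <tr><td>1</td><td>AAPL</td><td>Apple Inc.</td></tr>
  <tr><td>2</td><td>MSFT</td><td>Microsoft Corporation</td></tr>
</table>
"""

sample = BeautifulSoup(sample_html, "html.parser")

# tr:nth-child(n+2) matches every row from the second one on (skips the header);
# td:nth-child(2) matches the second cell of each of those rows.
selector = 'table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)'
sample_tickers = [td.text for td in sample.select(selector)]
print(sample_tickers)
```

The header cell "Ticker" is not included, because its row is the first child of the table and n+2 starts counting at 2.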
You can stop when there is no longer a "next" link on the page. counter needs to be incremented by 20 on each request.
import requests
from bs4 import BeautifulSoup as bs

tickers = []
counter = 1
with requests.Session() as s:
    s.headers = {'User-Agent':'Mozilla/5.0'}
    while True:
        # print(counter)
        url = ("https://finviz.com/screener.ashx?v=111&f=cap_large&r=" + str(counter))
        res = s.get(url)
        html = bs(res.text, "html.parser")
        # skip the header row, take the second cell (the ticker) of every other row
        tickers.extend([td.text for td in html.select('table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)')])
        # no "next" link means this was the last page
        if html.select_one('.tab-link b:-soup-contains("next")') is None:
            break
        counter += 20
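The loop shape above can also be tried without any network access. The sketch below is not part of the original answer: collect_all, fake_fetch and FAKE_DATA are hypothetical names, and the fake fetcher only imitates a site that serves 20 rows per page. It stops as soon as a page reports no "next", so the last ticker is never appended twice, and it steps by the number of rows actually returned instead of a hardcoded 20 (an assumption that only holds if every non-final page is full).

```python
def collect_all(fetch_page, start=1):
    """Collect items from every page.

    fetch_page(offset) is a hypothetical callable that returns
    (items_on_page, has_next) for the page starting at `offset`.
    """
    items = []
    offset = start
    while True:
        page_items, has_next = fetch_page(offset)
        items.extend(page_items)
        # Stop before requesting a page that does not exist.
        if not has_next or not page_items:
            break
        # Step by the number of rows actually returned (no hardcoded 20).
        offset += len(page_items)
    return items


# Stand-in for the real site: 45 fake tickers, served 20 per page.
FAKE_DATA = [f"T{n:02d}" for n in range(1, 46)]


def fake_fetch(offset, page_size=20):
    first = offset - 1  # offsets are 1-based, like the r= parameter
    page = FAKE_DATA[first:first + page_size]
    has_next = first + page_size < len(FAKE_DATA)
    return page, has_next


result = collect_all(fake_fetch)
print(len(result), len(set(result)))
```

In a real scraper, fetch_page would wrap the request and parsing shown above and return the page's tickers together with whether a "next" link was found.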
CodePudding user response:
Since you are only interested in the values of the ticker column, select them more specifically, based on the <a> element that holds each one:
html.select('table[bgcolor="#d3d3d3"] a.screener-link-primary')
To avoid working with the hardcoded 20, check whether there is a next-page element and use its href:
html.select_one('.tab-link:-soup-contains("next")')
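As a quick offline check of both selectors, here is a minimal sketch against an inline snippet. The markup, the href values and the variable names are invented for illustration and only mimic the class names used above; urljoin is used here as one way to turn the relative href into a full URL.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical stand-in for one screener page with a "next" link.
sample_html = """
<table bgcolor="#d3d3d3">
  <tr><td>No.</td><td>Ticker</td></tr>
  <tr><td>1</td><td><a class="screener-link-primary" href="quote.ashx?t=AAPL">AAPL</a></td></tr>
  <tr><td>2</td><td><a class="screener-link-primary" href="quote.ashx?t=MSFT">MSFT</a></td></tr>
</table>
<a class="tab-link" href="screener.ashx?v=111&amp;f=cap_large&amp;r=21"><b>next</b></a>
"""

page = BeautifulSoup(sample_html, "html.parser")

# The ticker links carry their own class, so no row or column counting is needed.
sample_tickers = [a.text for a in page.select('table[bgcolor="#d3d3d3"] a.screener-link-primary')]

# select_one returns None when no element contains the text "next".
next_link = page.select_one('.tab-link:-soup-contains("next")')
next_url = urljoin("https://finviz.com/", next_link["href"]) if next_link else None

print(sample_tickers, next_url)
```

On the last page the selector finds nothing, next_url is None, and the loop can stop without knowing the page size.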
Example
import requests, time
from bs4 import BeautifulSoup

url = "https://finviz.com/screener.ashx?v=111&f=cap_large"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36','accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'}
tickers = []
while True:
    r = requests.get(url, headers=headers)
    html = BeautifulSoup(r.text, "html.parser")
    for a in html.select('table[bgcolor="#d3d3d3"] a.screener-link-primary'):
        tickers.append(a.text)
    # follow the "next" link if there is one, otherwise this was the last page
    if (url := html.select_one('.tab-link:-soup-contains("next")')):
        url = "https://finviz.com/" + url['href']
    else:
        break
    # be kind and add some delay between your requests
    time.sleep(1)
tickers