Python Beautifulsoup scraping script unpacking, hardcoding and duplication


I'm practising some Python scraping and I'm a bit stuck with the following exercise. The aim is to scrape the tickers that result from applying some filters. Code below:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

tickers = []
counter = 1

while True:
    url = ("https://finviz.com/screener.ashx?v=111&f=cap_large&r=" + str(counter))
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    html = soup(webpage, "html.parser")

    rows = html.select('table[bgcolor="#d3d3d3"] tr')
    for i in rows[1:]:
        a1, a2, a3, a4 = (x.text for x in i.find_all('td')[1:5])
        i = a1
        tickers.append(i)
    counter += 20
    if tickers[-1] == tickers[-2]:
        break

I'm not sure how to extract only one column, so I'm using the code for all of them (a1, a2, a3, a4 = (x.text for x in i.find_all('td')[1:5])). Is there a way to get just the first column?

Is there a way to avoid having to hardcode '20' in the script?

When I run the code it creates a duplicate of the last ticker. Is there another way to make the code stop once it has gone through all the entries?

CodePudding user response:

You can use an nth-child range to filter out the first row of the table, then nth-child(2) to get the tickers column from the remaining rows:

tickers = [td.text for td in html.select('table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)')]

With an existing list use

tickers.extend([td.text for td in html.select('table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)')])

Read about nth-child here:

http://nthmaster.com/

and

https://developer.mozilla.org/en-US/docs/Web/CSS/:nth-child
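
If you want to see what the nth-child range does in isolation, here is a minimal sketch against a made-up table (the markup and tickers below are invented for illustration, they are not the real finviz markup):

from bs4 import BeautifulSoup

# Toy table: the first <tr> is the header row, the rest hold data.
html_doc = """
<table bgcolor="#d3d3d3">
  <tr><td>No.</td><td>Ticker</td><td>Company</td></tr>
  <tr><td>1</td><td>AAPL</td><td>Apple Inc.</td></tr>
  <tr><td>2</td><td>MSFT</td><td>Microsoft Corp.</td></tr>
</table>
"""
doc = BeautifulSoup(html_doc, "html.parser")

# tr:nth-child(n+2) keeps every row from the second one onwards,
# td:nth-child(2) then picks only the ticker column of each row.
print([td.text for td in doc.select('tr:nth-child(n+2) td:nth-child(2)')])
# ['AAPL', 'MSFT']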


You can stop when there is no "next" link present. counter needs to increment by 20 on each request.

import requests
from bs4 import BeautifulSoup as bs

tickers = []
counter = 1

with requests.Session() as s:
    s.headers = {'User-Agent':'Mozilla/5.0'}
    while True:
        # print(counter)
        url = ("https://finviz.com/screener.ashx?v=111&f=cap_large&r="  str(counter))
        res = s.get(url)
        html = bs(res.text, "html.parser")
        tickers.extend([td.text for td in html.select('table[bgcolor="#d3d3d3"] tr:nth-child(n+2) td:nth-child(2)')])
        
        if html.select_one('.tab-link b:-soup-contains("next")') is None:
            break
        counter += 20
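
As a side note on the duplicate-ticker issue from the question: if a run still leaves a stray duplicate at the tail, the list can be de-duplicated afterwards while keeping order. A minimal sketch with an invented result list:

# Hypothetical result of a run that appended the last ticker twice.
tickers = ['AAPL', 'MSFT', 'GOOG', 'GOOG']

# dict.fromkeys keeps the first occurrence of each key and preserves order.
tickers = list(dict.fromkeys(tickers))
print(tickers)  # ['AAPL', 'MSFT', 'GOOG']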

CodePudding user response:

Since you are only interested in the values of the tickers column, select it more specifically, based on the <a> element it contains:

html.select('table[bgcolor="#d3d3d3"] a.screener-link-primary')

To avoid working with the hardcoded 20, just check whether there is a next-page element and use its href:

html.select_one('.tab-link:-soup-contains("next")')

Example

import requests,time
from bs4 import BeautifulSoup

url = "https://finviz.com/screener.ashx?v=111&f=cap_large"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36','accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'}
tickers = []

while True:
    r = requests.get(url, headers=headers)
    html = BeautifulSoup(r.text, "html.parser")

    for a in html.select('table[bgcolor="#d3d3d3"] a.screener-link-primary'):
        tickers.append(a.text)

    if (url := html.select_one('.tab-link:-soup-contains("next")')):
        url = "https://finviz.com/" url['href']
    else:
        break
    # be kind and add some delay between your requests
    time.sleep(1)

print(tickers)
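
Two small notes on this snippet: the url := html.select_one(...) assignment expression requires Python 3.8 or newer, and the :-soup-contains() pseudo-class comes from the soupsieve package that is installed alongside recent versions of beautifulsoup4.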