Home > Software engineering >  Fix for missing 'tr' class in webscraping
Fix for missing 'tr' class in webscraping

Time:12-24

I'm trying to webscrape different stocks by rows, with the data scraped from https://www.slickcharts.com/sp500. I am following a tutorial using a similar website, however that website uses classes for each of its rows, while mine doesn't (attached below).

Website used in tutorial

Website desired

This is the code I'm trying to use, however I don't get any output whatsoever. I'm still pretty new at coding so any feedback is welcome.

import requests
import pandas as pd
from bs4 import BeautifulSoup

company = []
symbol = []

url = 'https://www.slickcharts.com/sp500' #Data from SlickCharts
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.find_all('tr')

for i in rows:
    row = i.find_all('td')
    print(row[0])

CodePudding user response:

First of all, you need to add some headers to your request because most likely you get the same as me: status code 403 Forbidden. It's because the website is blocking your request. Adding User-Agent does the trick:

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}

page = requests.get(url, headers=headers)

Then you can iterate over tr tags as you do. But you should be careful, because, for example first tr doesn't have td tags and you will get exception in the row:

print(row[0])

Here is the example of code that prints names of all companies:

import requests
from bs4 import BeautifulSoup

company = []
symbol = []

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

url = 'https://www.slickcharts.com/sp500' #Data from SlickCharts
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

rows = soup.find_all('tr')
for row in rows:
    all_td_tags = row.find_all('td')
    if len(all_td_tags) > 0:
        print(all_td_tags[1].text)

But this code also outputs some other data besides company names. It's because you are iterating over all tr tags on the page. But you need to iterate over a specific table only (first table on the page in this case).

import requests
from bs4 import BeautifulSoup

company = []
symbol = []

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

url = 'https://www.slickcharts.com/sp500' #Data from SlickCharts
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

first_table_on_the_page = soup.find('table')
rows = first_table_on_the_page.find_all('tr')
for row in rows:
    all_td_tags = row.find_all('td')
    if len(all_td_tags) > 0:
        print(all_td_tags[1].text)

  • Related