Same Values Returned When Scraping with BeautifulSoup


I am trying to scrape some ETF stock information from https://etfdb.com/etfs/sector/technology/#etfs&sort_name=assets_under_management&sort_order=desc&page=1 as a personal project.

What I am trying to do is scrape the table shown on each of the pages, but it always returns the same values even though I update the page number in the URL. Is there some sort of limitation or something to do with the webpage that I am not considering? What can I do to scrape the tables from pages 1 through 5 from the above link?

The code that I am trying to use is as follows:

import pandas as pd
import requests

def etf_table_scraper(industry):
    # instantiate an empty dataframe
    df = pd.DataFrame()

    # cycle through the pages
    for page in range(1, 10):
        url = f"https://etfdb.com/etfs/sector/{industry}/#etfs__returns&sort_name=symbol&sort_order=asc&page={page}"
        r = requests.get(url)
        df_list = pd.read_html(r.text)[0]  # read_html parses every table on the page into a list; take the first one

        # if first page, append
        if page == 1:
            df = df.append(df_list.iloc[:-1])

        # otherwise check to see if there are overlaps
        elif df_list.loc[0, 'Symbol'] not in df['Symbol'].unique():
            df = df.append(df_list.iloc[:-1])

        else:
            break

    return df

CodePudding user response:

So I saw the same issue as you when using requests. I was able to work around it, though, by using Selenium and clicking the next-page button. Here's some sample code; you'd need to rework it into your own flow, as this was just used for testing.

from selenium import webdriver
from time import sleep
import pandas as pd


df = pd.DataFrame()

driver = webdriver.Chrome(executable_path=r"C:\chromedriver_win32\chromedriver.exe")  # Add your own path here
driver.get("https://etfdb.com/etfs/sector/technology/#etfs&sort_name=assets_under_management&sort_order=desc&page=1")

sleep(2)

text = driver.page_source  # Get the rendered page source so the table can be read
table_pg1 = pd.read_html(text)[0].iloc[:-1]
df = df.append(table_pg1)

sleep(2)

for i in range(1, 4):
    
    driver.find_element_by_xpath('//*[@id="featured-wrapper"]/div[1]/div[4]/div[1]/div[2]/div[2]/div[2]/div[4]/div[2]/ul/li[8]/a').click()  # Click the next-page button
    sleep(3)
    text = driver.page_source
    table_pg_i = pd.read_html(text)[0].iloc[:-1]
    df = df.append(table_pg_i)
    
driver.close()
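
If it helps, here is one way the snippet above could be reworked into a reusable function along the lines of your etf_table_scraper. This is only a rough sketch: the function name etf_table_scraper_selenium and the pages parameter are my own, it reuses the imports from the snippet above, and it assumes the same next-page XPath works for each sector page and that your chromedriver path is filled in.

def etf_table_scraper_selenium(industry, pages=5):
    # Sketch only: pass your own chromedriver path, as in the snippet above
    driver = webdriver.Chrome(executable_path=r"C:\chromedriver_win32\chromedriver.exe")
    driver.get(f"https://etfdb.com/etfs/sector/{industry}/#etfs&sort_name=assets_under_management&sort_order=desc&page=1")
    sleep(2)

    df = pd.DataFrame()
    for page in range(1, pages + 1):
        # Read the first table on the currently rendered page, dropping the footer row
        table = pd.read_html(driver.page_source)[0].iloc[:-1]
        df = df.append(table)

        if page < pages:
            # Click the next-page button (same XPath as above) and give the table time to re-render
            driver.find_element_by_xpath('//*[@id="featured-wrapper"]/div[1]/div[4]/div[1]/div[2]/div[2]/div[2]/div[4]/div[2]/ul/li[8]/a').click()
            sleep(3)

    driver.close()
    return df

You would then call it with something like tech_df = etf_table_scraper_selenium("technology").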