I am trying to scrape some ETF stock information from https://etfdb.com/etfs/sector/technology/#etfs&sort_name=assets_under_management&sort_order=desc&page=1 as a personal project.
What I am trying to do is scrape the table shown on each of the pages, but it always returns the same values even though I update the page number in the URL. Is there some limitation or something about the webpage that I am not considering? What can I do to scrape the tables from pages 1 through 5 of the above link?
The code that I am trying to use is as follows:
import pandas as pd
import requests
def etf_table_scraper(industry):
    # instantiate empty dataframe
    df = pd.DataFrame()

    # cycle through the pages
    for page in range(1, 10):
        url = f"https://etfdb.com/etfs/sector/{industry}/#etfs__returns&sort_name=symbol&sort_order=asc&page={page}"
        r = requests.get(url)
        df_list = pd.read_html(r.text)[0]  # read_html returns a list of tables; take the first one

        # if first page, append
        if page == 1:
            df = df.append(df_list.iloc[:-1])
        # otherwise check to see if there are overlaps
        elif df_list.loc[0, 'Symbol'] not in df['Symbol'].unique():
            df = df.append(df_list.iloc[:-1])
        else:
            break

    return df
CodePudding user response:
So I saw the same issue as you when using requests. The part of the URL after the `#` is a fragment: the page's JavaScript uses it to pick which page to render, but it is never sent to the server, so requests gets back the same HTML no matter what page number you put there. I was able to work around this using Selenium and clicking the next page button. Here's some sample code; you'd need to rework it to fit your flow, as this was just used for testing.
from time import sleep

import pandas as pd
from selenium import webdriver

df = pd.DataFrame()
driver = webdriver.Chrome(executable_path=r"C:\chromedriver_win32\chromedriver.exe")  # add your own path here
driver.get("https://etfdb.com/etfs/sector/technology/#etfs&sort_name=assets_under_management&sort_order=desc&page=1")
sleep(2)

text = driver.page_source  # get the page source so pandas can parse the table
table_pg1 = pd.read_html(text)[0].iloc[:-1]  # drop the trailing summary row
df = df.append(table_pg1)
sleep(2)

for i in range(1, 4):
    # click the next-page button, then give the table time to re-render
    driver.find_element_by_xpath('//*[@id="featured-wrapper"]/div[1]/div[4]/div[1]/div[2]/div[2]/div[2]/div[4]/div[2]/ul/li[8]/a').click()
    sleep(3)
    text = driver.page_source
    table_pg_i = pd.read_html(text)[0].iloc[:-1]
    df = df.append(table_pg_i)

driver.close()
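
If you're on Selenium 4, executable_path and find_element_by_xpath have been removed, so the snippet above won't run as-is. Below is a minimal sketch of the same loop with the newer API; the chromedriver path, the next-page XPath (copied from above), and the 5-page count are assumptions you'd need to verify against the live page.

import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# same XPath as in the snippet above; confirm it still matches the next-page button
NEXT_BUTTON_XPATH = '//*[@id="featured-wrapper"]/div[1]/div[4]/div[1]/div[2]/div[2]/div[2]/div[4]/div[2]/ul/li[8]/a'

driver = webdriver.Chrome(service=Service(r"C:\chromedriver_win32\chromedriver.exe"))  # adjust to your own driver path
driver.get("https://etfdb.com/etfs/sector/technology/#etfs&sort_name=assets_under_management&sort_order=desc&page=1")

frames = []
for page in range(1, 6):  # pages 1 through 5
    time.sleep(3)  # crude wait for the table to render; tune as needed
    frames.append(pd.read_html(driver.page_source)[0].iloc[:-1])  # drop the trailing summary row
    if page < 5:
        driver.find_element(By.XPATH, NEXT_BUTTON_XPATH).click()  # go to the next page

driver.quit()
df = pd.concat(frames, ignore_index=True)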