I'm scraping news articles from a website whose category pages have no load-more button; the article links are generated as I scroll down. I wrote a function that takes a category page URL (url) and limit_loading (how many times I want to scroll down) and returns all the links of the news articles displayed on that page.
Category page link = https://www.scmp.com/topics/trade
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


def get_article_links(url, limit_loading):
    options = webdriver.ChromeOptions()
    lists = ['disable-popup-blocking']
    caps = DesiredCapabilities().CHROME
    caps["pageLoadStrategy"] = "normal"
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-notifications")
    options.add_argument("--disable-Advertisement")
    options.add_argument("--disable-popup-blocking")
    driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe", options=options)  # add your chromedriver path
    driver.get(url)

    # Scroll to the bottom until the limit is reached or the page stops growing
    last_height = driver.execute_script("return document.body.scrollHeight")
    loading = 0
    while loading < limit_loading:
        loading += 1
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(8)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # Collect the href of every article title on the loaded page
    article_links = []
    bsObj = BeautifulSoup(driver.page_source, 'html.parser')
    for i in bsObj.find('div', {'class': 'content-box'}).find('div', {'class': 'topic-article-container'}).find_all('h2', {'class': 'article__title'}):
        article_links.append(i.a['href'])
    return article_links
Assuming I want to scroll 5 times on this category page:
get_article_links('https://www.scmp.com/topics/trade', 5)
But even if I change limit_loading, it returns only the links from the first page. I've made some mistake in the scrolling part. Please help me with this.
CodePudding user response:
Instead of scrolling by the body's scrollHeight property, I checked whether there was an appropriate element after the list of articles to scroll to. I noticed this appropriately named div:
<div data-v-db98a5c0="" class="topic-content__load-more-anchor"></div>
Accordingly, I mainly changed the while loop in your function get_article_links: find this div before the loop starts, then scroll to it on each iteration using location_once_scrolled_into_view, as follows:
    loading = 0
    end_div = driver.find_element('class name', 'topic-content__load-more-anchor')
    while loading < limit_loading:
        loading += 1
        print(f'scrolling to page {loading}...')
        # Reading this property makes Selenium scroll the element into view
        end_div.location_once_scrolled_into_view
        time.sleep(2)
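The loop above always scrolls limit_loading times, even if the site has run out of articles. A possible variation, sketched here under assumptions, stops early once a scroll adds no new titles. The helper name scroll_to_load, its pause parameter, and the h2.article__title selector (taken from the question's parsing code) are illustrative choices, not part of the original answer, and a 2-second pause may be too short on a slow connection, in which case the early stop would trigger wrongly.

```python
import time


def scroll_to_load(driver, limit_loading, pause=2.0,
                   anchor_class="topic-content__load-more-anchor",
                   item_selector="h2.article__title"):
    """Scroll to the anchor div up to limit_loading times.

    Stops early when a scroll adds no new article titles.
    Returns the number of scrolls actually performed.
    `driver` is expected to be a Selenium WebDriver with the page loaded.
    """
    end_div = driver.find_element("class name", anchor_class)
    seen = len(driver.find_elements("css selector", item_selector))
    scrolls = 0
    while scrolls < limit_loading:
        scrolls += 1
        # Reading this property makes Selenium scroll the element into view.
        end_div.location_once_scrolled_into_view
        time.sleep(pause)
        count = len(driver.find_elements("css selector", item_selector))
        if count == seen:
            break  # nothing new appeared; assume the list is exhausted
        seen = count
    return scrolls
```

Inside get_article_links, the while loop could then be replaced by a single call such as scroll_to_load(driver, limit_loading) before parsing driver.page_source.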
If we now call the function with different limit_loading values, we get different counts of unique news links. Here are a couple of runs:
>>> ar_links = get_article_links('https://www.scmp.com/topics/trade', 2)
scrolling to page 1...
scrolling to page 2...
>>> len(ar_links)
90
>>> ar_links = get_article_links('https://www.scmp.com/topics/trade', 3)
scrolling to page 1...
scrolling to page 2...
scrolling to page 3...
>>> len(ar_links)
120
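Since the counts above are meant to be of unique links, it may help to de-duplicate the returned list explicitly. The sketch below is one way to do that while also resolving relative hrefs against the site root. The helper name unique_absolute_links is hypothetical, the example paths are made up, and whether the page's hrefs are relative at all is an assumption, not something the runs above confirm.

```python
from urllib.parse import urljoin


def unique_absolute_links(hrefs, base_url="https://www.scmp.com"):
    """Resolve each href against base_url and drop duplicates, keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        full = urljoin(base_url, href)  # absolute hrefs pass through unchanged
        if full not in seen:
            seen.add(full)
            result.append(full)
    return result


# Hypothetical paths, for illustration only
links = unique_absolute_links(["/news/a", "/news/b", "/news/a"])
print(len(links))
```

With the function from the answer, this could be applied as unique_absolute_links(get_article_links('https://www.scmp.com/topics/trade', 3)).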