I am trying to web scrape this web site, but the pages content changes when scroll and only 20 entries are shown.
As shown my code below, it only gives 20 enteries and if I scroll down before running the code, the 20 entries change.
I want to get the whole 896 entries all at once.
main = requests.get("https://www.sarbanes-oxley-forum.com/category/20/general-sarbanes-oxley- discussion/21")
soup = BeautifulSoup(main.content,"lxml")
main_dump = soup.find_all("h2",{"class":"title","component":{"topic/header"}})
for k in range(len(main_dump)):
main_titles.append(main_dump[k].find("a").text)
main_links.append("https://www.sarbanes-oxley-forum.com/" main_dump[k].find("a").attrs["href"])
print(len(main_links))
Output: 20
CodePudding user response:
It appears the website you are trying to scrape is using dynamic loading, which means that new content is loaded as the user scrolls down the page. This can make it difficult to retrieve all of the content at once using traditional web scraping methods (such as request
library).
One solution to your problem is to use a scraping tool that can interact with the website in the same way a user would, such as Selenium or Puppeteer if you want to use JavaScript.
Alternatively, you can inspect the website's network traffic to find the endpoint that is used to retrieve new content, usually using GET or POST request.
CodePudding user response:
You do not need BeautifulSoup
nor selenium
in this case, cause you could get all data structured from their api. Simply request the first page, check the number of topics and iterate over:
https://www.sarbanes-oxley-forum.com/api/category/20/general-sarbanes-oxley-discussion/1
Example
Note: Replace 894
with 1
to start over from first page, I just limited the number of requests for demo here starting from page 894
:
import requests
api_url = 'https://www.sarbanes-oxley-forum.com/api/category/20/general-sarbanes-oxley-discussion/'
data = []
for i in range(894,requests.get(api_url '1').json()['totalTopicCount'] 1):
for e in requests.get(api_url str(i)).json()['topics']:
data.append({
'title':e['title'],
'url': 'https://www.sarbanes-oxley-forum.com/topic/' e['slug']
})
data
Output
[{'title': 'We have a team of expert',
'url': 'https://www.sarbanes-oxley-forum.com/topic/8550/we-have-a-team-of-expert'},
{'title': 'What is the privacy in Google Nest Wifi?',
'url': 'https://www.sarbanes-oxley-forum.com/topic/8552/what-is-the-privacy-in-google-nest-wifi'},
{'title': 'Reporting Requirements _and_amp; Financial Results Release Timin 382',
'url': 'https://www.sarbanes-oxley-forum.com/topic/6214/reporting-requirements-_and_amp-financial-results-release-timin-382'},
{'title': 'Use of digital signatures instead of wet ink signatures on control documentation',
'url': 'https://www.sarbanes-oxley-forum.com/topic/8476/use-of-digital-signatures-instead-of-wet-ink-signatures-on-control-documentation'},...]