Is there a way to scrape the links from this webpage, https://hackmd.io/@nearly-learning/near-201, using BeautifulSoup, or is it only possible with Selenium?
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://hackmd.io/@nearly-learning/near-201'
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'lxml')  # also tried all the other parsers
links = bs.find_all('a')  # only finds 23 links, although the page has many more
for link in links:
    if 'href' in link.attrs:
        print(link.attrs['href'])
This returns only a few links, and none from the actual body of the article.
I am however able to do it with Selenium:
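from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 style: the driver path is passed through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://hackmd.io/@nearly-learning/near-201")
# find_elements(By.XPATH, ...) replaces the older find_elements_by_xpath
elems = driver.find_elements(By.XPATH, "//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))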
But I would like to use BeautifulSoup if possible. Is that feasible?
CodePudding user response:
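The static HTML of this page carries the article as raw markdown text inside the #doc element, and the browser turns it into links with JavaScript, which is why parsing the page directly finds so few <a> tags. If you don't want to use selenium, you can use the Markdown package to render the markdown text to HTML and parse it with BeautifulSoup: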
import markdown # pip install Markdown
import requests
from bs4 import BeautifulSoup
# 1. get raw markdown text
url = "https://hackmd.io/@nearly-learning/near-201"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
md_txt = soup.select_one("#doc").text
# 2. render the markdown to HTML
html = markdown.markdown(md_txt)
# 3. parse it again and find all <a> links
soup = BeautifulSoup(html, "html.parser")
for a in soup.select("a[href]"):
    print(a["href"])
Prints:
https://cdixon.org/2018/02/18/why-decentralization-matters
https://docs.near.org/docs/concepts/gas#ballpark-comparisons-to-ethereum
https://docs.near.org/docs/roles/integrator/exchange-integration#blocks-and-finality
https://docs.near.org/docs/concepts/architecture/papers
https://explorer.near.org/nodes/validators
https://explorer.near.org/stats
https://docs.near.org/docs/develop/contracts/rust/intro
https://docs.near.org/docs/develop/contracts/as/intro
https://docs.near.org/docs/api/rpc
...and so on.
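If only the link targets are needed, the render step could arguably be skipped and the URLs pulled straight from the raw markdown text. The following is a rough sketch of that alternative, not part of the answer above: the function name extract_markdown_links, the regular expression, and the example.com image URL are assumptions made for illustration. It only handles inline links of the form [text](url) and autolinks of the form <https://...>; reference-style links and URLs that contain parentheses are not covered, so the Markdown package remains the more reliable route.

```python
import re

# One pattern with two alternatives:
#   1. inline links  [text](url) or [text](url "title"), skipping images ![alt](url)
#   2. autolinks     <https://example.com>
LINK_RE = re.compile(
    r'(?<!!)\[[^\]]*\]\(\s*([^)\s]+)[^)]*\)'
    r'|<(https?://[^>\s]+)>'
)


def extract_markdown_links(md_text):
    """Return link targets in order of first appearance, without duplicates."""
    found = []
    for match in LINK_RE.finditer(md_text):
        url = match.group(1) or match.group(2)
        if url not in found:
            found.append(url)
    return found


if __name__ == "__main__":
    # Small hand-written sample standing in for the text of the #doc element.
    sample = (
        "Read [why decentralization matters]"
        "(https://cdixon.org/2018/02/18/why-decentralization-matters).\n"
        "RPC docs: <https://docs.near.org/docs/api/rpc>\n"
        "![diagram](https://example.com/img.png)\n"
    )
    for url in extract_markdown_links(sample):
        print(url)
```

In the answer's script, the md_txt variable would be the input, i.e. extract_markdown_links(md_txt).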
CodePudding user response:
As mentioned, the page needs selenium or something similar to render all the content. You can still combine selenium and BeautifulSoup if you prefer to select your elements that way. Just pass driver.page_source to BeautifulSoup():
soup = BeautifulSoup(driver.page_source, "html.parser")
Example
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://hackmd.io/@nearly-learning/near-201")
# parse the browser-rendered HTML, not the static response
soup = BeautifulSoup(driver.page_source, "html.parser")
for link in soup.select('a[href]'):
    print(link['href'])
driver.quit()
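One difference worth keeping in mind: Selenium's get_attribute("href") generally returns the URL already resolved by the browser, while BeautifulSoup's link['href'] returns the attribute exactly as written in the markup, which may be a relative path or a bare fragment. A minimal sketch of normalizing the BeautifulSoup output with the standard library follows; the helper name absolutize and the sample hrefs ("/features", "#Gas") are hypothetical and only serve as illustration.

```python
from urllib.parse import urljoin


def absolutize(hrefs, base_url):
    """Resolve hrefs against base_url, keeping only http(s) URLs, without duplicates."""
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href.strip())
        # Skip schemes such as mailto: or javascript:
        if not absolute.startswith(("http://", "https://")):
            continue
        if absolute not in result:
            result.append(absolute)
    return result


if __name__ == "__main__":
    base = "https://hackmd.io/@nearly-learning/near-201"
    # Hypothetical href values, as they might appear in the rendered markup.
    raw = ["/features", "#Gas", "https://docs.near.org/docs/api/rpc"]
    for url in absolutize(raw, base):
        print(url)
```

With the loop above, this would be called as absolutize([a['href'] for a in soup.select('a[href]')], driver.current_url).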