How to scrape this using bs4

Time:07-17

I need to get the element <a aria-label="Last Page" href="https://webtoon-tr.com/webtoon/page/122/">Son »</a> from this site: https://webtoon-tr.com/webtoon/

But when I try to scrape it with this code:

from bs4 import BeautifulSoup
import requests

url = "https://webtoon-tr.com/webtoon/"
html = requests.get(url).content
soup = BeautifulSoup(html,"html.parser")

last = soup.find_all("a",{"class":"last"})
print(last)

It just returns an empty list, and when I try to scrape all "a" tags it only returns 2, which are completely different things.

Can somebody help me with this? I would really appreciate it.

CodePudding user response:

Try using the requests_html library.

from bs4 import BeautifulSoup
import requests_html

url = "https://webtoon-tr.com/webtoon/"

s = requests_html.HTMLSession()

html = s.get(url)
soup = BeautifulSoup(html.content, "lxml")

last = soup.find_all("a", {"class": "last"})
print(last)

This prints:

[<a aria-label="Last Page" href="https://webtoon-tr.com/webtoon/page/122/">Son »</a>]
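
Once the tag is found, the page count can be read from its href (for a bs4 tag, something like last[0]["href"]). The following is only a sketch using the standard library; the helper name last_page_number is made up for illustration, and it assumes the page number is the final path segment, as in the URL shown above.

```python
from urllib.parse import urlparse


def last_page_number(href):
    """Return the trailing numeric path segment of a pagination URL, or None."""
    # Split the URL path into non-empty segments, e.g. ["webtoon", "page", "122"].
    segments = [s for s in urlparse(href).path.split("/") if s]
    if segments and segments[-1].isdigit():
        return int(segments[-1])
    return None


print(last_page_number("https://webtoon-tr.com/webtoon/page/122/"))  # 122
```

If the href has no numeric last segment, the helper returns None instead of raising, so the caller can decide how to handle a changed page layout.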

CodePudding user response:

The website is protected by Cloudflare. requests, cloudscraper and requests_html did not work for me; only Selenium did:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup


chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")

webdriver_service = Service("chromedriver/chromedriver")  # path to where you saved the chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

browser.get("https://webtoon-tr.com/webtoon/")
soup = BeautifulSoup(browser.page_source, 'html5lib')
browser.quit()
link = soup.select_one('a.last')
print(link)

This returns

<a aria-label="Last Page"  href="https://webtoon-tr.com/webtoon/page/122/">Son »</a>
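
To check whether a plain requests call got a Cloudflare block page instead of the real listing, it can help to look at the status code, the Server header and the body before parsing. The sketch below is a rough heuristic, not an official detection method: the function name looks_like_cloudflare_block is hypothetical, and the 403/503 status codes and the "Just a moment" text are assumptions about what such pages commonly look like.

```python
def looks_like_cloudflare_block(status_code, headers, body_text):
    """Heuristic guess: is this response a Cloudflare challenge or block page?"""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {key.lower(): str(value).lower() for key, value in headers.items()}
    served_by_cloudflare = "cloudflare" in lowered.get("server", "")
    # Assumption: blocked requests usually come back as 403 or 503.
    blocked_status = status_code in (403, 503)
    # Assumption: the interstitial page often contains this phrase.
    challenge_text = "just a moment" in body_text.lower()
    return served_by_cloudflare and (blocked_status or challenge_text)


# Intended use with requests (not executed here):
#   resp = requests.get("https://webtoon-tr.com/webtoon/")
#   print(looks_like_cloudflare_block(resp.status_code, resp.headers, resp.text))
```

If this returns True, the empty find_all result most likely comes from the block page rather than from a wrong selector, which is when switching to a real browser via Selenium makes sense.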