Why do I get HTTP Error 503 using urllib and BS4?

I am using BS4 to scrape the "Browse Standards by Technology" section from this website: https://standards.globalspec.com/

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://standards.globalspec.com/"
q1 = urlopen(url)
soup = BeautifulSoup(q1, 'lxml')
print(soup)

But I get an error: urllib.error.HTTPError: HTTP Error 503: Service Temporarily Unavailable

Can anyone see what might be causing this error?

CodePudding user response：

@Samt94 has already stated that the website is behind Cloudflare protection, so you can use cloudscraper instead of urllib:

from bs4 import BeautifulSoup
import cloudscraper
scraper = cloudscraper.create_scraper(delay=10, browser={'custom': 'ScraperBot/1.0'})
url = 'https://standards.globalspec.com/'
req = scraper.get(url)
print(req)
soup = BeautifulSoup(req.text,'lxml')

Output:

  <Response [200]>
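
To check whether the original 503 really comes from Cloudflare rather than from the site itself, you can inspect the headers attached to the HTTPError that urllib raises. This is only a minimal sketch: the helper names looks_like_cloudflare and describe_http_error are made up for illustration, and it assumes that Cloudflare-served responses carry a "Server: cloudflare" header or a "CF-RAY" header.

```python
from urllib.error import HTTPError


def looks_like_cloudflare(headers):
    """Return True if the headers carry typical Cloudflare markers.

    Assumption: Cloudflare sets "Server: cloudflare" and/or a "CF-RAY" header.
    """
    server = (headers.get("Server") or "").lower()
    return "cloudflare" in server or headers.get("CF-RAY") is not None


def describe_http_error(err):
    """Summarise an HTTPError: status code, reason, and a hint about its source."""
    if looks_like_cloudflare(err.headers):
        source = "Cloudflare"
    else:
        source = "origin server or another proxy"
    return f"HTTP {err.code} {err.reason} (likely from: {source})"


# Usage (performs a real network request, so it is left commented out):
# from urllib.request import urlopen
# try:
#     urlopen("https://standards.globalspec.com/")
# except HTTPError as err:
#     print(describe_http_error(err))
```

If the hint says Cloudflare, plain urllib is being challenged before the request ever reaches the site, which is why a Cloudflare-aware client is needed.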

CodePudding user response：

You can use cloudscraper to access websites that are behind Cloudflare DDoS protection:

from bs4 import BeautifulSoup
import cloudscraper

url = "https://standards.globalspec.com/"

scraper = cloudscraper.create_scraper()
q1 = scraper.get(url)
soup = BeautifulSoup(q1.text, 'lxml')
print(soup)
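
It may also be worth checking the status before handing the body to BeautifulSoup, so that a blocked request is not parsed as if it were the real page. A minimal sketch follows. The helper name fetch_html and the 30-second timeout are assumptions for illustration. It relies only on the requests-style get() and raise_for_status() methods, which a cloudscraper scraper provides because it is built on requests.Session.

```python
def fetch_html(session, url, timeout=30):
    """Fetch url with a requests-style session and return the body text.

    Raises the session's HTTP error on a 4xx/5xx status instead of
    returning the error page for parsing.
    """
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


# Usage (needs cloudscraper and bs4 installed and makes a real request,
# so it is left commented out):
# import cloudscraper
# from bs4 import BeautifulSoup
# html = fetch_html(cloudscraper.create_scraper(), "https://standards.globalspec.com/")
# soup = BeautifulSoup(html, "lxml")
```

Passing the session in as an argument keeps the helper independent of which client is used, so the same function works with a plain requests.Session as well.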