I was building a web scraper to pull hrefs from https://www.startengine.com/explore, but I was struggling to get any hrefs. I decided to print the webpage and figured out why.
Here is my code:
import pandas as pd
import os
import requests
from bs4 import BeautifulSoup
import re
URL = "https://www.startengine.com/explore"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")
links = []
print(soup)
This is the output:
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>
Can someone help me work around the "403 Forbidden"?
CodePudding user response:
You need to send a User-Agent header with your request, as follows:
import requests
from bs4 import BeautifulSoup

URL = "https://www.startengine.com/explore"
# Identifying as a browser gets past the server's bot filter
headers = {"User-Agent": "Mozilla/5.0"}
page = requests.get(URL, headers=headers)
print(page)  # expect <Response [200]> instead of the 403
soup = BeautifulSoup(page.text, "html.parser")
links = []
print(soup)
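Once the request succeeds, you can fill the links list with BeautifulSoup's find_all. Here is a minimal sketch using a stand-in HTML snippet (with the real page you would pass page.text instead); note that if startengine.com renders its listings with JavaScript, requests alone may still return HTML without the hrefs you want:

import requests
from bs4 import BeautifulSoup

# Stand-in HTML snippet; replace with page.text from the real response.
html = '<a href="/offering/foo">Foo</a><a name="anchor-only">Bar</a>'
soup = BeautifulSoup(html, "html.parser")

# href=True skips <a> tags that have no href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)  # → ['/offering/foo']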