I am very new to coding and am trying to write a practice script for webscraping in VS Code Editor. But every time i run the script i get this issue of there being no real output. Can you please advise on what the issue is? Note: the pink boxes are just covering my nameenter image description here
I tried running the code and expected webscraped data from the link. I have tried many different scripts and the same issue happens. So there must be something wrong with the whole system i think
CodePudding user response:
VSCode is an excellent IDE. When you start a new project (or open a folder in VSCode), it does not come with any build tools or compilers etc. You have to manually configure them. You have to set up the environment using different toolchains. Here are some instructions for Python
CodePudding user response:
This is not a problem with VSCode but I am going to answer your question.
You can't webscrape indeed.com with requests and beatiful soup because it has bot protection using cloudflare. If you take a closer look to the response it returns the 403 Forbidden status code instead of 200 OK. You can scrape using a headless browser using selenium.
Here's an example
First install selenium and webdriver_manager
pip install selenium webdriver_manager
from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# Make sure you are not detected as HeadlessChrome, some sites will refuse access
options = ChromeOptions()
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = Chrome(options=options, service=Service(
ChromeDriverManager().install()))
# Make sure you are not detected as HeadlessChrome, some sites will refuse access
ua = driver.execute_script("return navigator.userAgent").replace(
"HeadlessChrome", "Chrome")
driver.execute_cdp_cmd("Network.setUserAgentOverride", {
"userAgent": ua})
driver.execute_script(
"Object.defineProperty(navigator,'webdriver',{get:()=>undefined});")
driver.get("https://www.indeed.com/companies/best-Agriculture-companies")
main = driver.find_element(By.ID, "main")