I am working on a scraping project for a well-known e-commerce site. I don't want the browser window to be displayed. The usual solution is the "--headless" option, but the page I'm scraping does not allow headless mode. I also tried "--no-startup-window", and that doesn't seem to work either. Does anyone have an alternative solution?
Here is my code:
import random
from django.shortcuts import render
from bs4 import BeautifulSoup
from selenium import webdriver
# Selenium 4 with Chrome
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

def wlista(request):
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
    # Note: this dict is never passed to Selenium, so only the user agent below takes effect
    headers = {
        'User-Agent': user_agent,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'es-ES,es;q=0.8',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
    }
    opciones = webdriver.ChromeOptions()
    # The user agent has to be passed as a --user-agent switch, not as a bare string
    opciones.add_argument(f'--user-agent={user_agent}')
    #opciones.add_argument('--headless')
    #opciones.add_argument('--no-startup-window')
    # A second call with the same key overwrites the first, so both switches go in one list
    opciones.add_experimental_option('excludeSwitches', ['enable-automation', 'enable-logging'])
    opciones.add_experimental_option('useAutomationExtension', False)
    # chrome_options= is deprecated in Selenium 4; use options=
    DRIVER = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=opciones)
    DRIVER.get('https://www.walmart.com/search?q=lego toys')
    soup = BeautifulSoup(DRIVER.page_source, 'html.parser')
    rows = soup.find_all(attrs={"data-item-id": True})
    for items in rows:
        #do something
        pass
    # quit must be called; a bare DRIVER.quit does nothing
    DRIVER.quit()
    return render(request, "Proyectowebapp/listaprods.html", {
        #variables to pass
    })
Thanks for the help!
CodePudding user response:
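from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
# Pass the options to the driver so the flag takes effect
driver = webdriver.Chrome(options=options)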
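If the site rejects headless Chrome, as the question describes, one possible workaround is to start a normal (non-headless) window and place it outside the visible screen area. The following is only a sketch under assumptions, not a confirmed fix for this site: `build_hidden_window_args` and `make_driver` are hypothetical helper names, the coordinates and size are arbitrary, and whether an off-screen position is honored depends on the operating system and window manager. `--window-position` and `--window-size` are regular Chrome command-line switches.

```python
def build_hidden_window_args(user_agent, position=(-32000, -32000), size=(1280, 800)):
    """Return Chrome switches for a normal window placed off-screen (hypothetical helper)."""
    x, y = position
    width, height = size
    return [
        f'--user-agent={user_agent}',
        f'--window-position={x},{y}',   # assumed far enough off-screen to be invisible
        f'--window-size={width},{height}',
    ]


def make_driver(user_agent):
    """Create a non-headless Chrome driver using the switches above (hypothetical helper)."""
    # Imported lazily so build_hidden_window_args can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in build_hidden_window_args(user_agent):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Calling `make_driver(user_agent)` would then replace the `webdriver.Chrome(...)` line in the question. Selenium also provides `driver.minimize_window()`, which may be enough if a minimized window is acceptable.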
CodePudding user response:
Some websites don't allow running in headless mode.
I tried headless mode with the link you mentioned, 'https://www.walmart.com/search?q=lego toys', and printed the page title. It came back as 'Robot or human?'.
Without headless mode it printed the correct title, 'lego toys - Walmart.com'.
Another example is 'https://www.redbus.in/': when I printed the title in headless mode, it printed 'Access Denied'.
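The title check described above can be turned into a small guard, so the scraper notices when it has been served a block page instead of results. This is a minimal sketch: `looks_blocked` and `BLOCK_TITLES` are hypothetical names, and the two marker strings are just the titles observed above, so other sites will need their own markers.

```python
# Titles observed on block pages in headless mode (from the two examples above).
BLOCK_TITLES = ('robot or human?', 'access denied')


def looks_blocked(title):
    """Return True if the page title matches a known block-page title (hypothetical helper)."""
    normalized = (title or '').strip().lower()
    return any(marker in normalized for marker in BLOCK_TITLES)
```

With Selenium, the check would be `looks_blocked(driver.title)` right after `driver.get(...)`, before handing `driver.page_source` to BeautifulSoup.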