I am having a hard time browsing through the 448 consecutive pages of https://www.digitalwallonia.be/fr/cartographie/ with Selenium under Python in a robust manner. I have tried (too) many things without a satisfactory result, which makes it difficult to post relevant code.
I would like to see your solution. Apologies if the question is not appropriately formulated: I am a first-timer.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
browser = webdriver.Firefox()
browser.implicitly_wait(20)
browser.get('https://www.digitalwallonia.be/fr/cartographie')
browser.find_element("xpath",'//*[@id="axeptio_btn_acceptAll"]').click()
browser.find_element("xpath",'//*[@id="axeptio_btn_configure"]').click()
browser.find_element("xpath",'//*[@id="axeptio_btn_acceptAllAndNext"]').click()
WebDriverWait(browser, 1000).until(EC.element_to_be_clickable((By.CLASS_NAME,'next'))).click()
input('Press ENTER to close the automated browser')
browser.quit()
I get the following error: selenium.common.exceptions.ElementNotInteractableException: Message: Element could not be scrolled into view
CodePudding user response:
I would advise on several issues here:
- You should preferably use WebDriverWait, not implicitly_wait, since the latter waits for element presence only, while with WebDriverWait you can wait for more mature element states, i.e. visible, clickable and more.
- Don't mix WebDriverWait and implicitly_wait in the same file; it may cause problems.
- The next page buttons are at the bottom of the page, so you need to scroll down first and only then click the pager button.
- There is no need to set the timeout to more than 30 seconds.
The code below is working:
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument("start-maximized")
# Raw string so the backslashes in the Windows path are not treated as escape sequences
webdriver_service = Service(r'C:\webdrivers\chromedriver.exe')
driver = webdriver.Chrome(service=webdriver_service, options=options)
url = "https://www.digitalwallonia.be/fr/cartographie"
actions = ActionChains(driver)
wait = WebDriverWait(driver, 10)
driver.get(url)
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="axeptio_btn_acceptAll"]'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="axeptio_btn_configure"]'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="axeptio_btn_acceptAllAndNext"]'))).click()
driver.execute_script("window.scrollBy(0, arguments[0]);", 800)
time.sleep(0.5)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.next a'))).click()
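To go through all the pages rather than only the second one, the scroll-then-click step can be repeated in a loop. The following is a minimal sketch, not tested against the live site. The names click_through_pages, make_selenium_finder and find_next are hypothetical and introduced here only for illustration. It scrolls the link into view with scrollIntoView instead of the fixed window.scrollBy offset used above, and it assumes that the '.next a' link disappears or stops being clickable on the last page.

```python
def click_through_pages(driver, find_next, max_pages):
    """Click the pager's next link until it is gone or max_pages clicks are done.

    find_next is a caller-supplied callable that returns the clickable next
    link, or None when there is none. Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_pages:
        link = find_next()
        if link is None:
            break
        # Bring the link into the viewport before clicking it.
        driver.execute_script(
            "arguments[0].scrollIntoView({block: 'center'});", link
        )
        link.click()
        clicks += 1
    return clicks


def make_selenium_finder(driver, timeout=10):
    """Build a find_next callable backed by WebDriverWait (requires selenium)."""
    # Imported locally so the loop above can be exercised without a browser.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)

    def find_next():
        try:
            # Same locator as in the answer above.
            return wait.until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, ".next a"))
            )
        except TimeoutException:
            # Assumption: no clickable next link means the last page was reached.
            return None

    return find_next
```

With the driver from the code above, click_through_pages(driver, make_selenium_finder(driver), max_pages=447) would attempt the 447 clicks needed to get from page 1 to page 448. Keeping the Selenium-specific lookup in a separate callable is only a design choice to make the loop easy to try out with a stand-in driver.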
CodePudding user response:
Every time you click to go to the next page (the 'Suivant' button), the JavaScript in the page makes a POST request to an API endpoint, with headers and a payload. The headers, payload and API endpoint can be found in the browser's Dev tools, Network tab (filter on XHR calls). Hence, we can query that API URL directly with requests and avoid the overhead of Selenium/chromedriver. Below is a way of obtaining that data:
import requests
import pandas as pd
big_df = pd.DataFrame()
url = 'https://search.production.ribo.digitalwallonia.be/contentful-entries_production/_search/template'
headers = {
    'content-type': 'application/json',
    'Origin': 'https://www.digitalwallonia.be',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)
counter = 0
while True:
    # Ask for 100 results starting at offset `counter`
    payload = '{"id":"filter-profile-search-template-fr-v3","params":{"categoriesSlugList":[],"programsSlugList":[],"from":' + str(counter) + ',"regionsList":[],"size":100}}'
    r = s.post(url, data=payload)
    big_df = pd.concat([big_df, pd.json_normalize(r.json()['hits']['hits'])], axis=0, ignore_index=True)
    counter = counter + 100
    # The site shows 448 pages of 12 items each
    if counter > 448*12:
        break
print(big_df)
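As a variation on the loop above, the payload can be built with json.dumps instead of string concatenation, and the loop can also stop as soon as a page comes back empty. This is only a sketch under one assumption that is not verified here: that the endpoint returns an empty hits list once the offset is past the last result. build_payload, iter_hits and post_json are hypothetical names; post_json stands for any callable that sends the payload and returns the decoded JSON response.

```python
import json

TEMPLATE_ID = "filter-profile-search-template-fr-v3"


def build_payload(offset, size=100):
    """Return the JSON request body for one page of results."""
    return json.dumps({
        "id": TEMPLATE_ID,
        "params": {
            "categoriesSlugList": [],
            "programsSlugList": [],
            "from": offset,
            "regionsList": [],
            "size": size,
        },
    })


def iter_hits(post_json, size=100, max_items=448 * 12):
    """Yield hits page by page until a page is empty or max_items is reached."""
    for offset in range(0, max_items, size):
        hits = post_json(build_payload(offset, size))["hits"]["hits"]
        if not hits:
            # Assumption: an empty page means there is nothing left to fetch.
            break
        yield from hits
```

With the session and url from the code above, the hits could be collected with list(iter_hits(lambda p: s.post(url, data=p).json())) and then passed to pd.json_normalize in one go, which avoids calling pd.concat on every iteration.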
We are getting 100 items at once (the actual page is getting 12 at once). After a minute or so, you should have the following dataframe displayed in your terminal:
_index _type _id _score sort _source.sys.id _source.sys.contentType.sys.id _source.sys.updatedAt _source.fields.addresses.fr _source.fields.belgianEnterprisesNumbers.fr _source.fields.urlsWebSite.fr _source.fields.shortDescription.en _source.fields.shortDescription.fr _source.fields.logoAssetImage.fr.file.en.fileName _source.fields.logoAssetImage.fr.file.en.details.image.width _source.fields.logoAssetImage.fr.file.en.details.image.height _source.fields.logoAssetImage.fr.file.en.details.size _source.fields.logoAssetImage.fr.file.en.contentType _source.fields.logoAssetImage.fr.file.en.url _source.fields.logoAssetImage.fr.file.fr.fileName _source.fields.logoAssetImage.fr.file.fr.details.image.width _source.fields.logoAssetImage.fr.file.fr.details.image.height _source.fields.logoAssetImage.fr.file.fr.details.size _source.fields.logoAssetImage.fr.file.fr.contentType _source.fields.logoAssetImage.fr.file.fr.url _source.fields.logoAssetImage.fr.title.en _source.fields.logoAssetImage.fr.title.fr _source.fields.title.en _source.fields.title.fr _source.fields.slug.en _source.fields.slug.fr _source.fields.urlsSocialNetwork.fr _source.fields.shortTitle.en _source.fields.shortTitle.fr _source.fields.founders.fr _source.fields.mainNaceCode.fr _source.fields.staffing.fr _source.fields.logoAssetImage.fr _source.fields.partnersAdditionalDescriptions.fr _source.fields.incubators.fr
0 contentful-entries_productionv3 _doc 3O1t8sTHhj5ZGrmGKtHI6y None [ Dynamix JAVA] 3O1t8sTHhj5ZGrmGKtHI6y profile 2022-09-01T14:36:06.899Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.388591169708497, 'Lat': 50.7035958197085}, 'Northeast': {'Lng': 4.391289130291502, 'Lat': 50.7062937802915}}, 'coordinates': [4.3898572, 50.7050388], 'type': 'Point', 'Location': {'Lng': 4.3898572, 'Lat': 50.7050388}}, 'Metadata': {'PlaceId': 'ChIJOZeR297Rw0cR_y-bZPZvwzQ', 'AddressType': 'head office', 'Timestamp': '2022-08-29T13:55:32.180Z'}, 'FormattedAddress': 'Av. des Dauphins 17, 1410 Waterloo, Belgique', 'MainAddress': True}] [0715677777] [{'Metadata': {'Timestamp': '2022-08-29T15:58:45 02:00'}, 'URL': 'https://dynamix-it.be/'}] Consulting company specialised in JAVA, SAP, DotNet, and son one. Société de consultance spécialisée en JAVA, SAP, DotNet, etc. dynamix_java.png 160.0 160.0 15950.0 image/png //images.ctfassets.net/myqv2p4gx62v/3jrjoVohZ1ooo2VMkum0Ns/1e5bd1ac59dab0126baea85f9156b872/dynamix_java.png dynamix java.png 160.0 160.0 15950.0 image/png //images.ctfassets.net/myqv2p4gx62v/3jrjoVohZ1ooo2VMkum0Ns/8e23b45bf77a17026df43cd072d06a52/dynamix_java.png Dynamix Java Dynamix Java Dynamix JAVA Dynamix JAVA dynamix-java dynamix-java [{'Metadata': {'Timestamp': '2022-08-29T15:58:14 02:00'}, 'URL': 'https://www.facebook.com/DYNAMIXJAVASPRL'}, {'Metadata': {'Timestamp': '2022-08-29T15:58:27 02:00'}, 'URL': 'https://www.linkedin.com/company/dynamixjava/'}] NaN NaN NaN NaN NaN NaN NaN NaN
1 contentful-entries_productionv3 _doc 4D2kOg0t4iRD11fzJFaPc8 None [ Lan-Area ] 4D2kOg0t4iRD11fzJFaPc8 profile 2022-08-25T08:42:32.473Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.744188919708497, 'Lat': 50.3149442697085}, 'Northeast': {'Lng': 4.746886880291502, 'Lat': 50.3176422302915}}, 'coordinates': [4.745529299999999, 50.31632769999999], 'type': 'Point', 'Location': {'Lng': 4.745529299999999, 'Lat': 50.31632769999999}}, 'Metadata': {'PlaceId': 'ChIJm9XAKz6SwUcRs45ovYpmEpc', 'AddressType': 'head office', 'Timestamp': '2022-06-21T14:17:33.655Z'}, 'FormattedAddress': 'Rue d'Ermeton 14, 5537 Anhée, Belgique', 'MainAddress': True}] [0779822986] [{'Metadata': {'Timestamp': '2022-08-25T10:42:29 02:00'}, 'URL': 'https://www.lan-area.be/'}] Platform exclusively focused on local sports competition. Lan-Area has created a central calendar where all local events are announced and a Belgian community space where players can post their teams, courses and successes. Plateforme exclusivement tournée vers la compétition e-sportive locale . Lan-Area a créé un calendrier central où tous les évènements locaux sont annoncés et un espace communautaire belge où les joueurs peuvent afficher leurs équipes, parcours et succès. lan-Aera.jpg 450.0 250.0 21154.0 image/jpeg //images.ctfassets.net/myqv2p4gx62v/3Gg1nuukov4gaypTawIQs8/346ee9006b0b5e3e33d2fab6ce293a47/lan-Aera.jpg lan-Aera.jpg 450.0 250.0 21154.0 image/jpeg //images.ctfassets.net/myqv2p4gx62v/3Gg1nuukov4gaypTawIQs8/7f30ce6782073cf51d16c1f67ef5ee0d/lan-Aera.jpg lan-Aera Logo Lan-Aera Lan-Aera Lan-Area lan-aera lan-area [{'Metadata': {'Timestamp': '2022-06-21T15:06:34 02:00'}, 'URL': 'https://www.facebook.com/lanarea2020'}, {'Metadata': {'Timestamp': '2022-06-21T15:07:31 02:00'}, 'URL': 'https://twitter.com/LanArea5'}, {'Metadata': {'Timestamp': '2022-06-21T15:59:53 02:00'}, 'URL': 'https://www.twitch.tv/ladh_lanarea'}] NaN NaN NaN NaN NaN NaN NaN NaN
2 contentful-entries_productionv3 _doc 6sbdRDRWJXTTtbR1wycE52 None [1-formation.be] 6sbdRDRWJXTTtbR1wycE52 profile 2022-05-15T11:21:20.388Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.863200770107277, 'Lat': 50.46117977010727}, 'Northeast': {'Lng': 4.865900429892721, 'Lat': 50.46387942989271}}, 'coordinates': [4.864224099999999, 50.462539], 'type': 'Point', 'Location': {'Lng': 4.864224099999999, 'Lat': 50.462539}}, 'Metadata': {'PlaceId': 'ChIJa-SkInKZwUcRsc1Xs-GqwSE', 'AddressType': 'head office', 'Timestamp': '2022-05-07T18:43:01.598Z'}, 'FormattedAddress': 'Rue des Fossés Fleuris 42, 5000 Namur, Belgique', 'MainAddress': True}] [0891973792] [{'Metadata': {'Timestamp': '2022-05-07T18:43:01.459Z'}, 'URL': 'http://www.1-formation.be/'}] Training in IT following based on four subjects: office applications, web and image, web marketing and communication, personnel management and development. Formations en informatique suivant quatre thématiques: bureautique, web et image, webmarketing et communication, management et développement personnel. NaN NaN NaN NaN NaN NaN logo-f-1-formation.jpg 350.0 77.0 5569.0 image/jpeg //images.ctfassets.net/myqv2p4gx62v/7Itx3K16vYyGTHuYUD7TfW/7103d85dbce48d1c3a0535dac76df5c0/logo-f-1-formation.jpg NaN logo-f-1-formation.jpg 1-formation.be 1-formation.be 1-formationbe 1-formationbe [{'Metadata': {'Timestamp': '2022-05-07T18:43:01.459Z'}, 'URL': 'https://twitter.com/1formation_be'}, {'Metadata': {'Timestamp': '2022-05-07T18:43:01.459Z'}, 'URL': 'https://www.facebook.com/1formation'}] NaN NaN NaN NaN NaN NaN NaN NaN
3 contentful-entries_productionv3 _doc 4EuOqP1eQIeka5xHcoq5mQ None [1-position.be] 4EuOqP1eQIeka5xHcoq5mQ profile 2022-05-15T11:21:23.274Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.863200770107277, 'Lat': 50.46117977010727}, 'Northeast': {'Lng': 4.865900429892721, 'Lat': 50.46387942989271}}, 'coordinates': [4.864224099999999, 50.462539], 'type': 'Point', 'Location': {'Lng': 4.864224099999999, 'Lat': 50.462539}}, 'Metadata': {'PlaceId': 'ChIJa-SkInKZwUcRsc1Xs-GqwSE', 'AddressType': 'head office', 'Timestamp': '2022-05-07T18:51:39.745Z'}, 'FormattedAddress': 'Rue des Fossés Fleuris 42, 5000 Namur, Belgique', 'MainAddress': True}] [0891973792] [] Communications agency and IT training centre: website creation, professional SEO, the creation of Google Adwords campaigns, copywriting and web content, visual identity creation, communications consulting. Agence de communication et centre de formation informatique: création de sites web, référencement professionnel, création et gestion de campagnes Google AdWords, copywriting et écriture web, création d'identité visuelle, conseil en communication. NaN NaN NaN NaN NaN NaN marque-cp52u3ifgt9us27gak951f15p6-1369821343-position.png 169.0 129.0 11128.0 image/png //images.ctfassets.net/myqv2p4gx62v/2RMVJINCIXiF4O2hZIb6kx/c1aebc77207c1a5ae67af5ebd87b1dd3/marque-cp52u3ifgt9us27gak951f15p6-1369821343-position.png NaN marque-cp52u3ifgt9us27gak951f15p6-1369821343-position.png 1-position.be 1-position.be 1-positionbe 1-positionbe [{'Metadata': {'Timestamp': '2022-05-07T18:51:39.679Z'}, 'URL': 'https://twitter.com/1position'}, {'Metadata': {'Timestamp': '2022-05-07T18:51:39.679Z'}, 'URL': 'https://www.facebook.com/pages/1-positionbe/147447630063'}, {'Metadata': {'Timestamp': '2022-05-07T18:51:39.679Z'}, 'URL': 'https://www.linkedin.com/company/1-position.be'}] NaN NaN NaN NaN NaN NaN NaN NaN
4 contentful-entries_productionv3 _doc 1VvYEZncg0lEDL8RzGAvmE None [123 Automation Engineering & Development] 1VvYEZncg0lEDL8RzGAvmE profile 2022-05-15T05:25:51.214Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.456926070107278, 'Lat': 50.53833147010727}, 'Northeast': {'Lng': 4.459625729892722, 'Lat': 50.54103112989272}}, 'coordinates': [4.4582759, 50.5396813], 'type': 'Point', 'Location': {'Lng': 4.4582759, 'Lat': 50.5396813}}, 'Metadata': {'PlaceId': 'EjNSdWUgZGVzIEFydGlzYW5zIDQsIDYyMTAgTGVzIEJvbnMgVmlsbGVycywgQmVsZ2lxdWUiGhIYChQKEgn75Aq3dyzCRxFEh7hEj1NdPBAE', 'AddressType': 'head office', 'Timestamp': '2022-05-07T15:17:32.918Z'}, 'FormattedAddress': 'Rue des Artisans 4, 6210 Les Bons Villers, Belgique', 'MainAddress': True}] [0820888531] [{'Metadata': {'Timestamp': '2022-05-07T15:17:32.867Z'}, 'URL': 'http://www.123automation.be/'}] NaN Automation et robotique industrielle: étude, conception, développement, intégration et maintenance de solutions automatisées visant l’amélioration de la productivité dans les processus de fabrication quels qu’ils soient. NaN NaN NaN NaN NaN NaN 123automation.png 319.0 111.0 5802.0 image/png //images.ctfassets.net/myqv2p4gx62v/6uY3Y6EDfICh8wdp4XNK7Z/082273035f7a600ec34098b09ab4fee9/123automation.png NaN 123automation.png 123 Automation Engineering & Development 123 Automation Engineering & Development 123-automation 123-automation [] NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5360 contentful-entries_productionv3 _doc 1AbDfyZ4rHL18Bw6aiJKSA None [École Centrale des Arts et Métiers - HE Vinci] 1AbDfyZ4rHL18Bw6aiJKSA profile 2022-05-15T11:43:23.005Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.452325870107279, 'Lat': 50.84853592010727}, 'Northeast': {'Lng': 4.455025529892723, 'Lat': 50.85123557989272}}, 'coordinates': [4.4538028, 50.8499896], 'type': 'Point', 'Location': {'Lng': 4.4538028, 'Lat': 50.8499896}}, 'Metadata': {'PlaceId': 'ChIJwdgtpYbcw0cRfjW1nUhDNk8', 'AddressType': 'head office', 'Timestamp': '2022-05-07T15:44:19.720Z'}, 'FormattedAddress': 'Prom. de l'Alma 50, 1200 Woluwe-Saint-Lambert, Belgique', 'MainAddress': True}] [0459279954, 0409454123] [{'Metadata': {'Timestamp': '2022-05-07T15:44:19.660Z'}, 'URL': 'http://www.ecam.be/'}] NaN L'ECAM est un Institut Supérieur Industriel ayant pour objet la formation de Master en sciences industrielles dans une des spécialités suivantes: automatisation, construction, électromécanique, électronique, géomètre, informatique, business analyst (alternance). NaN NaN NaN NaN NaN NaN ecam.jpg 512.0 512.0 93657.0 image/jpeg //images.ctfassets.net/myqv2p4gx62v/4e2oSTcbXRABuyibUwgs95/4e5d8f540ccc67065a94eb528418ddd7/ecam.jpg NaN ecam.jpg École Centrale des Arts et Métiers - HE Vinci École Centrale des Arts et Métiers - HE Vinci ecole-centrale-des-arts-et-metiers ecole-centrale-des-arts-et-metiers [] ECAM ECAM NaN NaN NaN NaN NaN NaN
5361 contentful-entries_productionv3 _doc 5vp8xZpO6CucXtOmc1H8yR None [École communale fondamentale de Seneffe] 5vp8xZpO6CucXtOmc1H8yR profile 2022-05-15T09:12:19.246Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.252977370107278, 'Lat': 50.52898217010728}, 'Northeast': {'Lng': 4.255677029892722, 'Lat': 50.53168182989272}}, 'coordinates': [4.2543333, 50.5303456], 'type': 'Point', 'Location': {'Lng': 4.2543333, 'Lat': 50.5303456}}, 'Metadata': {'PlaceId': 'ChIJt1KItgg0wkcR6ekUYWMbdDg', 'AddressType': 'head office', 'Timestamp': '2022-05-07T18:58:11.863Z'}, 'FormattedAddress': 'Rue de Buisseret 19, 7180 Seneffe, Belgique', 'MainAddress': True}] NaN [] NaN Ecole fondamentale. NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN École communale fondamentale de Seneffe École communale fondamentale de Seneffe ecole-communale-de-seneffe ecole-communale-de-seneffe [] NaN NaN NaN NaN NaN NaN NaN NaN
[...]
This dataframe has 5365 rows × 40 columns. You can inspect the initial JSON response and dissect it further if you need more, less or other information from it.
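As one possible way of dissecting the response, each raw hit can be reduced to a few fields before building a dataframe. The sketch below assumes the nesting implied by the column names in the output above (for example _source.fields.title.fr and _source.fields.addresses.fr) and that a missing key corresponds to the NaN cells. summarize_hit is a hypothetical helper name.

```python
def summarize_hit(hit):
    """Reduce one raw hit to its id, French title, main address and first website."""
    fields = hit["_source"]["fields"]

    # Addresses are a list of dicts; prefer the one flagged as MainAddress.
    addresses = fields.get("addresses", {}).get("fr") or []
    main = next((a for a in addresses if a.get("MainAddress")), None)

    # Websites are a list of dicts with a 'URL' key; the list may be empty.
    sites = fields.get("urlsWebSite", {}).get("fr") or []

    return {
        "id": hit["_id"],
        "title": fields.get("title", {}).get("fr"),
        "address": main.get("FormattedAddress") if main else None,
        "website": sites[0].get("URL") if sites else None,
    }
```

Applied to the hits of a response, for example [summarize_hit(h) for h in r.json()['hits']['hits']], this gives a list of small dicts that pd.DataFrame accepts directly.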
Requests docs: https://requests.readthedocs.io/en/latest/
Pandas relevant documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html