I'm trying to do web scraping with Selenium but it returns an empty list

Time:12-27

I am quite new to Selenium and I need to list the name of each call that is in 'open for submission' status on the European Commission's Funding & Tenders site. I've read that some interactive sites require a different approach, but the site I'm trying to scrape doesn't seem interactive at all.

So I thought I could solve this with the very simple code block below, but whenever I run it, it returns an empty list. I can't work out where I went wrong or what I should do instead.

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

driver_service = Service(executable_path="/content/chromedriver.exe")

driver = webdriver.Chrome(service = driver_service)

driver.get("https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/topic-search;callCode=null;freeTextSearchKeyword=;matchWholeText=true;typeCodes=1,2;statusCodes=31094501,31094502;programmePeriod=2021 - 2027;programCcm2Id=43108390;programDivisionCode=null;focusAreaCode=null;destinationGroup=null;missionGroup=null;geographicalZonesCode=null;programmeDivisionProspect=null;startDateLte=null;startDateGte=null;crossCuttingPriorityCode=null;cpvCode=null;performanceOfDelivery=null;sortQuery=sortStatus;orderBy=asc;onlyTenders=false;topicListKey=topicSearchTablePageState")

driver.implicitly_wait(10)

# The desired texts are inside elements with the wordsToHighlight class.

elements = driver.find_elements(By.CLASS_NAME, 'wordsToHighlight')

print(elements)

CodePudding user response:

As an alternative, you can load the data directly from the site's API endpoint, without Selenium:

import json
import requests


api_url = "https://api.tech.ec.europa.eu/search-api/prod/rest/search"
params = {"apiKey": "SEDIA", "text": "***", "pageSize": "50", "pageNumber": "1"}

query = {
    "bool": {
        "must": [
            {"terms": {"type": ["1", "2"]}},
            {"terms": {"status": ["31094501", "31094502"]}},
            {"term": {"programmePeriod": "2021 - 2027"}},
            {"terms": {"frameworkProgramme": ["43108390"]}},
        ]
    }
}

languages = ["en"]

sort = {"field": "sortStatus", "order": "ASC"}

data = requests.post(
    api_url,
    params=params,
    files={
        "query": ("blob", json.dumps(query), "application/json"),
        "languages": ("blob", json.dumps(languages), "application/json"),
        "sort": ("blob", json.dumps(sort), "application/json"),
    },
).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for r in data["results"]:
    print(r["content"])

Prints:

ERC PROOF OF CONCEPT GRANTS
ERC CONSOLIDATOR GRANTS
More sustainable buildings with reduced embodied energy / carbon, high life-cycle performance and reduced life-cycle costs (Built4People)
Designs, materials and solutions to improve resilience, preparedness & responsiveness of the built environment for climate adaptation (Built4People)
Integrated wind farm control
Recycling end of life PV modules
Smart-grid ready and smart-network ready buildings, acting as active utility nodes (Built4People)
Efficient and circular artificial photosynthesis
Development of digital solutions for existing hydropower operation and maintenance

...
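The request above asks for a single page (`pageSize` 50, `pageNumber` 1), so a longer result set would need several requests. Below is a minimal sketch of one way to walk the pages. It assumes the API returns an empty `results` list once the page number runs past the last page, which is not confirmed here. The names `iter_results`, `fetch_page`, `max_pages`, `FAKE_PAGES` and `fake_fetch` are hypothetical, and the fake pages only stand in for real responses so the sketch runs offline.

```python
from typing import Callable, Dict, Iterator


def iter_results(fetch_page: Callable[[int], Dict], max_pages: int = 100) -> Iterator[Dict]:
    """Yield every result, requesting pages 1, 2, ... until one comes back empty."""
    for page_number in range(1, max_pages + 1):
        page = fetch_page(page_number)
        results = page.get("results") or []
        if not results:  # assumption: an empty page means there is no more data
            return
        yield from results


# Hypothetical stand-in for the real request, shaped like the response used above.
FAKE_PAGES = {
    1: {"results": [{"content": "ERC PROOF OF CONCEPT GRANTS"},
                    {"content": "ERC CONSOLIDATOR GRANTS"}]},
    2: {"results": [{"content": "Integrated wind farm control"}]},
}


def fake_fetch(page_number: int) -> Dict:
    """Return a canned page, or an empty page past the end."""
    return FAKE_PAGES.get(page_number, {"results": []})


titles = [r["content"] for r in iter_results(fake_fetch)]
print(titles)
```

With the real endpoint, `fetch_page` would wrap the `requests.post(...)` call shown earlier and set `params["pageNumber"]` to `str(page_number)` before each request. The `max_pages` cap is only a guard against looping forever if the end-of-data assumption turns out to be wrong.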

CodePudding user response:

You can try using one of Selenium's built-in wait mechanisms to ensure that the elements are present on the page before you try to access them. One option is to increase the implicit wait timeout to 20 seconds:

driver.implicitly_wait(20) # Increase the timeout to 20 seconds

elements = driver.find_elements(By.CLASS_NAME, 'wordsToHighlight')

print([element.text for element in elements]) # Print the text of each element, not the WebElement objects

Another option is to use the WebDriverWait class to wait for the elements to be present on the page before accessing them:

from selenium.webdriver.common.by import By

from selenium.webdriver.support.ui import WebDriverWait

from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 20) # Set the timeout to 20 seconds

elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'wordsToHighlight')))

print([element.text for element in elements]) # Print the text of each element, not the WebElement objects
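To make the idea behind an explicit wait concrete, here is a rough pure-Python sketch of the polling pattern: call a condition repeatedly until it returns something truthy or a deadline passes. This is not Selenium's actual implementation, and `wait_until`, `poll_interval` and `find_elements_stub` are hypothetical names used only for illustration. The stub simulates a page whose elements show up only on the third check.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T], timeout: float = 20.0,
               poll_interval: float = 0.5) -> T:
    """Call `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stub: the "page" has no matching elements for the first two checks.
attempts = {"count": 0}


def find_elements_stub():
    attempts["count"] += 1
    return ["ERC PROOF OF CONCEPT GRANTS"] if attempts["count"] >= 3 else []


elements = wait_until(find_elements_stub, timeout=2.0, poll_interval=0.01)
print(elements)
```

The point of the pattern is that an empty result is treated as "not ready yet" rather than as the final answer, which is the difference between the wait-based code above and a single immediate lookup.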
