The University of Mainz is offering biographical data in XML format on early modern professors of the institution via an API at the following URL:
http://gutenberg-biographics.ub.uni-mainz.de/api/items/persons/
I am trying to access each of the files linked here to save them locally for further data analysis.
For this purpose, I am using Selenium in Python 3. Here are the first few lines of code:
# Starting SELENIUM for web automation
import selenium
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys
import codecs
# open new browser session
driver = webdriver.Chrome(executable_path='C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe')
driver.maximize_window()
# Navigate to the application home page
driver.get("http://gutenberg-biographics.ub.uni-mainz.de/api/items/persons/")
My problem is that driver.get() does not take me to the indicated URL. Chrome is correctly initialised and the window is maximised as requested, but then nothing else happens. There is no error notification; the script keeps running without doing anything else.
I read in some posts that a mismatch between the Selenium version and the Chrome browser version can be the cause, so I have updated all my packages in Anaconda (including Selenium) to the latest versions.
Unfortunately, this did not fix my issue. Can anyone help?
CodePudding user response:
The URL given in the question returns an XML document. That document contains 'resource' tags, each of which has an 'href' attribute. Those links (URLs) also return XML documents, each with a TEI element at its root.
Each TEI document contains information about one person and includes (for example) an element called 'reg' which holds the person's surname and forename combined.
So here's an example of how you could get all of the persons' names:
import requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor

# Build a namespace-agnostic XPath expression for a given element name
def local_name(e):
    return f"//*[local-name()='{e}']"

# Fetch one TEI document and return the text of its 'reg' element (or 'n/a')
def process(url):
    try:
        (r := requests.get(url)).raise_for_status()
        root = etree.fromstring(r.content)
        if (reg := root.xpath(local_name('reg'))):
            return reg[0].text
    except Exception:
        pass
    return 'n/a'

def main():
    # Get the list of persons and collect the 'href' of every 'resource' element
    (r := requests.get('http://gutenberg-biographics.ub.uni-mainz.de/api/items/persons/')).raise_for_status()
    root = etree.fromstring(r.content)
    urls = [resource.attrib['href'] for resource in root.xpath(local_name('resource'))]
    # Fetch the individual documents concurrently and print all names
    with ThreadPoolExecutor() as executor:
        print([t for t in executor.map(process, urls)])

if __name__ == '__main__':
    main()
Note:
There are over 1,000 resources in the main XML document, so sequential processing would take a very long time. I've introduced multithreading to the code for better performance. Even so, this runs for ~50s on my machine.
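If you also want to save each TEI document locally (as the question intended) rather than just collect the names, the same multithreaded approach can be adapted. This is only a minimal sketch: the output directory tei_files and the numbering scheme are assumptions, not part of the API.
import os
import requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor

API = 'http://gutenberg-biographics.ub.uni-mainz.de/api/items/persons/'
OUT_DIR = 'tei_files'  # assumed output directory

def local_name(e):
    return f"//*[local-name()='{e}']"

def save(index_and_url):
    # Fetch one TEI document and write it to disk, numbered by its position in the list
    index, url = index_and_url
    try:
        (r := requests.get(url)).raise_for_status()
        with open(os.path.join(OUT_DIR, f'{index}.xml'), 'w', encoding='utf-8') as f:
            f.write(r.text)
    except Exception:
        print('failed:', url)

def main():
    os.makedirs(OUT_DIR, exist_ok=True)
    (r := requests.get(API)).raise_for_status()
    root = etree.fromstring(r.content)
    urls = [resource.attrib['href'] for resource in root.xpath(local_name('resource'))]
    with ThreadPoolExecutor() as executor:
        list(executor.map(save, enumerate(urls)))

if __name__ == '__main__':
    main()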
CodePudding user response:
Thanks for all your suggestions!
I figured out that, first of all, my chromedriver.exe was not up to date.
But I also took up the suggestion to use requests instead of selenium and came up with the following script to download all linked XML files:
# Script to scrape XML files from Gutenberg Biographics API
import os
import requests
from bs4 import BeautifulSoup

# URL to be called
gutenberg_url = "http://gutenberg-biographics.ub.uni-mainz.de/api/items/persons/"

# Function to fetch the document behind a given URL
# adapted from https://www.geeksforgeeks.org/beautifulsoup-scraping-link-from-html/
def getHTMLdocument(url):
    # request the document at the given url
    response1 = requests.get(url)
    # the response body is an XML document, returned here as text
    return response1.text

# Fetch the list of persons from the API
html_document = getHTMLdocument(gutenberg_url)
soup = BeautifulSoup(html_document, 'xml')

# Find links for all XML files
links = soup.find_all('resource')
no_links = len(links)

# traverse the list, get the individual XML URLs and number the files with a counter
for counter, lnk in enumerate(links):
    print(counter)
    print(lnk)
    l = lnk.get("href")
    print(l)
    # use the href URL to access the individual XML file
    response2 = requests.get(l)
    outfile = response2.text
    # save each XML file to the local drive
    with open(os.path.join("C:\\Users\\#####\\ProfAPI", str(counter) + '.xml'), 'w', encoding="utf-8") as f:
        f.write(outfile)
    print("File no.", counter, "downloaded!")

print("Done,", no_links, "files downloaded")
Extracting the data I need is then carried out by iterating through my local directory, which seemed to be the better solution.
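For reference, a minimal sketch of that second step, assuming the files sit in the directory used above (the ##### placeholder kept as-is) and that you want, for example, the 'reg' element mentioned in the other answer; both the path and the element choice are assumptions:
import os
from lxml import etree

DATA_DIR = "C:\\Users\\#####\\ProfAPI"  # assumed: the same directory the files were saved to

for filename in sorted(os.listdir(DATA_DIR)):
    if not filename.endswith('.xml'):
        continue
    tree = etree.parse(os.path.join(DATA_DIR, filename))
    # namespace-agnostic lookup of the 'reg' element (combined surname and forename)
    reg = tree.xpath("//*[local-name()='reg']")
    name = reg[0].text if reg else 'n/a'
    print(filename, name)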