The University of Mainz is offering biographical data in XML format on early modern professors of the institution via an API at the following URL:
http://gutenberg-biographics.ub.uni-mainz.de/api/items/persons/
I am trying to access each of the files linked here to save them locally for further data analysis.
For this purpose, I am using Selenium in Python 3. Here are the first few lines of code:
# Starting SELENIUM for web automation
import selenium
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys
import codecs
# open new browser session
driver = webdriver.Chrome(executable_path='C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe')
driver.maximize_window()
# Navigate to the application home page
driver.get("http://gutenberg-biographics.ub.uni-mainz.de/api/items/persons/")
My problem is that driver.get() does not take me to the indicated URL. Chrome is correctly initialised and the window is maximised as requested, but then nothing else happens. There is no error notification; the script keeps running without doing anything else.
I read in some posts that a mismatch between the Selenium version and the Chrome browser version can be the cause, so I have updated all my packages in Anaconda (including Selenium) to the latest versions.
Unfortunately, this did not fix my issue. Can anyone help?
CodePudding user response:
The URL given in the question returns an XML document. That document contains 'resource' tags, each of which has an 'href' attribute. Those links (URLs) also return XML documents, each with a TEI element at its root.
Each TEI document contains information about one person and includes (for example) an element called 'reg' which holds the person's surname and forename combined.
So here's an example of how you could get all of the persons' names:
import requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor

# Build a namespace-agnostic XPath expression for a given element name
def local_name(e):
    return f"//*[local-name()='{e}']"

# Fetch one TEI document and return the text of its 'reg' element (or 'n/a')
def process(url):
    try:
        (r := requests.get(url)).raise_for_status()
        root = etree.fromstring(r.content)
        if (reg := root.xpath(local_name('reg'))):
            return reg[0].text
    except Exception:
        pass
    return 'n/a'

def main():
    # Get the list of persons and collect the 'href' of every 'resource' element
    (r := requests.get('http://gutenberg-biographics.ub.uni-mainz.de/api/items/persons/')).raise_for_status()
    root = etree.fromstring(r.content)
    urls = [resource.attrib['href'] for resource in root.xpath(local_name('resource'))]
    # Fetch the individual documents concurrently and print all names
    with ThreadPoolExecutor() as executor:
        print([t for t in executor.map(process, urls)])

if __name__ == '__main__':
    main()
Note:
There are over 1,000 resources in the main XML document, so sequential processing would take a very long time. I've introduced multithreading to the code for better performance. Even so, this runs for ~50s on my machine.
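If you also want to save each TEI document locally (as the question intended) rather than just collect the names, the same multithreaded approach can be adapted. This is only a minimal sketch: the output directory tei_files and the numbering scheme are assumptions, not part of the API.
import os
import requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor

API = 'http://gutenberg-biographics.ub.uni-mainz.de/api/items/persons/'
OUT_DIR = 'tei_files'  # assumed output directory

def local_name(e):
    return f"//*[local-name()='{e}']"

def save(index_and_url):
    # Fetch one TEI document and write it to disk, numbered by its position in the list
    index, url = index_and_url
    try:
        (r := requests.get(url)).raise_for_status()
        with open(os.path.join(OUT_DIR, f'{index}.xml'), 'w', encoding='utf-8') as f:
            f.write(r.text)
    except Exception:
        print('failed:', url)

def main():
    os.makedirs(OUT_DIR, exist_ok=True)
    (r := requests.get(API)).raise_for_status()
    root = etree.fromstring(r.content)
    urls = [resource.attrib['href'] for resource in root.xpath(local_name('resource'))]
    with ThreadPoolExecutor() as executor:
        list(executor.map(save, enumerate(urls)))

if __name__ == '__main__':
    main()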
CodePudding user response:
Thanks for all your suggestions!
I figured out that, first of all, my chromedriver.exe was not up to date.
But I also took up the suggestion to use requests instead of selenium and came up with the following script to download all linked XML files:
# Script to scrape XML files from Gutenberg Biographics API
import os
import requests
from bs4 import BeautifulSoup

# URL to be called
gutenberg_url = "http://gutenberg-biographics.ub.uni-mainz.de/api/items/persons/"

# Function to fetch the document behind a given URL
# adapted from https://www.geeksforgeeks.org/beautifulsoup-scraping-link-from-html/
def getHTMLdocument(url):
    # request the document at the given url
    response1 = requests.get(url)
    # the response body is an XML document, returned here as text
    return response1.text

# Fetch the list of persons from the API
html_document = getHTMLdocument(gutenberg_url)
soup = BeautifulSoup(html_document, 'xml')

# Find links for all XML files
links = soup.find_all('resource')
no_links = len(links)

# traverse the list, get the individual XML URLs and number the files with a counter
for counter, lnk in enumerate(links):
    print(counter)
    print(lnk)
    l = lnk.get("href")
    print(l)
    # use the href URL to access the individual XML file
    response2 = requests.get(l)
    outfile = response2.text
    # save each XML file to the local drive
    with open(os.path.join("C:\\Users\\#####\\ProfAPI", str(counter) + '.xml'), 'w', encoding="utf-8") as f:
        f.write(outfile)
    print("File no.", counter, "downloaded!")

print("Done,", no_links, "files downloaded")
Extracting the data I need is then carried out by iterating through my local directory, which seemed to be the better solution.
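For reference, a minimal sketch of that second step, assuming the files sit in the directory used above (the ##### placeholder kept as-is) and that you want, for example, the 'reg' element mentioned in the other answer; both the path and the element choice are assumptions:
import os
from lxml import etree

DATA_DIR = "C:\\Users\\#####\\ProfAPI"  # assumed: the same directory the files were saved to

for filename in sorted(os.listdir(DATA_DIR)):
    if not filename.endswith('.xml'):
        continue
    tree = etree.parse(os.path.join(DATA_DIR, filename))
    # namespace-agnostic lookup of the 'reg' element (combined surname and forename)
    reg = tree.xpath("//*[local-name()='reg']")
    name = reg[0].text if reg else 'n/a'
    print(filename, name)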