How to use Selenium Python to get a field's information from each linked page

Time:01-18

The context is SpringerLink, for example this series of books.

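Method 1 (faster): get E-ISBN from the links on the series page

Each book listed on the series page links to a URL that ends with the 13 digits of its ISBN, so we can build the E-ISBN codes directly from those URLs, without loading a new page for each book: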

import requests
from bs4 import BeautifulSoup

url = 'https://www.springer.com/series/136/books'  # assumed to be the same series page as in Method 2
soup = BeautifulSoup(requests.get(url).text, "html.parser")
titles = [title.text.strip() for title in soup.select('.c-card__title')]
EISBN = []
for a in soup.select('ul:last-child .c-meta__item:last-child a'):
    # a['href'] is something like https://www.springer.com/book/9783031256325
    c = a['href'].split('/')[-1]
    # insert four '-' into 9783031256325 to get the E-ISBN 978-3-031-25632-5
    EISBN.append(f'{c[:3]}-{c[3]}-{c[4:7]}-{c[7:12]}-{c[-1]}')

for code, title in zip(EISBN, titles):
    print(code, title)

Output

978-3-031-25632-5 Random Walks on Infinite Groups
978-3-031-19707-9 Drinfeld Modules
978-3-031-13379-4 Partial Differential Equations
978-3-031-00943-3 Stationary Processes and Discrete Parameter Markov Processes
978-3-031-14205-5 Measure Theory, Probability, and Stochastic Processes
978-3-030-56694-4 Quaternion Algebras
978-3-030-73839-6 Mathematical Logic
978-3-030-71250-1 Lessons in Enumerative Combinatorics
978-3-030-35118-2 Basic Representation Theory of Algebras
978-3-030-59242-4 Ergodic Dynamics
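
A note on the hyphenation step: the fixed 3-1-3-5-1 split matches every code in the output above, which all start with 978-3-030 or 978-3-031. ISBNs with a registrant part of a different length are hyphenated differently, so treat the split as an assumption about this series. Below is a minimal sketch of the same step as a reusable function, with a standard ISBN-13 check-digit test added as a sanity check. The names hyphenate_isbn13 and has_valid_check_digit are hypothetical helpers, not part of any library:

```python
def hyphenate_isbn13(digits: str) -> str:
    """Format 13 ISBN digits as 978-3-031-25632-5.

    Assumes the 3-1-3-5-1 grouping seen in this series; other
    publishers or prefixes may need a different split.
    """
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"expected 13 digits, got {digits!r}")
    return f"{digits[:3]}-{digits[3]}-{digits[4:7]}-{digits[7:12]}-{digits[12]}"


def has_valid_check_digit(digits: str) -> bool:
    """ISBN-13 rule: digits weighted 1, 3, 1, 3, ... must sum to a multiple of 10."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


href = "https://www.springer.com/book/9783031256325"
code = href.rstrip("/").split("/")[-1]
print(hyphenate_isbn13(code), has_valid_check_digit(code))
# prints: 978-3-031-25632-5 True
```

If the last URL segment is ever not a 13-digit number (for example after a site redesign), the function raises a ValueError instead of silently producing a malformed code.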

Method 2 (slower): get E-ISBN by loading a page for each book

This method loads the details page of each book and extracts the E-ISBN code from there:

import requests
from bs4 import BeautifulSoup

url = 'https://www.springer.com/series/136/books'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
books = soup.select('a[data-track-label^="article"]')
titles, EISBN = [], []

for book in books:
    titles.append(book.text.strip())
    soup_book = BeautifulSoup(requests.get(book['href']).text, "html.parser")
    EISBN.append( soup_book.select('p:has(span[data-test=electronic_isbn_publication_date]) .c-bibliographic-information__value')[0].text )

for code, title in zip(EISBN, titles):
    print(code, title)

In case you are wondering, p:has(span[data-test=electronic_isbn_publication_date]) selects the p element that contains a span with the attribute data-test=electronic_isbn_publication_date.
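
To see the :has() selector in isolation, here is a small sketch that runs it against an inline HTML snippet instead of a live page. The markup is a simplified, made-up stand-in for the Springer details page (only the class name and the data-test value come from the code above), and extract_eisbn is a hypothetical helper that returns None when the field is missing rather than raising an IndexError:

```python
from typing import Optional

from bs4 import BeautifulSoup

SELECTOR = (
    "p:has(span[data-test=electronic_isbn_publication_date]) "
    ".c-bibliographic-information__value"
)


def extract_eisbn(html: str) -> Optional[str]:
    """Return the text of the E-ISBN value element, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(SELECTOR)
    return node.get_text(strip=True) if node is not None else None


# Hypothetical, simplified markup; the real page is more complex.
sample = """
<div>
  <p>
    <span data-test="softcover_isbn_publication_date">Softcover ISBN</span>
    <span class="c-bibliographic-information__value">placeholder-softcover</span>
  </p>
  <p>
    <span data-test="electronic_isbn_publication_date">eBook ISBN</span>
    <span class="c-bibliographic-information__value">978-3-031-25632-5</span>
  </p>
</div>
"""

print(extract_eisbn(sample))
# prints: 978-3-031-25632-5
```

Only the second p matches, because it is the only one containing a span with the requested data-test value; the value element in the first p is ignored.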
