I'm trying to scrape multiple elements from a website, but the layout is unfortunately not very friendly. This is the link. What I would like is to scrape the link, name and last-updated fields of each dataset and store them in a dictionary, which will later be written to a JSON file. This is an example of the page source for a single dataset:
<article role="article" >
  <a href="/natural-spaces/dataset.jsp?code=DMG" rel="bookmark">
    <div >
      <div >
        <h3 ><span>Deer Management Group boundaries</span></h3>
        <div >
          <p style="font-weight:bold;">Last update: 2021-11-25</p>
          Voluntary Deer Management Groups (DMGs) exist across most of Scotland’s red deer range. The memberships of these groups comprise representatives from landholdings within the group’s area. The diver...
        </div>
      </div>
    </div>
  </a>
</article>
So how do I extract the three properties mentioned above when they are nested so deeply inside each other? So far, ignoring all the imports, my code looks like this:
cat_link = 'https://cagmap.snh.gov.uk/natural-spaces/category.jsp?code=ad'
driver.get(cat_link)
datasets = driver.find_elements(By.XPATH, '//*[@id="content0"]/div/section[2]/div/div/article/a')
for dataset in datasets:
    dataset_link = dataset.get_attribute('href')
    dataset_title = dataset.get_attribute('h3')
    dataset_last_updated = dataset.get_attribute('p')
When I print the three values, I only get a result for dataset_link, while dataset_title and dataset_last_updated return None.
Your help is highly appreciated.
CodePudding user response:
href, class, id, etc. are attributes of a particular tag, but h3 and p are not attributes; they are tags. That is why get_attribute('h3') and get_attribute('p') return None.
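for dataset in datasets:
    dataset_link = dataset.get_attribute('href')
    # the leading "." keeps the search inside the current dataset element;
    # a bare "//h3" would search the whole page and always match the first dataset
    dataset_title = dataset.find_element(By.XPATH, ".//h3[@class='c-teaser__header']/span").text
    dataset_last_updated = dataset.find_element(By.XPATH, ".//div[@class='c-teaser__text']/p").text
You can use find_element (or find_elements) with By.XPATH, By.ID or By.CLASS_NAME to get the relevant tags. The older find_element_by_xpath-style helpers do the same thing but are no longer available in recent Selenium 4 releases.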
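To see the attribute/tag distinction without starting a browser, here is a minimal sketch that parses a trimmed copy of the markup from the question with the standard library's xml.etree.ElementTree instead of Selenium. This is a swapped-in parser used only for illustration; the SAMPLE string and the variable names are assumptions, and the description text has been left out of the sample.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <article> markup from the question (description text omitted).
SAMPLE = """
<article role="article">
  <a href="/natural-spaces/dataset.jsp?code=DMG" rel="bookmark">
    <div><div>
      <h3><span>Deer Management Group boundaries</span></h3>
      <div><p style="font-weight:bold;">Last update: 2021-11-25</p></div>
    </div></div>
  </a>
</article>
"""

article = ET.fromstring(SAMPLE)
a_tag = article.find("a")

# href is an attribute of <a>, so an attribute lookup finds it.
link = a_tag.get("href")
# h3 is not an attribute of <a>, so the same kind of lookup gives None.
missing = a_tag.get("h3")
# h3 and p are descendant elements: locate them first, then read their text.
title = a_tag.find(".//h3/span").text
last_updated = a_tag.find(".//p").text

print(link, missing, title, last_updated, sep="\n")
```

The same pattern carries over to Selenium: get_attribute for href, and a relative element search followed by .text for the heading and the date.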
CodePudding user response:
Since the article tag seems to be a suitable element to represent each individual dataset section, we can use it to search within the expected XPath location.
For the relevant part of the document, the root is the enclosing div:
dataset_root = driver.find_element(By.XPATH,'//*[@id="content0"]/div/section[2]/div/div')
We can now search for all elements with the tag name article under the identified root, as below:
articles = dataset_root.find_elements(By.TAG_NAME,"article")
Once we have the relevant elements, iterate through each article element, find its a tag as well as all descendants of the article (the XPath .//* matches every nested element, not only direct children), and display the required details.
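Positional lookups like this depend on the order in which .//* returns the nested elements. As a rough check, the sketch below lists that order for a trimmed copy of the question's markup, again using the standard library's xml.etree.ElementTree rather than Selenium; the SAMPLE string and variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of one <article> from the question (description text omitted).
SAMPLE = (
    "<article><a href='/natural-spaces/dataset.jsp?code=DMG'><div><div>"
    "<h3><span>Deer Management Group boundaries</span></h3>"
    "<div><p>Last update: 2021-11-25</p></div>"
    "</div></div></a></article>"
)

article = ET.fromstring(SAMPLE)
# All descendants of <article>, in document order.
descendants = article.findall(".//*")
tags = [element.tag for element in descendants]

for index, tag in enumerate(tags):
    print(index, tag)

# Index 3 is the <h3>; its visible text lives in the nested <span>.
header_text = "".join(descendants[3].itertext())
# Index 6 is the <p> holding the "Last update" line.
updated_text = descendants[6].text
```

For this markup the h3 sits at index 3 and the p at index 6. If the site ever adds or removes a wrapper element, those positions shift, so searching by tag name is the more robust choice.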
Here is the full code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://cagmap.snh.gov.uk/natural-spaces/category.jsp?code=ad')
dataset_root = driver.find_element(By.XPATH, '//*[@id="content0"]/div/section[2]/div/div')
# find all article elements
articles = dataset_root.find_elements(By.TAG_NAME, "article")
for article in articles:
    # find_elements returns an empty list (instead of raising) when there is no "a" tag
    if a_tags := article.find_elements(By.TAG_NAME, 'a'):
        print(a_tags[0].get_property('href'))
    # all descendants of the article, in document order
    all_children = article.find_elements(By.XPATH, ".//*")
    # display header (the h3 element)
    print(all_children[3].text)
    # display "last updated" (the p element)
    print(all_children[6].text)
    print('*'*25)
driver.quit()
Here is the complete output from above:
https://cagmap.snh.gov.uk/natural-spaces/dataset.jsp?code=DMG
Deer Management Group boundaries
Last update: 2021-11-25
*************************
https://cagmap.snh.gov.uk/natural-spaces/dataset.jsp?code=NCO
Nature Conservation Order
Last update: 2022-07-28
*************************
https://cagmap.snh.gov.uk/natural-spaces/dataset.jsp?code=SNHAREAS
SNH Area Boundaries (after April 2011)
Last update: 2014-05-01
*************************
https://cagmap.snh.gov.uk/natural-spaces/dataset.jsp?code=SNHLAND
SNH Owned Rural Land
Last update: 2022-07-07
*************************
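The question also asks for a dictionary that can later be written to JSON. A minimal sketch of that last step follows, assuming the three values per dataset have already been collected. The scraped list reuses two entries from the output above; the to_record helper, the choice to key the dictionary by dataset title, and the field names are all hypothetical.

```python
import json

# Two entries copied from the output above; in the real script these
# tuples would be collected inside the loop over the article elements.
scraped = [
    ("https://cagmap.snh.gov.uk/natural-spaces/dataset.jsp?code=DMG",
     "Deer Management Group boundaries", "Last update: 2021-11-25"),
    ("https://cagmap.snh.gov.uk/natural-spaces/dataset.jsp?code=NCO",
     "Nature Conservation Order", "Last update: 2022-07-28"),
]

PREFIX = "Last update: "


def to_record(link, last_updated_text):
    """Build one record, dropping the 'Last update: ' prefix when present."""
    date = last_updated_text
    if date.startswith(PREFIX):
        date = date[len(PREFIX):]
    return {"link": link, "last_updated": date}


# Keyed by dataset title (an assumption; a list of records works just as well).
datasets = {title: to_record(link, text) for link, title, text in scraped}

json_text = json.dumps(datasets, indent=2)
print(json_text)
```

To write a file instead of a string, json.dump(datasets, file_object, indent=2) takes the same dictionary.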