The BeautifulSoup equivalent I am trying to accomplish is:
from bs4 import BeautifulSoup as soup  # assuming soup is the usual alias for BeautifulSoup
page_soup = soup(page_html)
tags = {tag.name for tag in page_soup.find_all()}
tags
How do I do this using Selenium? I'm just trying to print the unique tags used by a website without reading through the entire HTML source, so that I can start analysing it and scrape specific parts of the site. I don't care what the content of the tags is at this point; I just want to know which tags are used.
Here is one approach I've stumbled upon, but I'm not sure whether there is a more elegant way of doing it:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import pandas as pd
website = 'https://www.afr.com'
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(website)
el = driver.find_elements(by=By.CSS_SELECTOR, value='*')
tag_list = []
for e in el:
    tag_list.append(e.tag_name)
tag_list = pd.Series(tag_list).unique()
for t in tag_list:
    print(t)
CodePudding user response:
BeautifulSoup is better suited to this specific scenario, since it works on the HTML directly and does not need to query a browser for each element.
But if you still want to use Selenium, you can try:
# find_elements_by_tag_name was removed in Selenium 4.3; use find_elements with By instead
elems = driver.find_elements(By.TAG_NAME, '*')
tags = []
for x in elems:
    tags.append(x.tag_name)
Which is equivalent to:
elems = driver.find_elements(By.TAG_NAME, '*')
tags = [x.tag_name for x in elems]
Finally, if you only want the unique values, you can use the built-in set() type, for example:
set(tags)
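One thing to keep in mind: every x.tag_name lookup is a separate WebDriver command, so on a page with thousands of elements the loop can be slow. A possible alternative, sketched below, is to let the browser collect the names itself and return them in one call through execute_script. The names unique_tags and COLLECT_TAGS_JS are made up for this illustration, and driver is assumed to be a Selenium WebDriver that has already loaded the page.

```python
# JavaScript run inside the page: gather every element's tag name,
# lower-case it, and de-duplicate with a Set before returning.
COLLECT_TAGS_JS = """
return Array.from(
    new Set(Array.from(document.querySelectorAll('*'),
                       el => el.tagName.toLowerCase()))
);
"""


def unique_tags(driver):
    """Return the sorted unique tag names of the page loaded in `driver`.

    `driver` is assumed to be a Selenium WebDriver (hypothetical helper,
    not part of Selenium itself).
    """
    # execute_script sends the snippet to the browser and converts the
    # returned JavaScript array into a Python list of strings.
    return sorted(driver.execute_script(COLLECT_TAGS_JS))
```

With the driver from the question this would be called as print(unique_tags(driver)) after driver.get(website). Treat it as a sketch rather than a drop-in replacement: it only sees elements in the top-level document, not inside iframes.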