Get the <span> value for each webpage with Selenium in Python

I have a list of web pages that I want to loop through to extract the genres of each film. They all come from Box Office Mojo.

An example link is the following: https://www.boxofficemojo.com/release/rl3829564929/

In the browser's inspector, the structure of the section that I want to extract looks like this:

<div class = "a-section a-spacing-none">
   <span>Genres</span>
 </div>
  <span>
  Action Adventure Thriller
 </span>

When I run the following code:

from selenium import webdriver

# Raw string so the backslashes in the Windows path are not treated as escapes
driver = webdriver.Chrome(r"C:\SeleniumDrivers\chromedriver.exe")
driver.get("https://www.boxofficemojo.com/release/rl3829564929/")
driver.implicitly_wait(3)
my_element = driver.find_element_by_xpath("/html/body/div[1]/main/div/div[3]/div[4]/div[7]/span[2]")
my_element.text

I get the following results:

'Action Adventure Thriller'

which is the desired result for this particular movie. However, on other movie pages the XPath is different, so I cannot reuse it automatically.

The ideal solution would loop through the pages and extract the genres of each film, regardless of where the Genres entry sits on each individual film page.

CodePudding user response：

my_element = driver.find_element_by_xpath("//div[@class='a-section a-spacing-none' and contains(.,'Genres')]/span[2]") 
my_element.text

Look for a more specific XPath that fits your requirements, such as one anchored on the "Genres" label instead of the element's position in the page.
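
As a rough sketch of the loop the question asks for, the XPath from this answer could be wrapped in a small helper. The function name `scrape_genres` and the choice to record `None` for pages without a match are assumptions made for illustration, not part of the original answer. It uses `find_elements` (plural), which returns an empty list when nothing matches, and passes the string "xpath", which is the value of `By.XPATH`.

```python
# XPath taken from the answer: anchored on the "Genres" label, not on page position.
GENRES_XPATH = (
    "//div[@class='a-section a-spacing-none' and contains(.,'Genres')]/span[2]"
)


def scrape_genres(driver, urls):
    """Visit each URL and return {url: genre text, or None if not found}."""
    results = {}
    for url in urls:
        driver.get(url)
        # find_elements returns [] instead of raising when there is no match.
        matches = driver.find_elements("xpath", GENRES_XPATH)
        results[url] = matches[0].text.strip() if matches else None
    return results


# Usage sketch (assumes Selenium and a matching Chrome driver are installed):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.implicitly_wait(3)
# print(scrape_genres(driver, ["https://www.boxofficemojo.com/release/rl3829564929/"]))
# driver.quit()
```

Taking the driver as a parameter keeps one browser session open for the whole list of pages, and a `None` entry makes it easy to spot pages where the Genres row is missing or laid out differently.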

CodePudding user response：

On each page you visit, if the Genres label is followed by the genre names within a <span> tag, you can extract the text with the following locator strategy:

Using XPath:

driver.get("https://www.boxofficemojo.com/release/rl3829564929/")
print(driver.find_element_by_xpath("//span[text()='Genres']//following::span[1]").text)

Ideally, you should use WebDriverWait with visibility_of_element_located() so that the element is visible before you read it. You can use the following locator strategy:

Using XPATH:

driver.get("https://www.boxofficemojo.com/release/rl3829564929/")
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[text()='Genres']//following::span[1]"))).get_attribute("innerHTML"))

Console Output:
```
Action Adventure Thriller
```

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
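
Because the explicit-wait example reads `innerHTML` rather than `.text`, the returned string may carry the markup's line breaks, indentation, or HTML entities. A minimal cleanup sketch, assuming that is the case on some pages, could look like this; `normalize_genres` is a hypothetical helper name, not something from Selenium.

```python
import html


def normalize_genres(raw):
    """Decode HTML entities and collapse all whitespace runs to single spaces."""
    decoded = html.unescape(raw)
    # str.split() with no arguments splits on any whitespace, including newlines.
    return " ".join(decoded.split())
```

Applied to the value returned by `get_attribute("innerHTML")`, this should give a single-line string comparable to what `.text` returns.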