How do I get a parent attribute (e.g. a link) when the distinguishing information is in a later child?

Using Selenium (Python) to avoid spoilers of a soccer game

I am trying to grab the URL for a video of a soccer match replay from a dynamically changing webpage. The webpage shows the score, so I'd rather get the link directly than visit the website, which will almost certainly show me the score. There are other related videos of the match, like a 10-minute highlight reel, but I want the full replay only.

There is a list of videos on the page to choose from, but the 'h1' heading indicating that a video is the full replay is wrapped inside the 'a' tag (see below). There are ~10 of these list items on the page, and they are distinguished only by the content of the 'h1', buried as a child. The text that I'm after is "Brentford v LFC : Full match". The "Full match" part is the giveaway.

My problem is: how do I get the link when the important information comes in a later child?

<li data-sidebar-video="0_5de4sioh" class="js-subscribe-entitlement">
  <a class="" href="//video.liverpoolfc.com/player/0_5de4sioh/">
    <article class="video-thumb video-thumb--fade-in js-thumb video-thumb--no-duration video-thumb--sidebar">
      <figure class="video-thumb__img">
        <div class="site-loader">
          <ul>
            <li></li>
            <li></li>
            <li></li>
          </ul>
        </div> <img class="video-thumb__img-container loaded" data-src="//open.http.mp.streamamg.com/p/101/thumbnail/entry_id/0_5de4sioh/width/150/height/90/type/3" alt="Brentford v LFC : Full match" onerror="PULSE.app.common.VideoThumbError(this)" onload="PULSE.app.common.VideoThumbLoaded(this)"
          src="//open.http.mp.streamamg.com/p/101/thumbnail/entry_id/0_5de4sioh/width/150/height/90/type/3" data-image-initialised="true"> <span class="video-thumb__premium">Premium</span> <i class="video-thumb__play-btn"></i> <span class="video-thumb__time"> <i class="video-thumb__icon"></i> 1:45:07 </span>        </figure>
      <div class="video-thumb__txt-container"> <span class="video-thumb__tag js-video-tag">Match Action</span>
        <h1 class="video-thumb__heading">Brentford v LFC : Full match</h1> <time class="video-thumb__date">25th Sep 2021</time> </div>
    </article>
  </a>
</li>

My code looks like this at the moment. It gives me a list of the links but I don't know which one is which.

from selenium import webdriver

#------------------------Account login---------------------------#
#I have to login to my account first. 
#----------------------------------------------------------------#

username = "<my username goes here>"
password = "<my password goes here>"
username_object_id = "login_form_username"
password_object_id = "login_form_password"
login_button_name = "submitBtn"
login_url = "https://video.liverpoolfc.com/mylfctvgo"
driver = webdriver.Chrome("/usr/local/bin/chromedriver")
driver.get(login_url)
driver.implicitly_wait(10)
driver.find_element_by_id(username_object_id).send_keys(username)
driver.find_element_by_id(password_object_id).send_keys(password)
driver.find_element_by_name(login_button_name).click()

#--------------Find most recent game played----------------#
#I have to go to the matches section of my account and click on the most recent game
#----------------------------------------------------------------#
matches_url = "https://video.liverpoolfc.com/matches"
driver.get(matches_url)
driver.implicitly_wait(10)
latest_game = driver.find_element_by_xpath("/html/body/div[2]/section/ul/li[1]/section/div/div[1]/a").get_attribute('href')
driver.get(latest_game)
driver.implicitly_wait(10)

#--------------Find the full replay video----------------#
#There are many videos to choose from but I only want the full replay.
#--------------------------------------------------#

#prints all the videos in the list. They all have the same "data-sidebar-video" attribute 
web_element1 = driver.find_elements_by_css_selector('li[data-sidebar-video*=""] > a')

print(web_element1)

for i in web_element1:
    print(i.get_attribute('href'))

CodePudding user response：

You can use driver.execute_script to grab only the links that have the "Full match" designation as a child:

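links = driver.execute_script('''
 var links = [];
 // [data-sidebar-video] selects every li that has the attribute; per the
 // Selectors spec, an empty substring test (*="") matches nothing.
 for (var a of document.querySelectorAll('li[data-sidebar-video] > a')){
    var heading = a.querySelector('h1.video-thumb__heading');
    // Skip items without a heading instead of throwing a TypeError.
    if (heading && heading.textContent.trim().endsWith('Full match')){
        // a.href is the resolved absolute URL; getAttribute('href') would
        // return the raw protocol-relative value (//video.liverpoolfc.com/...).
        links.push(a.href);
    }
 }
 return links;
''')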

CodePudding user response：

You can try something like the code below.

Extract the list of videos as li elements, check whether the h1 inside each item contains "Full match", and if it does, get the a tag and its href.

# Imports Required:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver.get("https://video.liverpoolfc.com/player/0_5j5fsdzg/?contentReferences=FOOTBALL_FIXTURE:g2210322&page=0&pageSize=20&sortOrder=desc&title=Highlights: Brentford 3-3 LFC&listType=LIST-DEFAULT")
wait = WebDriverWait(driver,30)

wait.until(EC.visibility_of_element_located((By.XPATH,"//ul[contains(@class,'related-videos')]/li")))
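# find_elements(By..., ...) replaces the find_elements_by_* helpers, which were removed in Selenium 4.3.
videos = driver.find_elements(By.XPATH, "//ul[contains(@class,'related-videos')]/li")

for video in videos:
    # Read the heading text of this list item.
    option = video.find_element(By.TAG_NAME, "h1").get_attribute("innerText")
    if "Full match" in option:
        # The link lives on the a tag that wraps the heading.
        link = video.find_element(By.TAG_NAME, "a").get_attribute("href")
        print(f"{option} : {link}")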

Brentford v LFC : Full match : https://video.liverpoolfc.com/player/0_5de4sioh/
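
If you want to check the matching logic without logging in (and without risking a look at the score), the same idea can be tried on a saved HTML fragment using only the standard library. This is a minimal sketch, assuming markup shaped like the snippet in the question. The class name FullMatchLinkParser and the SAMPLE fragment (including the highlights title) are made up for illustration and are not part of the site or of Selenium.

```python
from html.parser import HTMLParser


class FullMatchLinkParser(HTMLParser):
    """Collect hrefs of <a> elements whose nested <h1> text contains a marker."""

    def __init__(self, marker="Full match"):
        super().__init__()
        self.marker = marker
        self.links = []            # hrefs whose heading matched
        self._current_href = None  # href of the <a> we are currently inside
        self._in_h1 = False
        self._h1_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Remember the parent link before we reach its children.
            self._current_href = dict(attrs).get("href")
        elif tag == "h1" and self._current_href is not None:
            self._in_h1 = True
            self._h1_text = []

    def handle_data(self, data):
        if self._in_h1:
            self._h1_text.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            # The child heading decides whether the parent link is kept.
            if self.marker in "".join(self._h1_text):
                self.links.append(self._current_href)
        elif tag == "a":
            self._current_href = None


# Hypothetical, trimmed-down fragment modelled on the question's markup.
SAMPLE = """
<ul>
  <li data-sidebar-video="0_5j5fsdzg">
    <a href="//video.liverpoolfc.com/player/0_5j5fsdzg/">
      <article><h1 class="video-thumb__heading">Highlights: Brentford v LFC</h1></article>
    </a>
  </li>
  <li data-sidebar-video="0_5de4sioh">
    <a href="//video.liverpoolfc.com/player/0_5de4sioh/">
      <article><h1 class="video-thumb__heading">Brentford v LFC : Full match</h1></article>
    </a>
  </li>
</ul>
"""

parser = FullMatchLinkParser()
parser.feed(SAMPLE)
print(parser.links)  # prints ['//video.liverpoolfc.com/player/0_5de4sioh/']
```

Note that this reads the raw attribute value, so the href comes back protocol-relative rather than as the absolute URL that Selenium printed above.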

CodePudding user response：

You can do this with a simple XPath locator since you are searching based on contained text.

//a[.//h1[contains(text(),'Full match')]]
^ an A tag
   ^ that has an H1 descendant
         ^ that contains the text "Full match"
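
If you need the same locator for other headings too, the expression can be built from the marker text. This is only a sketch: the helper name anchor_by_heading_xpath is made up, and it uses normalize-space(.) instead of text() so that stray whitespace or nested tags inside the h1 don't break the match. The Selenium call in the trailing comment assumes Selenium 4 and an already logged-in driver.

```python
def anchor_by_heading_xpath(marker):
    """Build an XPath for <a> elements whose descendant <h1> contains marker."""
    if "'" in marker:
        # XPath 1.0 has no escape for a quote inside a string literal,
        # so this sketch simply refuses such markers.
        raise ValueError("marker must not contain a single quote")
    return f"//a[.//h1[contains(normalize-space(.), '{marker}')]]"


xpath = anchor_by_heading_xpath("Full match")
print(xpath)  # prints //a[.//h1[contains(normalize-space(.), 'Full match')]]

# With Selenium 4 and an existing `driver`, it could be used like this:
#   from selenium.webdriver.common.by import By
#   link = driver.find_element(By.XPATH, xpath).get_attribute("href")
```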

NOTE: The href in the page source is a protocol-relative URL, e.g. //video.liverpoolfc.com/player/0_5de4sioh/, not a complete one. Selenium's get_attribute('href') normally returns it already resolved to an absolute URL, but if you read the raw attribute value (for example with JavaScript's getAttribute), you'll have to prepend "https:" to make it usable, e.g. before writing it to a file. Alternatively, just click on the link.
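
For the partial URLs, the standard library can do the fix-up for you. A minimal sketch, assuming the page is served over HTTPS; BASE_URL and the helper name absolute_url are illustrative choices, not something the site or Selenium provides.

```python
from urllib.parse import urljoin

# Assumption: the page the links were scraped from is served over HTTPS.
BASE_URL = "https://video.liverpoolfc.com/"


def absolute_url(href, base=BASE_URL):
    """Resolve a relative or protocol-relative href against the page URL."""
    return urljoin(base, href)


print(absolute_url("//video.liverpoolfc.com/player/0_5de4sioh/"))
# prints https://video.liverpoolfc.com/player/0_5de4sioh/
```

urljoin leaves URLs that are already absolute unchanged, so it is safe to run every scraped href through it.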