BeautifulSoup: how to get a tag containing a given string without a CSS selector

I'm currently learning how to scrape web pages.

THE PROBLEM: I can't use a CSS selector, because on other sites the position (order) of this tag (information about the estimated start time) changes.

MY GOAL: How can I retrieve the information: January 2022

HTML-SNIPPET:

<tr>
    <td headers="studyInfoColTitle">  Estimated <span style="display:inline;"  data-term="Study Start Date" title="Show definition">Study Start Date <i  aria-hidden="true" data-term="Study Start Date" style="border-bottom-style:none;"></i></span> : 
    </td>
    <td headers="studyInfoColData" style="padding-left:1em">January 2022</td>
</tr>

WHAT I HAVE TRIED:

1.) I tried to define a function to filter out this tag (combined with find_all):

def searchMethod(tag):
    return re.compile("Estimated") and (str(tag.string).find("Estimated") > -1)

# calling the function above
foundTag_s = soup.find_all(searchMethod)

This helped me in other similar cases, but here it didn't work. I think it has to do with how the text is divided between the tags...

2.) I tried to use the string search:

starttime_elem = soup.find("td", string="Estimated")

but it doesn't work for some reason.

After many hours of searching I decided to ask here.

Ref: https://clinicaltrials.gov/ct2/show/NCT05169372?draw=2&rank=1

CodePudding user response：

So, you are actually looking at different pages within the same domain. The HTML is basically consistent in terms of elements and attributes.

CSS selector lists can do a lot more than positional matching. There are numerous ways to solve your current problem.

One is simply to use a CSS attribute = value selector to target the start date node, then move to the next td:

import requests
from bs4 import BeautifulSoup as bs

links = [
    'https://clinicaltrials.gov/ct2/show/NCT05169372?draw=2&rank=1',
    'https://clinicaltrials.gov/ct2/show/NCT05169359?draw=2&rank=2',
]

with requests.Session() as s:
    for link in links:
        r = s.get(link, headers={'User-Agent': 'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        # first element whose data-term attribute names the start date
        start = soup.select_one('[data-term="Study Start Date"]')

        if start is not None:
            print(start.text)                  # the label text
            print(start.find_next('td').text)  # the value in the following cell

The data-term attribute is a robust and consistent hook: matching on it does not depend on where the row sits in the table.
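
If you want to try the attribute approach without hitting the network, here is a minimal sketch that runs the same selector against the HTML snippet from the question. The function name get_start_date and the HTML constant are just illustrative names, and I'm using the built-in html.parser instead of lxml so nothing extra needs installing.

```python
from bs4 import BeautifulSoup

# HTML copied from the question; on a real page this would come from requests.
HTML = """
<tr>
    <td headers="studyInfoColTitle">  Estimated <span style="display:inline;" data-term="Study Start Date" title="Show definition">Study Start Date <i aria-hidden="true" data-term="Study Start Date" style="border-bottom-style:none;"></i></span> :
    </td>
    <td headers="studyInfoColData" style="padding-left:1em">January 2022</td>
</tr>
"""


def get_start_date(html):
    """Return the text of the cell following the 'Study Start Date' label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Attribute = value selector: matches regardless of the row's position.
    label = soup.select_one('[data-term="Study Start Date"]')
    if label is None:
        return None
    # find_next walks forward in document order to the next <td>.
    value_cell = label.find_next("td")
    return value_cell.get_text(strip=True) if value_cell else None


print(get_start_date(HTML))
```

Returning None when the label is missing keeps the loop over several links from crashing on a page that has no start date row.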

You could also use :-soup-contains:

start = soup.select_one('.term:-soup-contains("Study Start Date")')
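
As a rough sketch of the text-based route, here are two variants run against the snippet from the question. The first uses :-soup-contains on the td itself (the snippet as posted shows no class attribute, so I'm not relying on .term here) and steps to the neighbouring cell with the + combinator; it assumes a reasonably recent soupsieve (2.1 or later). The second uses no CSS at all. As far as I can tell, find("td", string="Estimated") returns nothing because that form compares against the td's .string, which is None when the tag has several children, and a plain string has to match exactly anyway. Searching the text nodes with a regex avoids both issues. The function names are made up for the example.

```python
import re

from bs4 import BeautifulSoup

# HTML copied from the question.
HTML = """
<tr>
    <td headers="studyInfoColTitle">  Estimated <span style="display:inline;" data-term="Study Start Date" title="Show definition">Study Start Date <i aria-hidden="true" data-term="Study Start Date" style="border-bottom-style:none;"></i></span> :
    </td>
    <td headers="studyInfoColData" style="padding-left:1em">January 2022</td>
</tr>
"""


def start_date_by_contains(html):
    """CSS variant: match the label cell by its text, take the adjacent cell."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.select_one('td:-soup-contains("Study Start Date") + td')
    return cell.get_text(strip=True) if cell else None


def start_date_by_string(html):
    """Non-CSS variant: find the text node, then navigate to the next cell."""
    soup = BeautifulSoup(html, "html.parser")
    # With no tag name given, string= is tested against individual text nodes,
    # so the nested <span> does not get in the way.
    text_node = soup.find(string=re.compile("Estimated"))
    if text_node is None:
        return None
    label_cell = text_node.find_parent("td")
    if label_cell is None:
        return None
    value_cell = label_cell.find_next_sibling("td")
    return value_cell.get_text(strip=True) if value_cell else None


print(start_date_by_contains(HTML))
print(start_date_by_string(HTML))
```

Both are sensitive to the visible wording ("Estimated", "Study Start Date"), so if the label text differs between pages the data-term attribute is probably the safer thing to match on.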