I'm currently learning how to scrape web pages.
THE PROBLEM: I can't use a CSS selector, because on other sites the position (order) of this tag (the information about the estimated start time) changes.
MY GOAL: How can I retrieve the information "January 2022"?
HTML-SNIPPET:
<tr>
<td headers="studyInfoColTitle"> Estimated <span style="display:inline;" data-term="Study Start Date" title="Show definition">Study Start Date <i aria-hidden="true" data-term="Study Start Date" style="border-bottom-style:none;"></i></span> :
</td>
<td headers="studyInfoColData" style="padding-left:1em">January 2022</td>
</tr>
WHAT I HAVE TRIED:
1.) I tried to declare a function to use as a filter (combined with find_all) for this tag:
def searchMethod(tag):
    return re.compile("Estimated") and (str(tag.string).find("Estimated") > -1)

# calling the function above
foundTag_s = soup.find_all(searchMethod)
This helped me in other similar cases, but here it didn't work. I think it has to do with how the text is divided between the tags...
2.) I tried to use the string search:
starttime_elem = soup.find("td", string="Estimated")
but it doesn't work for some reason.
After many hours of searching, I decided to ask here.
Ref: https://clinicaltrials.gov/ct2/show/NCT05169372?draw=2&rank=1
CodePudding user response:
So, you are actually looking at different pages within the same domain. The HTML is basically consistent in terms of elements and attributes.
CSS selectors are a lot more versatile than just positional matching, and there are numerous ways to solve your current problem.
One is simply to use a CSS attribute = value selector to target the start date node, then move to the next td:
import requests
from bs4 import BeautifulSoup as bs

links = ['https://clinicaltrials.gov/ct2/show/NCT05169372?draw=2&rank=1', 'https://clinicaltrials.gov/ct2/show/NCT05169359?draw=2&rank=2']

with requests.Session() as s:
    for link in links:
        r = s.get(link, headers={'User-Agent': 'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        # the label element carries a data-term attribute
        start = soup.select_one('[data-term="Study Start Date"]')
        if start is not None:
            print(start.text)
            # the value sits in the next td in document order
            print(start.find_next('td').text)
The data-term attribute is consistent across these pages, which makes it a robust thing to select on.
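To try the attribute selector without hitting the network, here is a minimal offline sketch that runs it against the HTML snippet from the question. The wrapping table element, the `get_start_date` helper name, and the use of the built-in `html.parser` instead of lxml are my own assumptions for illustration, not part of the original page or code.

```python
from bs4 import BeautifulSoup

# The snippet from the question, wrapped in a <table> (assumption) so it is valid HTML.
HTML = """
<table>
<tr>
<td headers="studyInfoColTitle"> Estimated <span style="display:inline;" data-term="Study Start Date" title="Show definition">Study Start Date <i aria-hidden="true" data-term="Study Start Date" style="border-bottom-style:none;"></i></span> :
</td>
<td headers="studyInfoColData" style="padding-left:1em">January 2022</td>
</tr>
</table>
"""


def get_start_date(html):
    """Hypothetical helper: return the text of the td following the label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # First element in document order whose data-term equals "Study Start Date"
    label = soup.select_one('[data-term="Study Start Date"]')
    if label is None:
        return None
    # find_next walks forward in document order, so it skips the enclosing td
    value_cell = label.find_next("td")
    return value_cell.get_text(strip=True) if value_cell is not None else None


start_date = get_start_date(HTML)
print(start_date)
```

Because the lookup keys on the attribute rather than on the row's position, it should keep working if the row moves within the table, as long as the label keeps its data-term value.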
You could also use :-soup-contains:
start = soup.select_one('.term:-soup-contains("Study Start Date")')
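As a rough sketch of why text-based CSS matching works where the attempts in the question did not: the label cell contains text plus a nested span, so its `.string` is None and `string="Estimated"` has nothing to compare against, whereas `:-soup-contains` looks at the text of descendants as well. The example below runs offline against the question's snippet. The wrapping table, the choice of `html.parser`, and the `td ... + td` selector are assumptions of mine (the selector is a variation, not the one above), and it assumes a Soup Sieve version recent enough to support `:-soup-contains`.

```python
from bs4 import BeautifulSoup

# The snippet from the question, wrapped in a <table> (assumption) so it is valid HTML.
HTML = """
<table>
<tr>
<td headers="studyInfoColTitle"> Estimated <span style="display:inline;" data-term="Study Start Date" title="Show definition">Study Start Date <i aria-hidden="true" data-term="Study Start Date" style="border-bottom-style:none;"></i></span> :
</td>
<td headers="studyInfoColData" style="padding-left:1em">January 2022</td>
</tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# The label cell has several children (text, span, text), so .string is None
label_cell = soup.find("td")
string_is_none = label_cell.string is None

# string= is compared against .string, so this lookup finds nothing
exact_match = soup.find("td", string="Estimated")

# :-soup-contains also considers descendant text; "+ td" selects the
# td that immediately follows the matching label cell
value_cell = soup.select_one('td:-soup-contains("Study Start Date") + td')
start_date = value_cell.get_text(strip=True) if value_cell is not None else None

print(string_is_none, exact_match, start_date)
```

The same reasoning applies to the function filter in the question: testing `tag.get_text()` instead of `tag.string` would let it see the "Estimated" text inside the cell.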