Hidden <a> tags

Time:02-14

I'm sure there is a better way to write this code, but these are the two ways I tried to pull a specific sentence from the Wikipedia page on Python. When I reached what I thought was the correct tag path, the tag I wanted to pull was hidden. Does anyone know why it doesn't come up when I return all the a tags from the page?

[Screenshots: web scraping code; soup instance; what I'm trying to pull from the wiki page]

    import requests
    from bs4 import BeautifulSoup
    
    req = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
    soup = BeautifulSoup(req.content, 'html.parser')
    body = soup.find('body')

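    # first method to get to my desired a tag
    # select() takes only a CSS selector, so the attribute filters go inside the selector;
    # 'sup' is dropped from the path because a path ending in 'sup a' matches only links nested in a <sup>
    a = body.select('div div div div p a[href="/wiki/Backward_compatibility"][title="Backward compatibility"]')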
    
    # second method to get to my desired a tag 
    div1 = soup.body.find('div', attrs = {'id': 'content'})
    div2 = div1.find('div', attrs = {'id': 'bodyContent'})
    div3 = div2.find('div', attrs = {'id': 'mw-content-text'})
    div4 = div3.find('div', attrs = {'class': 'mw-parser-output'})
    sup = div4.find('sup', attrs = {'id': 'cite_ref-33'})
    print(sup.contents)  # a bare expression prints nothing when run as a script
    

    # this shows all the a tags on the page, but I couldn't find the one with href='/wiki/Backward_compatibility' shown in the third screenshot above
    # a = body.find_all('a')
   

CodePudding user response:

You can get the whole paragraph with one long CSS selector:

    print(soup.select_one('p:has(a[title="Backward compatibility"][href="/wiki/Backward_compatibility"])').text)

Output:

Guido van Rossum began working on Python in the late 1980s, as a successor to the ABC programming language, and first released it in 1991 as Python 0.9.0.[33] Python 2.0 was released in 2000 and introduced new features, such as list comprehensions and a cycle-detecting garbage collection system (in addition to reference counting). Python 3.0 was released in 2008 and was a major revision of the language that is not completely backward-compatible. Python 2 was discontinued with version 2.7.18 in 2020.[34]
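
As for why the tag looked hidden: a likely explanation is that the link you want sits directly in the paragraph, not inside a sup, so a path ending in `sup a` only ever reaches the citation links. Here is a minimal offline sketch of that idea. The `SAMPLE_HTML` string is a hypothetical, simplified stand-in for the relevant part of the page (the real markup may differ), and the variable names are only for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the relevant part of the wiki page.
SAMPLE_HTML = """
<div id="content"><div id="bodyContent"><div id="mw-content-text">
<div class="mw-parser-output">
<p>First released in 1991.<sup id="cite_ref-33"><a href="#cite_note-33">[33]</a></sup>
Python 3.0 is not completely
<a href="/wiki/Backward_compatibility" title="Backward compatibility">backward-compatible</a>.</p>
</div></div></div></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A path ending in "sup a" matches only links nested inside a <sup>,
# so in this sample it returns the citation link and nothing else.
sup_links = [a["href"] for a in soup.select("p sup a")]

# Attribute filters belong inside the CSS selector itself.
link = soup.select_one('a[href="/wiki/Backward_compatibility"]')

# find() does accept an attrs dict, unlike select().
same_link = soup.find("a", attrs={"title": "Backward compatibility"})

# :has() selects the paragraph that contains the link.
paragraph = soup.select_one('p:has(a[href="/wiki/Backward_compatibility"])')

print(sup_links)         # hrefs reachable through "p sup a"
print(link.get_text())   # text of the link you were after
print(link.parent.name)  # the link's direct parent tag
```

If the real page is laid out like the sample, `body.find_all('a')` should still include the link; it just won't be found under any sup.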