I'm sure there is a better way to write this code, but these are the two ways I tried to pull a specific sentence from the Wikipedia page on Python. When I followed what I thought was the correct tag path, the tag I wanted to pull was missing. Does anyone know why it does not come up when I return all the a tags from the page?
What I'm trying to pull from the wiki page
import requests
from bs4 import BeautifulSoup
req = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
soup = BeautifulSoup(req.content, 'html.parser')
body = soup.find('body')
# first method to get to my desired a tag
a = body.select('div div div div p sup a', attrs = {'href': '/wiki/Backward_compatibility','title': 'Backward compatability'})
# second method to get to my desired a tag
div1 = soup.body.find('div', attrs = {'id': 'content'})
div2 = div1.find('div', attrs = {'id': 'bodyContent'})
div3 = div2.find('div', attrs = {'id': 'mw-content-text'})
div4 = div3.find('div', attrs = {'class': 'mw-parser-output'})
sup = div4.find('sup', attrs = {'id': 'cite_ref-33'})
sup.contents
# this shows all the a tags on the page, but I couldn't find the one with href='/wiki/Backward_compatibility' shown in the third screenshot above
# a = body.find_all('a')
CodePudding user response:
You can get the whole paragraph with a single CSS selector:
print(soup.select_one('p:has(a[title="Backward compatibility"][href="/wiki/Backward_compatibility"])').text)
Output:
Guido van Rossum began working on Python in the late 1980s, as a successor to the ABC programming language, and first released it in 1991 as Python 0.9.0.[33] Python 2.0 was released in 2000 and introduced new features, such as list comprehensions and a cycle-detecting garbage collection system (in addition to reference counting). Python 3.0 was released in 2008 and was a major revision of the language that is not completely backward-compatible. Python 2 was discontinued with version 2.7.18 in 2020.[34]
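As for why the original attempts may have missed the link, here is a minimal offline sketch. The HTML below is a hypothetical, simplified stand-in for the page, and it assumes the "backward-compatible" link sits directly in the paragraph text rather than inside a sup footnote marker. Under that assumption, a path ending in `p sup a` can never reach it, while putting the attribute test inside the selector itself does.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the relevant part of the page
# (assumption: the real markup is far larger but has the same shape).
html = """
<div class="mw-parser-output">
  <p>Intro paragraph.<sup id="cite_ref-33"><a href="#cite_note-33">[33]</a></sup></p>
  <p>Python 3.0 is not completely
     <a href="/wiki/Backward_compatibility" title="Backward compatibility">backward-compatible</a>.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# In this sketch the link is a child of <p>, not of <sup>,
# so a descendant path ending in "p sup a" cannot match it.
via_sup = soup.select('p sup a[href="/wiki/Backward_compatibility"]')

# Attribute filters belong inside the CSS selector string.
link = soup.select_one('a[href="/wiki/Backward_compatibility"]')

# :has() selects the paragraph that contains the link.
para = soup.select_one('p:has(a[href="/wiki/Backward_compatibility"])')

print(via_sup)                            # empty list: no match through <sup>
print(link["title"])                      # title attribute of the matched link
print(" ".join(para.get_text().split()))  # paragraph text, whitespace collapsed
```

Note also that the title in the question is spelled 'Backward compatability', while the selector above uses 'Backward compatibility', so an exact match on the misspelled title would find nothing.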