I tried to get the SQLite download link from the SQLite download page using BeautifulSoup.
I can see the correct href value when inspecting the page in Chrome.
However, I cannot get the href value with Python using the code below.
import urllib.request
import re
from bs4 import BeautifulSoup
url = "https://www.sqlite.org/download.html"
data = urllib.request.urlopen(url).read()
parsed_html = BeautifulSoup(data, 'html.parser')
link_tags = parsed_html.find_all('a')
pattern = r"sqlite-autoconf-(\d+)\.tar\.gz"
pattern_regex = re.compile(pattern)
download_link_tag = next(a for a in link_tags if pattern_regex.match(a.text))
print(download_link_tag.get('href'))
The result is:
hp1.html
May I know how to solve this problem?
Thanks!
CodePudding user response:
The sqlite website's HTML defines the href for that tag as "hp1.html", which is why you see that result. Once the page loads, some JavaScript replaces the href values for a number of tags by locating each tag by its id and then swapping in the real href.
What you can do is first find the tag you are looking for, the way you already are, and then read that tag's id. Once you have the id, use a regex on the page source to find the href that id will receive.
import urllib.request
import re
from bs4 import BeautifulSoup
url = "https://www.sqlite.org/download.html"
data = urllib.request.urlopen(url).read()
parsed_html = BeautifulSoup(data, 'html.parser')
link_tags = parsed_html.find_all('a')
pattern = r"sqlite-autoconf-(\d+)\.tar\.gz"
pattern_regex = re.compile(pattern)
download_link_tag = next(a for a in link_tags if pattern_regex.match(a.text))
# get the id of the tag where the href you want will go
tag_id = download_link_tag.get('id')
# This pattern will find the href that will replace the current href for a given tag_id
replace_pattern = re.compile(f"'{tag_id}','([^']*)'")
# Now we use our pattern to find all instances where our href is replaced with a new one
print(replace_pattern.findall(str(parsed_html)))
This will print:
['2022/sqlite-autoconf-3390200.tar.gz']
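Note that the path found this way is relative to the site root, so you still need to join it to the base URL before downloading. A minimal sketch using urllib.parse.urljoin, with the path printed above hard-coded for illustration:

```python
from urllib.parse import urljoin

# The relative path extracted by the regex above; joining it against the
# download page's URL yields a fully qualified link.
page_url = "https://www.sqlite.org/download.html"
relative = "2022/sqlite-autoconf-3390200.tar.gz"

full_url = urljoin(page_url, relative)
print(full_url)  # https://www.sqlite.org/2022/sqlite-autoconf-3390200.tar.gz
```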
CodePudding user response:
You can always do it like this:
from bs4 import BeautifulSoup
import requests
import re
url = "https://www.sqlite.org/download.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
dl_url = soup.select('script')[-1].text.split('d391')[-1].split("');")[0].split(",'")[1]
print('https://www.sqlite.org/' + dl_url)
This will return:
https://www.sqlite.org/2022/sqlite-autoconf-3390200.tar.gz
You can use the URL ids to experiment with getting various links and their names/descriptions, if you want, and then split that script tag's content accordingly. The BeautifulSoup docs can be found at https://beautiful-soup-4.readthedocs.io/en/latest/index.html
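For experimenting with all the links at once, you can pull every id/href pair out of that script in one pass instead of splitting on fixed substrings. A minimal sketch, run here against a mock of the inline script (the function name and ids are illustrative; the real page passes similar pairs to its replacement call):

```python
import re

# A mock of the inline script found on the download page, which rewrites
# anchor hrefs after load by calling a function with ('id','href') pairs.
sample_script = """
d391('a1','2022/sqlite-autoconf-3390200.tar.gz');
d391('a2','2022/sqlite-dll-win64-x64-3390200.zip');
"""

# Capture every ('id','href') argument pair, whatever the function is named.
pair_pattern = re.compile(r"\('([^']+)','([^']+)'\)")
id_to_href = dict(pair_pattern.findall(sample_script))

print(id_to_href['a1'])  # 2022/sqlite-autoconf-3390200.tar.gz
```

On the real page you would apply the same regex to the text of the script tag selected with soup.select('script'), then look up the id of whichever anchor you matched.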