Cannot get correct href value when crawling sqlite website using BeautifulSoup in python


I tried to get the SQLite download link from the SQLite download page using BeautifulSoup.

I can see the correct href value when inspecting the page in Chrome.

[screenshot of the download page showing the expected href]

However, I cannot get that href value with the Python code below.

import urllib.request
import re
from bs4 import BeautifulSoup

url = "https://www.sqlite.org/download.html"
data = urllib.request.urlopen(url).read()
parsed_html = BeautifulSoup(data, 'html.parser')
link_tags = parsed_html.find_all('a')
pattern = r"sqlite-autoconf-(\d+)\.tar\.gz"
pattern_regex = re.compile(pattern)
download_link_tag = next(a for a in link_tags if pattern_regex.match(a.text))
print(download_link_tag.get('href'))

The result is:

hp1.html

May I know how to solve this problem?

Thanks!

CodePudding user response:

The SQLite website's HTML defines the href for that tag as "hp1.html", which is why you are seeing that result. Once the page loads, some JavaScript replaces the href values of a number of tags, locating each tag by its id and swapping in the real link.
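You can confirm this by dumping the inline script that does the rewriting. A minimal sketch, assuming the rewriting calls mention the .tar.gz paths (the exact script contents depend on the page's current markup):

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.sqlite.org/download.html"
data = urllib.request.urlopen(url).read()
parsed_html = BeautifulSoup(data, 'html.parser')

# Print any inline <script> whose text mentions a tarball path;
# this is where the placeholder hrefs get replaced after load.
for script in parsed_html.find_all('script'):
    if '.tar.gz' in script.text:
        print(script.text)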

What you can do is first find the tag you are looking for, the way you already are, and then read that tag's id. Once you have the id, use a regex to search the page source for the href that the JavaScript assigns to that id.

import urllib.request
import re
from bs4 import BeautifulSoup

url = "https://www.sqlite.org/download.html"
data = urllib.request.urlopen(url).read()
parsed_html = BeautifulSoup(data, 'html.parser')
link_tags = parsed_html.find_all('a')
pattern = r"sqlite-autoconf-(\d+)\.tar\.gz"
pattern_regex = re.compile(pattern)
download_link_tag = next(a for a in link_tags if pattern_regex.match(a.text))

# get the id of the tag where the href you want will go
tag_id = download_link_tag.get('id')

# This pattern will find the href that will replace the current href for a given tag_id
replace_pattern = re.compile(f"'{tag_id}','([^']*)'")

# Now we use our pattern to find all instances where our href is replaced with a new one
print(replace_pattern.findall(str(parsed_html)))

This will print:

['2022/sqlite-autoconf-3390200.tar.gz']
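The path is relative to the site root, so if you want a full download URL you can join it yourself. A minimal sketch, reusing replace_pattern and parsed_html from above and assuming findall returned at least one match:

from urllib.parse import urljoin

matches = replace_pattern.findall(str(parsed_html))
# Join the relative path onto the site root to get an absolute URL.
download_url = urljoin("https://www.sqlite.org/", matches[0])
print(download_url)  # e.g. https://www.sqlite.org/2022/sqlite-autoconf-3390200.tar.gz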

CodePudding user response:

You can always do it like this:

from bs4 import BeautifulSoup
import requests
import re

url = "https://www.sqlite.org/download.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
# The last <script> tag holds calls like d391('a1','2022/sqlite-autoconf-...');
# take the text after the last 'd391' and pull out the quoted path.
dl_url = soup.select('script')[-1].text.split('d391')[-1].split("');")[0].split(",'")[1]
print('https://www.sqlite.org/' + dl_url)

This will return:

https://www.sqlite.org/2022/sqlite-autoconf-3390200.tar.gz

You can experiment with the tag ids to pull out various links and their names/descriptions, splitting that script tag's content accordingly. The BeautifulSoup docs can be found at https://beautiful-soup-4.readthedocs.io/en/latest/index.html
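For example, a regex over the same script text can list every (id, path) pair the page rewrites. A minimal sketch, reusing soup and re from the code above and assuming the rewriting helper is still named d391:

# Extract every (tag_id, path) pair from the d391(...) calls.
script_text = soup.select('script')[-1].text
pairs = re.findall(r"d391\('([^']+)','([^']*)'\)", script_text)
for tag_id, path in pairs:
    print(tag_id, '->', 'https://www.sqlite.org/' + path)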
