In a private project (learning Python scripting), I needed to retrieve only the RPM package link from the scraped page. I noticed that all package links (.msi, .deb, .rpm) have an attribute called data-link inside the 'a' tag.
I also tailored my own regex (https://regexr.com/6rqd2) to match only the package I need.
According to the Beautiful Soup documentation, this kind of attribute (data-*) in HTML 5 has a name that cannot be used as a keyword argument.
So I tried the attrs argument and passed it into find_all(), but with no success.
Unsuccessful code below:
#!/usr/bin/env python3
import re
import requests
from bs4 import BeautifulSoup
url = "https://www.splunk.com/en_us/download/splunk-enterprise.html"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
pattern = re.compile("(?<=data-link=\")[^ ]+rpm")
package = soup.find_all(attrs={"data-link": pattern})
print(package)
Thank you in advance for your help
CodePudding user response:
Do you need all the features Beautiful Soup provides? If not, the line below should find the links with your existing pattern:
re.findall(pattern, page.text)
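As a rough sketch of this regex-only approach, the snippet below runs a lookbehind pattern against a small made-up HTML fragment rather than the downloaded page. The `sample_html` string, its example.com URLs, and the slightly stricter pattern are assumptions for illustration, not the real page or the asker's exact regex:

```python
import re

# Hypothetical fragment standing in for the downloaded page source.
sample_html = (
    '<a data-link="https://example.com/pkg-1.0-x86_64.rpm">rpm</a>'
    '<a data-link="https://example.com/pkg-1.0-amd64.deb">deb</a>'
    '<a data-link="https://example.com/pkg-1.0-x64.msi">msi</a>'
)

# The lookbehind anchors the match right after data-link=" and the
# negated class cannot cross a quote, so only the attribute value is returned.
pattern = re.compile(r'(?<=data-link=")[^"]+\.rpm(?=")')

rpm_links = pattern.findall(sample_html)
print(rpm_links)
```

Here the lookbehind is useful because the pattern sees the raw markup, including the attribute name.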
CodePudding user response:
You don't need to include data-link in your expression because you're searching by the value of the attribute, so the pattern is matched against the value only, not the full element:
soup.find_all(
    "a",
    {"data-link": re.compile(r"^(.+?)\.rpm$")},
)
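To illustrate the difference, here is a small sketch on an inline HTML fragment (the `sample_html` markup and example.com URLs are made up for illustration). Beautiful Soup searches the pattern against the attribute's value only, so a pattern that expects the literal text data-link=" in front of the URL should never match:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure of the download links.
sample_html = """
<a data-link="https://example.com/pkg-1.0-x86_64.rpm">rpm</a>
<a data-link="https://example.com/pkg-1.0-amd64.deb">deb</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Searched against the attribute value only, so this finds the .rpm link.
value_pattern = re.compile(r"\.rpm$")
# Expects text that is not part of the value, so it finds nothing.
element_pattern = re.compile(r'(?<=data-link=")[^ ]+rpm')

matches = soup.find_all("a", attrs={"data-link": value_pattern})
no_matches = soup.find_all("a", attrs={"data-link": element_pattern})

print([a["data-link"] for a in matches])
print(no_matches)
```

The second call returns an empty result for the same reason the code in the question does.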
CodePudding user response:
Another solution, using CSS selectors:
import requests
from bs4 import BeautifulSoup
url = "https://www.splunk.com/en_us/download/splunk-enterprise.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for a in soup.select('a[data-link$=".rpm"]'):
    print(a["data-link"])
Prints:
https://download.splunk.com/products/splunk/releases/9.0.0.1/linux/splunk-9.0.0.1-9e907cedecb1-linux-2.6-x86_64.rpm