How can I retrieve the value of a non-standard keyword attribute using a regex to match the attribute's value?

In a private project (learning Python scripting), I needed to retrieve only the rpm package from the scraped page. I spotted that all package links (.msi, .deb, .rpm) have an attribute called data-link inside the 'a' tag.

I also tailored my own regex (https://regexr.com/6rqd2) to match only the package I need.

According to the Beautiful Soup documentation, this kind of attribute (data-*) in HTML 5 has a name that cannot be used as a keyword argument.

So I tried passing the attrs argument to find_all(), but with no success.

Unsuccessful code below:

#!/usr/bin/env python3
import re

import requests
from bs4 import BeautifulSoup

url = "https://www.splunk.com/en_us/download/splunk-enterprise.html"

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Regex tailored on regexr.com to match the .rpm link
pattern = re.compile("(?<=data-link=\")[^ ]+rpm")

# Look for tags whose data-link attribute matches the pattern
package = soup.find_all(attrs={"data-link": pattern})
print(package)

Thank you in advance for your help

CodePudding user response：

Do you need all the features Beautiful Soup provides? If not, the line below should find the links as required, applying the pattern from the question directly to the page text:

re.findall(pattern, page.text)
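A minimal offline sketch of this regex-only approach, assuming the question's pattern was meant to read `[^ ]+rpm` and using a made-up HTML fragment in place of the downloaded page (the example.com URLs are hypothetical):

```python
import re

# Hypothetical fragment standing in for the downloaded page text.
html = (
    '<a class="btn" data-link="https://example.com/pkg-1.0-x86_64.rpm" href="#">rpm</a>\n'
    '<a class="btn" data-link="https://example.com/pkg-1.0-amd64.deb" href="#">deb</a>\n'
)

# The lookbehind anchors the match right after data-link=", then the
# pattern takes the run of non-space characters that ends in "rpm".
pattern = re.compile(r'(?<=data-link=")[^ ]+rpm')

rpm_links = pattern.findall(html)
print(rpm_links)
```

This works only as long as the markup keeps the attribute written exactly as data-link="..." with double quotes, so it is more fragile than letting a parser read the attribute.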

CodePudding user response：

You don't need to include data-link in your expression. You're searching by the value of the attribute, so the regex is matched against that value only, not against the full element:

soup.find_all(
    "a",
    {"data-link": re.compile(r"^(.+?)\.rpm$")},
)
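To see the difference side by side, here is a small sketch on a made-up HTML fragment (the example.com URLs are hypothetical, and the question's pattern is assumed to read `[^ ]+rpm`). It compares a pattern that expects the attribute name with one that describes the value alone:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the real page.
html = """
<a data-link="https://example.com/pkg-1.0-x86_64.rpm" href="#">rpm</a>
<a data-link="https://example.com/pkg-1.0-amd64.deb" href="#">deb</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Expects the literal text data-link=" before the URL. The attribute
# value never contains that text, so the lookbehind cannot succeed.
with_name = re.compile(r'(?<=data-link=")[^ ]+rpm')
# Describes the attribute value alone.
value_only = re.compile(r"\.rpm$")

no_hits = soup.find_all("a", attrs={"data-link": with_name})
hits = soup.find_all("a", attrs={"data-link": value_only})

print(len(no_hits))
print([a["data-link"] for a in hits])
```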

CodePudding user response：

Another solution, using CSS selectors:

import requests
from bs4 import BeautifulSoup


url = "https://www.splunk.com/en_us/download/splunk-enterprise.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for a in soup.select('a[data-link$=".rpm"]'):
    print(a["data-link"])

Prints:

https://download.splunk.com/products/splunk/releases/9.0.0.1/linux/splunk-9.0.0.1-9e907cedecb1-linux-2.6-x86_64.rpm
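The selector can also be checked without a network request. A minimal sketch on a made-up fragment (the example.com URLs are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one rpm link, one deb link, and one <a>
# that ends in .rpm but has no data-link attribute.
html = """
<a data-link="https://example.com/pkg-1.0-x86_64.rpm">rpm</a>
<a data-link="https://example.com/pkg-1.0-amd64.deb">deb</a>
<a href="https://example.com/notes.rpm">no data-link here</a>
"""
soup = BeautifulSoup(html, "html.parser")

# [data-link$=".rpm"] keeps <a> tags whose data-link value ends with ".rpm".
rpm_links = [a["data-link"] for a in soup.select('a[data-link$=".rpm"]')]
print(rpm_links)
```

Only the first tag is selected: the second has the wrong suffix and the third lacks the data-link attribute altogether.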