How can I retrieve the value of a non-standard keyword attribute using a regex to match the attribute's

Time:08-13

In a private project (learning Python scripting), I needed to retrieve only the RPM package link from a scraped page. I noticed that all package links (.msi, .deb, .rpm) have an attribute called data-link inside the 'a' tag.

I also tailored my own regex (https://regexr.com/6rqd2) to match only the package I need.

According to documentation, it seems that this kind of attribute (data-*) is a non-standard attribute in HTML 5.

So I tried passing the attrs argument to find_all(), but with no success.

Unsuccessful code below:

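#!/usr/bin/env python3

import re

import requests
from bs4 import BeautifulSoup

url = "https://www.splunk.com/en_us/download/splunk-enterprise.html"

# Download the page and parse it
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Regex meant to capture the .rpm URL that follows data-link="
pattern = re.compile("(?<=data-link=\")[^ ]+rpm")

# Try to match the data-link attribute against the regex
package = soup.find_all(attrs={"data-link": pattern})
print(package)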

Thank you in advance for your help

CodePudding user response:

Do you need all the features Beautiful Soup provides? If not, running your pattern directly against the page text should find the links as required:

re.findall(pattern, page.text)
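As a rough offline illustration of this regex-only approach, the sketch below runs a lookbehind pattern against a small HTML string. The markup and the example.com URLs are made up for the demo, and the pattern is a slightly tightened variant (it also excludes the closing quote), not necessarily the one from the question.

```python
import re

# Hypothetical HTML fragment standing in for the downloaded page.
html = '''
<a data-link="https://example.com/pkg/tool-1.0-x86_64.rpm" href="#">RPM</a>
<a data-link="https://example.com/pkg/tool-1.0-amd64.deb" href="#">DEB</a>
<a data-link="https://example.com/pkg/tool-1.0-x64.msi" href="#">MSI</a>
'''

# The lookbehind anchors the match right after data-link=" and
# [^" ]+ cannot cross the closing quote, so only the URL is returned.
rpm_pattern = re.compile(r'(?<=data-link=")[^" ]+\.rpm')

links = rpm_pattern.findall(html)
print(links)
```

Here the lookbehind is useful because the regex sees the raw markup, attribute name included.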

CodePudding user response:

You don't need to include data-link in your expression: find_all() tests the pattern against the attribute's value, not against the full element, so the regex only has to match the value:

soup.find_all(
    "a",
    {"data-link": re.compile(r"^(.+?)\.rpm$")},
)
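To make the difference concrete, here is a minimal sketch on a made-up HTML fragment (the example.com URLs are placeholders). It assumes, as described above, that Beautiful Soup searches the compiled pattern in the attribute value only.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the real page is not fetched here.
html = '''
<a data-link="https://example.com/pkg/tool-1.0-x86_64.rpm" href="#">RPM</a>
<a data-link="https://example.com/pkg/tool-1.0-amd64.deb" href="#">DEB</a>
'''
soup = BeautifulSoup(html, "html.parser")

# The value never contains the text data-link=", so the lookbehind fails.
with_lookbehind = re.compile(r'(?<=data-link=")[^ ]+rpm')
failed = soup.find_all("a", attrs={"data-link": with_lookbehind})

# Matching on the value alone finds the tag.
value_only = re.compile(r"\.rpm$")
rpm_links = [a["data-link"] for a in soup.find_all("a", attrs={"data-link": value_only})]

print(failed)
print(rpm_links)
```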

CodePudding user response:

Another solution, using CSS selectors:

import requests
from bs4 import BeautifulSoup


url = "https://www.splunk.com/en_us/download/splunk-enterprise.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for a in soup.select('a[data-link$=".rpm"]'):
    print(a["data-link"])

Prints:

https://download.splunk.com/products/splunk/releases/9.0.0.1/linux/splunk-9.0.0.1-9e907cedecb1-linux-2.6-x86_64.rpm
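If the other package types mentioned in the question are needed as well, the same `$=` suffix selector could be reused once per extension. This is only a sketch on invented markup; the `by_ext` dictionary and the example.com URLs are illustrative names, not part of the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one link per package type.
html = '''
<a data-link="https://example.com/pkg/tool-1.0-x86_64.rpm" href="#">RPM</a>
<a data-link="https://example.com/pkg/tool-1.0-amd64.deb" href="#">DEB</a>
<a data-link="https://example.com/pkg/tool-1.0-x64.msi" href="#">MSI</a>
'''
soup = BeautifulSoup(html, "html.parser")

# Collect data-link values grouped by file extension.
by_ext = {}
for ext in (".rpm", ".deb", ".msi"):
    selector = f'a[data-link$="{ext}"]'
    by_ext[ext] = [a["data-link"] for a in soup.select(selector)]

print(by_ext[".rpm"])
```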