How to parse tag attributes with BeautifulSoup4?

I am wondering how I could use BeautifulSoup to parse tag attributes written in this (JavaScript-like?) style, as in the following HTML:

<div class="class1" data-prop="{personName: 'Claudia', personCode:'123456'}">
...
</div>

I'm currently following the standard process until I reach the contents of the attribute, which I then parse with a regexp. However, I'd like to know whether there are better, faster, or more elegant options:

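from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')
class_element = soup.find("div", class_="class1")
# A hyphen is not valid in a Python identifier, so the variable uses an underscore
data_props = class_element['data-prop']
# Parsing using regexp goes here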

CodePudding user response：

I wouldn't say that this is faster than a regexp, but it is probably the approach that takes the fewest lines of code.
To turn this string into a Python dict:

data_props = "{personName: 'Claudia', personCode:'123456'}"

data_as_dict_str = "dict(" + data_props[1:-1].replace(":", "=") + ")"

print(eval(data_as_dict_str))
# {'personName': 'Claudia', 'personCode': '123456'}

Warning: if this attribute contains malicious Python code, it will be executed by the eval!
We also cannot use the safe ast.literal_eval here, since it only accepts literals and will not allow the name dict to be called.
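
To illustrate that last point, here is a small sketch (using only the standard library and the same example string) of what happens when the dict(...) string is handed to ast.literal_eval. The rejected flag is just a name I made up for the demonstration:

```python
import ast

data_props = "{personName: 'Claudia', personCode:'123456'}"
data_as_dict_str = "dict(" + data_props[1:-1].replace(":", "=") + ")"

try:
    ast.literal_eval(data_as_dict_str)
    rejected = False
except ValueError:
    # literal_eval only accepts literal structures; a call such as dict(...) is refused
    rejected = True

print(rejected)
# True
```

So the string is never evaluated, which is safe, but it also means this particular trick only works with plain eval.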

If we want to use ast.literal_eval or json, we need to transform the string so that every key has quotes around it (json additionally requires double quotes rather than single quotes). At that point it is easier to just use a regexp on its own:

import re

pattern = re.compile(r"(\b\w+\b)\s*:\s*'([^']+)'")

data_props = "{personName: 'Claudia', personCode:'123456'}"

print(dict(pattern.findall(data_props)))
# {'personName': 'Claudia', 'personCode': '123456'}
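
For completeness, the "quote the keys first, then use ast.literal_eval" route mentioned above could look roughly like this. It is only a sketch: parse_js_object is a hypothetical helper name, and the regexp assumes keys are simple identifiers and that no string value contains something shaped like ", word:" (which would be rewritten by mistake):

```python
import ast
import re


def parse_js_object(text):
    """Quote bare keys in a JS-style object literal, then evaluate it safely."""
    # A key is an identifier that follows "{" or "," and precedes ":"
    quoted = re.sub(r"([{,]\s*)([A-Za-z_]\w*)\s*:", r"\1'\2':", text)
    # literal_eval only builds literals, so no arbitrary code is executed
    return ast.literal_eval(quoted)


data_props = "{personName: 'Claudia', personCode:'123456'}"
result = parse_js_object(data_props)
print(result)
# {'personName': 'Claudia', 'personCode': '123456'}
```

Unlike the findall version, this one also keeps non-string values such as numbers or nested lists, since literal_eval handles them.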

CodePudding user response：

If the issue is simply handling unquoted keys, then you can use the hjson library. I have no idea about the under-the-hood efficiency (how regex is used within the parser, etc.), but it is nice and simple to use at the top level:

import hjson

data = hjson.loads(soup.select_one('.class1')['data-prop'])