I am wondering how I could use BeautifulSoup to parse this style (JavaScript?) of tag attribute in the following HTML code:
<div class="class1" data-prop="{personName: 'Claudia', personCode:'123456'}">
...
</div>
I'm currently following the standard process until I reach the contents of the attribute, which I then parse using regexp. However, I'd like to know if there are better/faster/more elegant options:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
class_element = soup.find("div", class_="class1")
data_props = class_element['data-prop']
# Parsing using regexp goes here
CodePudding user response:
I wouldn't say this is faster than regexp, but it probably takes the fewest lines of code.
To turn this string into a Python dict:
data_props = "{personName: 'Claudia', personCode:'123456'}"
data_as_dict_str = "dict(" + data_props[1:-1].replace(":", "=") + ")"
print(eval(data_as_dict_str))
# {'personName': 'Claudia', 'personCode': '123456'}
Warning: if this attribute contains malicious Python code, it will be executed (in the eval)!
We also cannot use the safe ast.literal_eval here, since it will not allow the name dict to be called.
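To see that restriction in practice, here is a minimal sketch using only the standard library. The helper name try_literal_eval is made up for illustration; it just reports whether ast.literal_eval accepts a string.

```python
import ast

def try_literal_eval(source):
    """Return the parsed literal, or None if ast.literal_eval rejects the input."""
    try:
        return ast.literal_eval(source)
    except (ValueError, SyntaxError):
        return None

# The dict(...) form is a function call, which literal_eval refuses to evaluate.
call_form = "dict(personName='Claudia', personCode='123456')"
# A plain dict literal with quoted keys is accepted.
literal_form = "{'personName': 'Claudia', 'personCode': '123456'}"

rejected = try_literal_eval(call_form)
accepted = try_literal_eval(literal_form)
print(rejected)
print(accepted)
```

So literal_eval is only an option once the string already looks like a Python literal.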
If we want to use ast.literal_eval or json, we need to transform the string so that all key names have quotes around them (json would also need double quotes instead of single ones), and at that point it is easier to just use regexp:
import re
pattern = re.compile(r"(\b\w+\b)\s*:\s*'([^']+)'")
data_props = "{personName: 'Claudia', personCode:'123456'}"
print(dict(pattern.findall(data_props)))
# {'personName': 'Claudia', 'personCode': '123456'}
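For completeness, the quote-the-keys route mentioned above could look roughly like this. It is only a sketch: quote_keys and KEY_PATTERN are hypothetical names, and it assumes that keys are plain identifiers directly following "{" or "," and that no string value itself contains something shaped like ", word:".

```python
import ast
import re

# A key is a run of word characters that follows "{" or "," and precedes ":".
KEY_PATTERN = re.compile(r"([{,]\s*)(\w+)\s*:")

def quote_keys(text):
    """Wrap each unquoted key in single quotes so the text becomes a Python dict literal."""
    return KEY_PATTERN.sub(r"\1'\2':", text)

data_props = "{personName: 'Claudia', personCode:'123456'}"
# literal_eval only accepts literals, so nothing in the attribute can be executed.
parsed = ast.literal_eval(quote_keys(data_props))
print(parsed)
```

Unlike the eval version, this raises an exception on anything that is not a plain literal instead of running it.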
CodePudding user response:
If the issue is simply handling unquoted keys, then you can use the hjson library. I have no idea about its under-the-hood efficiency (how regex is used within the parser, etc.), but it is nice and simple to use at the top level:
import hjson
data = hjson.loads(soup.select_one('.class1')['data-prop'])