My input (the variable is a string):
<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>
My expected output:
{
'href': 'https://wikipedia.org/',
'rel': 'nofollow ugc',
'text': 'wiki',
}
How can I do this in Python without using the BeautifulSoup library?
Please show how to do it with the lxml library.
CodePudding user response:
Solution with lxml (but without bs!):
from lxml import etree
xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
root = etree.fromstring(xml)
print(root.attrib)
>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc'}
But there's no text attribute.
You can extract it by using the text property:
print(root.text)
>>> wiki
In conclusion:
from lxml import etree
xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
root = etree.fromstring(xml)
dict_ = {}
dict_.update(root.attrib)
dict_.update({'text': root.text})
print(dict_)
>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
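The steps above can be wrapped in a small function. This is only a sketch: the name tag_to_dict is hypothetical, the fallback to the standard library's xml.etree.ElementTree is an assumption added so the snippet also runs where lxml is not installed (it offers the same fromstring, attrib and text calls), and replacing a missing text with an empty string is a design choice, not something the question asks for.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the stdlib parser, which supports the calls used below.
    import xml.etree.ElementTree as etree


def tag_to_dict(markup):
    """Parse a single tag and return its attributes plus its text (hypothetical helper)."""
    root = etree.fromstring(markup)
    result = dict(root.attrib)        # copy, so the parsed element is left untouched
    result["text"] = root.text or ""  # root.text is None when the tag has no text
    return result


print(tag_to_dict('<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'))
print(tag_to_dict('<a href="https://wikipedia.org/"/>'))
```

The second call shows the empty-tag case, where root.text would otherwise be None.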
EDIT
-------parsing [X]HTML with regex is discouraged!-------
Solution with regex:
import re
pattern_text = r"[>](\w+)[<]"
pattern_href = r'href="(\w\S+)"'
pattern_rel = r'rel="([A-Za-z ]+)"'
xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
dict_ = {
'href': re.search(pattern_href, xml).group(1),
'rel': re.search(pattern_rel, xml).group(1),
'text': re.search(pattern_text, xml).group(1)
}
print(dict_)
>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
This works if the input is a string.
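If regex feels too fragile and neither lxml nor BeautifulSoup is wanted, the standard library's html.parser module is another option. The following is a minimal sketch, not part of the answer above; the class name AnchorParser and its result attribute are hypothetical, and it only looks at the first a tag it meets.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Collects the attributes and text of the first <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.result = None     # filled in when the first <a> is seen
        self._inside = False   # True while between <a> and </a>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.result is None:
            self.result = dict(attrs)   # attrs is a list of (name, value) pairs
            self.result["text"] = ""
            self._inside = True

    def handle_data(self, data):
        if self._inside:
            self.result["text"] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False


parser = AnchorParser()
parser.feed('<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>')
parser.close()
print(parser.result)
# {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
```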
CodePudding user response:
With BeautifulSoup you can use .attrs to get a dict of a tag's attributes:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>', 'html.parser')
soup.a.attrs
--> {'href': 'https://wikipedia.org/', 'rel': ['nofollow', 'ugc']}
To also get the text:
...
data = soup.a.attrs
data.update({'text':soup.a.text})
print(data)
--> {'href': 'https://wikipedia.org/', 'rel': ['nofollow', 'ugc'], 'text': 'wiki'}
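Note that BeautifulSoup returns multi-valued attributes such as rel as a list, while the expected output in the question has a plain string. One possible way to convert, sketched here with a hypothetical helper named join_multi_valued and the dict printed above as input:

```python
def join_multi_valued(attrs):
    """Return a copy of attrs with list values joined into space-separated strings."""
    return {
        key: " ".join(value) if isinstance(value, list) else value
        for key, value in attrs.items()
    }


# The dict as printed above, with rel as a list
data = {'href': 'https://wikipedia.org/', 'rel': ['nofollow', 'ugc'], 'text': 'wiki'}
print(join_multi_valued(data))
# {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
```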
CodePudding user response:
This is how you do it with lxml:
from lxml import etree
html = '''<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'''
root = etree.fromstring(html)
# copy the attributes into a plain dict so the parsed element itself is not modified
attrib_dict = dict(root.attrib)
attrib_dict['text'] = root.text
print(attrib_dict)
Result:
{'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
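One caveat with root.text: it only holds the text before the first child element. If the link contains nested markup, itertext() collects all of it. A small sketch; the sample markup with a b tag is made up for illustration, and the fallback import is an assumption so the snippet also runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib parser also provides text and itertext().
    import xml.etree.ElementTree as etree

# Hypothetical input with a nested tag inside the link
root = etree.fromstring('<a href="https://wikipedia.org/">the <b>free</b> encyclopedia</a>')

print(repr(root.text))            # 'the '  (stops at the nested <b>)
full_text = "".join(root.itertext())
print(repr(full_text))            # 'the free encyclopedia'
```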