Extract data from HTML to Dictionary

Time:10-05

I want to extract data from a web page in an organized way: the text should be stored in a dictionary whose keys are the class names and whose values are lists of the texts found in elements of that class. Is this possible? I tried Selenium and BeautifulSoup but did not find a solution. Do you have any advice?

Example:

<div class="a1">
    ...
    <H1 class="title">1_title</H1>
    ...
        <a class="text">this is a text 1</a>
    ...
    <a class="text">this is a text 2</a>
    <H2 class="title">2_title</H2>
    ...
        <p class="note"> this is a note</p>
</div>

Desired output:

dict = {
    "title" : ["1_title", "2_title"],
    "text"  : ["this is a text 1", "this is a text 2"],
    "note"  : ["this is a note"]
}

CodePudding user response:

There are many ways to do that, but the best practice is to use either CSS or XPath selectors.

In Python, popular packages for this are beautifulsoup4 and parsel. Here is an example using the parsel package and XPath selectors:

from parsel import Selector

body = """
<div class="a1">
    <h1 class="title">1_title</h1>
    <a class="text">this is a text 1</a>
    <a class="text">this is a text 2</a>
    <h2 class="title">2_title</h2>
    <p class="note"> this is a note</p>
</div>
"""

selector = Selector(body)
result = {
    "title": selector.xpath('//*[self::h1 or self::h2]/text()').extract(),
    "text": selector.xpath('//a/text()').extract(),
    "note": selector.xpath('//p/text()').extract(),
}

print(result)

Which results in:

{'title': ['1_title', '2_title'], 'text': ['this is a text 1', 'this is a text 2'], 'note': [' this is a note']}
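
The XPath expressions above select by tag name, whereas the question asks for grouping by class. In XPath that is an attribute predicate such as //*[@class="title"]. The following is only a sketch of that idea, using the standard library's ElementTree instead of parsel so it runs without extra installs; the helper name texts_by_class is hypothetical, and ElementTree supports only exact attribute matches and only well-formed markup.

```python
import xml.etree.ElementTree as ET

body = """
<div class="a1">
    <h1 class="title">1_title</h1>
    <a class="text">this is a text 1</a>
    <a class="text">this is a text 2</a>
    <h2 class="title">2_title</h2>
    <p class="note"> this is a note</p>
</div>
"""

root = ET.fromstring(body)


def texts_by_class(root, class_name):
    """Return the stripped text of every descendant whose class equals class_name.

    Hypothetical helper for illustration; it matches the class attribute exactly.
    """
    matches = root.findall(f".//*[@class='{class_name}']")
    return [(e.text or "").strip() for e in matches]


# Build the dictionary from an explicit list of class names.
result = {name: texts_by_class(root, name) for name in ("title", "text", "note")}
print(result)
```

Selecting by class rather than by tag keeps the result stable if, for example, a title later moves from an h2 to an h3.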

CodePudding user response:

This needs no external library. The idea is to treat the HTML as XML and use the built-in parser to extract the data. Note that this only works when the markup is well-formed XML.

import xml.etree.ElementTree as ET
from collections import defaultdict
body = """
<div class="a1">
    <h1 class="title">1_title</h1>
    <a class="text">this is a text 1</a>
    <a class="text">this is a text 2</a>
    <h2 class="title">2_title</h2>
    <div><a class="xyz">text3</a></div>
    <p class="note"> this is a note!!</p>
    <p>no class</p>
</div>
"""

data = defaultdict(list)
root = ET.fromstring(body)
# Visit every descendant that has a class attribute and group its text by class.
for e in root.findall('.//*[@class]'):
    data[e.attrib['class']].append(e.text)
for k, v in data.items():
    print(f'{k} -> {v}')

Output:

    title -> ['1_title', '2_title']
    text -> ['this is a text 1', 'this is a text 2']
    xyz -> ['text3']
    note -> [' this is a note!!']
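
ElementTree raises a ParseError on markup that is not well-formed XML, such as an unclosed <br> or <p>, which is common on real pages. As a rough sketch under that assumption, the standard library's html.parser can do the same grouping more leniently. The names ClassTextCollector and extract_by_class are hypothetical, and the sketch only records text that directly follows an opening tag (similar to e.text above); an element with several classes is listed under each of them.

```python
from collections import defaultdict
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Group text by the class of the most recently opened element."""

    def __init__(self):
        super().__init__()
        self.data = defaultdict(list)
        self._current = None  # class attribute of the last opened tag, if any

    def handle_starttag(self, tag, attrs):
        self._current = dict(attrs).get("class")

    def handle_endtag(self, tag):
        # Text after a closing tag is not attributed to any class.
        self._current = None

    def handle_data(self, text):
        text = text.strip()
        if self._current and text:
            # class="a b" contributes the text to both "a" and "b".
            for name in self._current.split():
                self.data[name].append(text)


def extract_by_class(html):
    parser = ClassTextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return dict(parser.data)


# Not well-formed XML: <br> and <p> are never closed.
body = """
<div class="a1">
    <h1 class="title">1_title</h1>
    <a class="text">this is a text 1</a>
    <br>
    <a class="text big">this is a text 2</a>
    <h2 class="title">2_title</h2>
    <p class="note"> this is a note
</div>
"""

result = extract_by_class(body)
print(result)
```

The outer div's class a1 does not appear in the result because only whitespace follows its opening tag.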

CodePudding user response:

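With Selenium, collect the text of every element that has a class attribute:

from collections import defaultdict
from selenium.webdriver.common.by import By

# Assumes `driver` is a Selenium WebDriver that has already loaded the page.
d = defaultdict(list)
for ele in driver.find_elements(By.XPATH, "//*[@class]"):
    # Append rather than assign, so repeated classes keep every text.
    d[ele.get_attribute("class")].append(ele.text)
print(dict(d))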