Parsing text from HTML source code from similarly named blocks

Time:11-15

<dl class="book__details-item">
    <dt class="book__details-name">
       Место издания:
    </dt>
    <dt class="book__details-value">
       Москва
    </dt>
</dl>
<dl class="book__details-item">
    <dt class="book__details-name">
    Издательство:
    </dt>
    <dt class="book__details-value">
        <a href="/publishers/5558/" target="blank">Манн, Иванов и Фербер</a>
    </dt>
</dl>
<dl class="book__details-item">
    <dt class="book__details-name">
      Год издания:
    </dt>
    <dt class="book__details-value">
      2021
    </dt>
    </dl>
    <dt class="book__details-name">
      Год издания:
    </dt>
     <dt class="book__details-value">
     2021
     </dt>

Hi. I have a bookstore website here. I need to extract the publishing year, but I can't get to it, because every element of the book's description is placed in similarly named blocks.

def get_html(url, params=None):
    r = requests.get(url, headers = HEADERS, params = params)
    return r

def get_content(html): # Here's the part where it gets confusing
    years = []
    soup = BeautifulSoup(html, "html.parser")
    items = soup.find("div", class_="book__details-left")
    smalleritems = items.find("dl", class_="book__details-item")
    smalleritems = smalleritems.find("dt", class_="book__details-value")
    smalleritems = smalleritems.get_text()
    print(smalleritems)

def parse(URL):
    html = get_html(URL)
    if html.status_code == 200:
        midlinklist = get_content(html.text)
        return midlinklist
    else:
        print("Error")

for URL in final_linklist:
    print (str(URL))
    print("Парсинг", page, "страниц из", len(final_linklist) - page)
    page = page + 1
    midl = parse(str(URL))
    for pubs in midl:
        final_publist.append(pubs)

My code is not finished, because I can't quite get to the text "2021" inside

<dt class="book__details-value">
        2021
  </dt>

CodePudding user response:

You can get the last tag under the class book__details-value using [-1] indexing:

soup = BeautifulSoup(html, "html.parser")
print(soup.find_all("dt", class_="book__details-value")[-1].get_text(strip=True))

Output:

2021
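
Taking the last match works for the HTML shown, but it relies on the year being the final detail on the page. A position-independent alternative, offered only as a rough sketch, is to walk every dl block and collect name/value pairs into a dictionary. The SAMPLE_HTML string and the get_details helper below are hypothetical names made up for illustration, and the markup is a condensed copy of the snippet in the question:

```python
from bs4 import BeautifulSoup

# Condensed copy of the markup from the question (assumed structure).
SAMPLE_HTML = """
<div class="book__details-left">
<dl class="book__details-item">
    <dt class="book__details-name">Место издания:</dt>
    <dt class="book__details-value">Москва</dt>
</dl>
<dl class="book__details-item">
    <dt class="book__details-name">Издательство:</dt>
    <dt class="book__details-value">
        <a href="/publishers/5558/" target="blank">Манн, Иванов и Фербер</a>
    </dt>
</dl>
<dl class="book__details-item">
    <dt class="book__details-name">Год издания:</dt>
    <dt class="book__details-value">2021</dt>
</dl>
</div>
"""


def get_details(html):
    """Map each detail label (without the trailing colon) to its text value."""
    soup = BeautifulSoup(html, "html.parser")
    details = {}
    for item in soup.find_all("dl", class_="book__details-item"):
        name = item.find("dt", class_="book__details-name")
        value = item.find("dt", class_="book__details-value")
        # Skip blocks that lack either half of the pair.
        if name is None or value is None:
            continue
        label = name.get_text(strip=True).rstrip(":")
        details[label] = value.get_text(strip=True)
    return details


details = get_details(SAMPLE_HTML)
print(details.get("Год издания"))
```

With the dictionary in hand, the other fields (place of publication, publisher) can be read by label the same way, and a missing label yields None from details.get instead of an exception.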

CodePudding user response:

I would anchor to the correct preceding dt by its class and the text it contains, i.e. Год издания: (translated: "Year of publication:"). Once that is matched, use an adjacent sibling combinator (+) to move to the adjacent element which has the class book__details-value:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('<individual book url>')
soup = bs(r.content, 'lxml')
print(int(soup.select_one('.book__details-name:-soup-contains("Год издания:") + .book__details-value').text.strip()))
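
To try the selector without hitting the site, the same idea can be run against an inline copy of the markup. This is a minimal sketch under a few assumptions: SAMPLE_HTML and get_year are hypothetical names, the built-in "html.parser" is used instead of lxml to avoid the extra dependency, and the installed Soup Sieve is recent enough to support :-soup-contains. It also returns None instead of raising when the label is not on the page:

```python
from bs4 import BeautifulSoup

# Condensed copy of the markup from the question (assumed structure).
SAMPLE_HTML = """
<dl class="book__details-item">
    <dt class="book__details-name">Место издания:</dt>
    <dt class="book__details-value">Москва</dt>
</dl>
<dl class="book__details-item">
    <dt class="book__details-name">Год издания:</dt>
    <dt class="book__details-value">
      2021
    </dt>
</dl>
"""

YEAR_SELECTOR = (
    '.book__details-name:-soup-contains("Год издания:")'
    " + .book__details-value"
)


def get_year(html):
    """Return the publishing year as an int, or None if it cannot be found."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one(YEAR_SELECTOR)
    if tag is None:
        return None  # label missing on this page
    text = tag.get_text(strip=True)
    return int(text) if text.isdigit() else None


print(get_year(SAMPLE_HTML))
print(get_year("<p>no details here</p>"))
```

Returning None rather than calling int() on a missing tag keeps a loop over many book URLs from stopping at the first page that lacks a year.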