Iterating through an html dictionary to scrape content(td and adjacent element) from each html conta

Time:12-10

I need to iterate through every HTML page referenced in the data dictionary built below and scrape the td element containing "Ένδικα Μέσα" (Greek for "legal remedies"), together with the content of its adjacent cell. Thank you.

This is the code I am working on:

from bs4 import BeautifulSoup
import requests

URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae", 
    "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content,'html.parser')

baseUrl = 'https://www.epant.gr'

data = {}

for href in [x['href'] for x in soup.select('a[href*=category]:has(span)')]:
    page = requests.get(f'{baseUrl}{href}', headers = headers)
    soup = BeautifulSoup(page.content,'html.parser')
    data[href.split('-')[-1].split('.')[0]] = {
        'url': f'{baseUrl}{href}'
    }
    data[href.split('-')[-1].split('.')[0]]['cases'] = [f'{baseUrl}{x["href"]}' for x in soup.select('h3 a')]
    
# Search every case HTML page for the "Ένδικα Μέσα" content

for key, info in data.items():
    # data maps each key to {'url': ..., 'cases': [...]}, so loop over the case URLs
    for case_url in info['cases']:
        page = requests.get(case_url, headers = headers)
        soup = BeautifulSoup(page.content, "html.parser")
        cell = soup.find('td', text = "Ένδικα Μέσα")
        # find() returns None when the page has no such cell
        if cell is not None:
            print(case_url)
            row = cell.parent.get_text(strip=True)
            print(row)

P.S.: If my post needs editing or formatting please let me know. Thank you.

EDIT: When I ran the code you (HedgeHog) provided, I got an SSL exception error.

I searched for a solution and came across this:

 proxy = 'http://78.130.136.2:8080'

With it my code runs perfectly. Thank you!
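
The post does not show how the proxy was wired into the requests, so the following is only a sketch of one common way to do it with the `proxies` argument of `requests.get`. The helper names `build_proxies` and `fetch` are hypothetical, the timeout value is an assumption, and the proxy address is simply the one quoted above, which may no longer be reachable.

```python
import requests

# Proxy address quoted in the post; treat it as a placeholder.
PROXY = 'http://78.130.136.2:8080'


def build_proxies(proxy):
    """Map both URL schemes to the same proxy, the dict format requests expects."""
    return {'http': proxy, 'https': proxy}


def fetch(url, headers=None, proxy=PROXY, timeout=30):
    """Fetch url through the proxy (hypothetical helper, not from the original post)."""
    response = requests.get(
        url,
        headers=headers,
        proxies=build_proxies(proxy),
        timeout=timeout,
    )
    # Raise an exception on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response
```

With something like this in place, each `requests.get(...)` call in the scraper could be swapped for `fetch(url, headers=headers)`.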

CodePudding user response:

Okay, now I have a rough idea of what you are trying to do. You won't need the dict if you just want to scrape some information from the cases; you can collect everything as you go through the pages.

Example

from bs4 import BeautifulSoup
import requests

URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae", 
    "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content,'html.parser')

baseUrl = 'https://www.epant.gr'

for href in [x['href'] for x in soup.select('a[href*=category]:has(span)')]:
    page = requests.get(f'{baseUrl}{href}', headers = headers)
    soup = BeautifulSoup(page.content,'html.parser')

    urls = [f'{baseUrl}{x["href"]}' for x in soup.select('h3 a')]

    for url in urls:
        page = requests.get(url, headers = headers)
        soup = BeautifulSoup(page.content,'html.parser')
        # Look the label cell up once; it may be missing on a page
        cell = soup.find('td', text = "Ένδικα Μέσα")
        row = cell.parent.get_text(strip=True) if cell else None
        case = soup.find('h2').text.strip()
        year = case.split('/')[-1]
        print(f'{year},{case},{row},{url}')

Output

2021,Απόφαση 749/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1578-apofasi-749-2021.html
2021,Απόφαση 743/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1633-apofasi-743-2021.html
2021,Απόφαση 738/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1575-apofasi-738-2021.html
2021,Απόφαση 737/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1624-apofasi-737-2021.html
2021,Απόφαση 735/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1510-apofasi-735-2021.html
2021,Απόφαση 733/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1595-apofasi-733-2021.html
2021,Απόφαση 732/2021,Ένδικα ΜέσαΟριστική απόφαση. Δεν έχουν ασκηθεί ένδικα μέσα.,https://www.epant.gr/apofaseis-gnomodotiseis/item/1600-apofasi-732-2021.html
...
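
In the output above, the label and the value of the adjacent cell are fused into one string because the text of the whole row is taken. If only the neighbouring cell is wanted, a sketch along these lines might work. It assumes the value sits in the next `td` of the same row; the sample markup and the function name `adjacent_cell_text` are made up for illustration and the real pages may be structured differently. It uses `string=`, the newer name of the `text=` argument in Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for one case page.
SAMPLE_HTML = """
<table>
  <tr><td>Ένδικα Μέσα</td><td>Οριστική απόφαση. Δεν έχουν ασκηθεί ένδικα μέσα.</td></tr>
</table>
"""


def adjacent_cell_text(html, label):
    """Return the text of the <td> following the <td> whose text equals label, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    cell = soup.find('td', string=label)
    if cell is None:
        return None  # the page has no cell with this label
    sibling = cell.find_next_sibling('td')
    return sibling.get_text(strip=True) if sibling else None


print(adjacent_cell_text(SAMPLE_HTML, 'Ένδικα Μέσα'))
```

Returning `None` when the label or its neighbour is missing keeps the loop over all case URLs from stopping on pages that lack the row.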