How to iterate through list of div at a certain level using BeautifulSoup?

Time:10-22

I'm trying to webscrape the news from the following URL: screenshot of the page structure

CodePudding user response:

As one of the comments mentioned, you can use enter image description here

As you can see above, there is no div id="search" element; in such cases, the commented-out selectors (written for a request sent without headers) might work instead.


Sample usage:

# selectors for headerless request in comments

blockSel = '#search div[eid] div[data-hveid][data-ved] > div[data-hveid]'
# blockSel = '#main > div > div > a'

innerSels = {
    'heading': 'div[role="heading"]', # 'h3',
    'link': (None, 'href'), # (None, 'href'),
    'snippet': 'div[role="heading"] + div', # '"parent"a > div + div',
    'date': 'div[role="heading"] ~ span + div', # 'div + div div > span',
    'site_name': 'g-img + span' # 'h3 + div'
}  

articles = []

# assumes `soup` is a BeautifulSoup object built from the search results page
srSectn = soup.select(blockSel)
srsLen = len(srSectn)
for i, s in enumerate(srSectn): 
    if s.select_one('a[jsname]'): s = s.select_one('a[jsname]')
    print('', end=f'\radding article {i + 1} of {srsLen}...')

    aData = {}
    for k in innerSels:
        sel = innerSels[k]
        target = '"text"'
        if type(sel) in [tuple, list] and len(sel) > 1:
            target = None if sel[1] is None else str(sel[1])
            sel = None if sel[0] is None else str(sel[0])
        
        if type(sel) == str and sel.startswith('"parent"'): 
            sel = s.parent.select_one(sel.replace('"parent"', '', 1))
        else: 
            sel = s if sel is None else s.select_one(sel)
        if sel is None:
            aData[k] = None
            continue

        if target is None: 
            aData[k] = str(sel)
        elif target == '"text"': 
            aData[k] = sel.get_text(strip=True)
        else: aData[k] = sel.get(target)
        
    articles.append(aData)

print(f'\radded {len(articles)} articles from {srsLen} sections')

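The selector-dictionary idea above can be tried offline on a small piece of markup. The following is a simplified sketch, not the answer's exact code: the HTML is hypothetical, extract_fields is an illustrative helper name, and the '"parent"' prefix and the "whole tag as string" case are left out.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely imitating one result block; real pages differ.
html = """
<div id="main">
  <a href="/url?q=https://example.com/story">
    <h3>Example headline</h3>
    <div>Example Site</div>
  </a>
</div>
"""

def extract_fields(block, selectors):
    """Apply {name: selector} or {name: (selector, attribute)} to one block."""
    row = {}
    for name, spec in selectors.items():
        # A bare string means "select this element and take its text".
        sel, attr = spec if isinstance(spec, tuple) else (spec, '"text"')
        # A selector of None means "use the block itself".
        node = block if sel is None else block.select_one(sel)
        if node is None:
            row[name] = None
        elif attr == '"text"':
            row[name] = node.get_text(strip=True)
        else:
            row[name] = node.get(attr)
    return row

soup = BeautifulSoup(html, "html.parser")
selectors = {"heading": "h3", "link": (None, "href"), "site_name": "h3 + div"}
rows = [extract_fields(a, selectors) for a in soup.select("#main a")]
print(rows)
```

Keeping the selectors in one dictionary means that when the page layout changes, only the dictionary needs editing and the loop stays the same.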

CodePudding user response:

In addition, to simplify the selection a bit, you could use every <a> that contains an <h3> as the container for your iterations:

soup.select('a:has(h3)')
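To see what that selector keeps and what it skips, here is a minimal sketch on hypothetical markup (the three links are made up for illustration). :has(h3) matches an <a> with an <h3> anywhere among its descendants, not only as a direct child.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two result-like links and one unrelated link.
html = """
<a href="/url?q=https://example.com/a"><h3>First</h3></a>
<a href="/preferences">Settings</a>
<a href="/url?q=https://example.com/b"><div><h3>Second</h3></div></a>
"""

soup = BeautifulSoup(html, "html.parser")
# Only links that contain an <h3> are selected; "Settings" is skipped.
titles = [a.h3.get_text() for a in soup.select("a:has(h3)")]
print(titles)
```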

Example

This uses cookies={'CONSENT':'YES+'} because setting it is necessary from my location; you may be able to omit it from yours.

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.google.com/search?num=250&q=Apple+innovation+performance&oq=Apple+innovation+performance=1600&source=lnt&tbs=cdr:1,cd_min:1/1/2018,cd_max:12/31/2018&tbm=nws&hl=en-US'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, cookies={'CONSENT': 'YES+'})

data = []

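# name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(response.text, 'html.parser')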
for e in soup.select('a:has(h3)'):
    data.append({
        'title': e.h3.get_text(),
        'date': e.span.get_text(),
        'excerpt': e.br.previous_element,
        'site': e.h3.find_next_sibling('div').get_text() if e.h3.find_next_sibling('div') else None,
        # str.strip() removes a set of characters, not a prefix; removeprefix needs Python 3.9+
        'url': e.get('href').removeprefix('/url?q=')
    })
pd.DataFrame(data)

Output

  | title | date | excerpt | site | url
0 | Apple iPhone 14: Premium Smartphone with Innovation Issues | 4 days ago | The Apple iPhone 14 performs very well in our review and achieves top scores primarily in the performance, display, and camera categories. | NotebookCheck.net | https://www.notebookcheck.net/Apple-iPhone-14-Premium-Smartphone-with-Innovation-Issues.661772.0.html&sa=U&ved=2ahUKEwicrvmP7PD6AhVPDuwKHQqAA5IQxfQBegQIBxAC&usg=AOvVaw3iLsrux3Epc_tFQbj7MUh1
1 | Improvement over innovation: are Apple's latest products a letdown? | 16 hours ago | Improvement over innovation: are Apple's latest products a letdown? ... for those willing to pay more for Pro models, improved performance. | The Oxford Student | https://www.oxfordstudent.com/2022/10/20/improvement-over-innovation-are-apples-latest-products-a-letdown/&sa=U&ved=2ahUKEwicrvmP7PD6AhVPDuwKHQqAA5IQxfQBegQIYxAC&usg=AOvVaw1gwOjOUJox68KMeiPj_CR8
2 | Apple iPad Pro: New generation with M2 chip, Wi-Fi 6E and more | 2 days ago | A 10-core GPU with up to 35 percent faster graphics performance ... Otherwise, the innovations of the new Apple iPad Pro are in the details. | Basic Tutorials | https://basic-tutorials.com/news/apple-ipad-pro-new-generation-with-m2-chip-wi-fi-6e-and-more/&sa=U&ved=2ahUKEwicrvmP7PD6AhVPDuwKHQqAA5IQxfQBegQIYhAC&usg=AOvVaw07b023WuvT4kB1XoO2-yla
3 | Apple ditched Intel and never looked back — are other laptop ... | 15 hours ago | Called the M1 chip, Apple stuffed it inside a Mac Mini, ... products with the most extraordinary performance, most innovative technology,... | Laptop Mag | https://www.laptopmag.com/features/apple-ditched-intel-and-never-looked-back-are-other-laptop-makers-doing-the-same&sa=U&ved=2ahUKEwicrvmP7PD6AhVPDuwKHQqAA5IQxfQBegQIYRAC&usg=AOvVaw3IjNFjvFz_B3B3Xzu767gh
4 | Chinese supply chain able to cope with impact of potential Apple ... | 1 day ago | One of Apple's manufacturers in China has been instructed to immediately ... assessment of the iPhone 14 series' performance in the Chinese... | Global Times | https://www.globaltimes.cn/page/202210/1277497.shtml&sa=U&ved=2ahUKEwicrvmP7PD6AhVPDuwKHQqAA5IQxfQBegQIXhAC&usg=AOvVaw2dUujWfIkidI1-TpdWsqvh

...
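The url values in the output above still carry the redirect's tracking parameters (&sa=U&ved=...&usg=...). One possible way to recover only the target address is to parse the href as a query string, sketched below with the standard library. clean_google_href is an illustrative helper name, and the sketch assumes the href has the form /url?q=<target>&sa=... with any "&" inside the target percent-encoded.

```python
from urllib.parse import urlsplit, parse_qs

def clean_google_href(href):
    """Return the target URL from a '/url?q=<target>&sa=...' redirect href."""
    parts = urlsplit(href)
    # Leave anything that is not a '/url' redirect untouched.
    if parts.path != "/url":
        return href
    targets = parse_qs(parts.query).get("q")
    return targets[0] if targets else href

# Shortened, made-up tracking values; the target is taken from the output above.
href = "/url?q=https://www.globaltimes.cn/page/202210/1277497.shtml&sa=U&ved=abc&usg=xyz"
print(clean_google_href(href))
```

In the loop this could replace the prefix handling, for example 'url': clean_google_href(e.get('href')).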
