How to display concatenated values resulting from two different loops over scraped URLs?

Time:09-04

I'm pretty sure the solution is easy, but I can't manage to find it: I nested loops inside loops to scrape all the URLs on a page.

For #1: Product attributes

  • I can't manage to display, on the same line, all the values produced by the loops over attribzF and valuezZF. If I print(attribzF, valuezZF), I only get the 1st value of the loop (whereas I should have 5).

For #2: Product description

How can I extract a specific <p> in an <article> that contains 5 of them? I can get the text from all the <p> elements, but not from a single one. How do you differentiate them?

Thanks a lot, mates, for the help!

import requests
from bs4 import BeautifulSoup       


url='http://books.toscrape.com/catalogue/category/books/mystery_3/index.html'
u = requests.get(url)

soup = BeautifulSoup(u.content, 'html.parser')

for link in soup.findAll('article', {"class" : 'product_pod'}) :
    links = link.findAll('a')


    for lien in links :
        lienFinale = lien.get('href')
        lienComp = "http://books.toscrape.com/catalogue/" + lienFinale.strip('../../../')
        lienComp1 = lienComp.split(',')

        for l in lienComp1 :
            r=requests.get(l)
            soup2 = BeautifulSoup(r.content,'html.parser')
       

        #1. PRODUCT ATTRIBUTES :
        
            soupAp = soup2.findAll('table', class_='table table-striped')

            for attrib in soupAp :
                attribF = attrib.findAll('th')
                
                for attribz in attribF : 
                    attribzF = attribz.string
                         
                                       
            for valuez in soupAp :
                valuezF = valuez.findAll('td')
                
                for valuezZ in valuezF :
                    valuezZF = valuezZ.string        
          
                print(attribzF,valuezZF) 

            
        #2. DESCRIPTION : 

            descrip = soup2.find('article', class_="product_page") 
            descripFinal = descrip.findAll('p')

            for data in descripFinal :
                print(data.get_text())

CodePudding user response:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from pprint import pp


def get_soup(content):
    return BeautifulSoup(content, 'lxml')


def main(url):
    with requests.Session() as req:
        r = req.get(url)
        soup = get_soup(r.content).select('ol.row h3 > a')
        links = (urljoin(url, i['href']) for i in soup)
        for link in links:
            r = req.get(link)
            soup = get_soup(r.content)
            goal = soup.select_one('.table-striped').stripped_strings
            data = dict(zip(goal, goal))
            data['Description'] = soup.select_one(
                '#product_description + p').get_text(strip=True)
            pp(data)
            break


main('http://books.toscrape.com/catalogue/category/books/mystery_3/index.html')
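
The dict(zip(goal, goal)) line may look odd at first. A minimal sketch of the idiom, using a plain iterator as a stand-in for stripped_strings (the name cells is an assumption; the sample strings are taken from the output shown further down this page):

```python
# Stand-in for `.stripped_strings`: a one-shot iterator that alternates
# header text and value text, in document order.
cells = iter(['UPC', 'e00eb4fd7b871a48',
              'Product Type', 'Books',
              'Tax', '£0.00'])

# zip() pulls from the SAME iterator twice per step, so consecutive
# items are paired up as (header, value).
data = dict(zip(cells, cells))
print(data)
```

This only works because both arguments are the same iterator object; passing a list twice would pair each item with itself instead.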

CodePudding user response:

You do not need all these loops. Try changing your strategy for selecting elements: take a look at CSS selectors to focus your process.

To get the product information, you could pass dict() a generator expression that iterates over all the rows of the table and creates key/value pairs from each row's stripped_strings, which yields the texts of that row:

dict((row.stripped_strings) for row in soup2.select('table tr'))
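
To try this without hitting the network, here is a small sketch run against an inline snippet. The HTML is a simplified assumption of what the product table looks like, not a copy of the real page, and info is just an illustrative name:

```python
from bs4 import BeautifulSoup

# Simplified, made-up stand-in for the product information table.
html = """
<table class="table table-striped">
  <tr><th>UPC</th><td>e00eb4fd7b871a48</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Tax</th><td>£0.00</td></tr>
</table>
"""
soup2 = BeautifulSoup(html, 'html.parser')

# Each row yields exactly two stripped strings (header, value),
# which dict() accepts as one key/value pair.
info = dict(row.stripped_strings for row in soup2.select('table tr'))
print(info)
```

Note that this relies on every row containing exactly two text fragments; a row with an empty cell or nested markup would make dict() raise a ValueError.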

Select the description based on the id of its previous sibling:

soup2.select_one('#product_description + p').get_text()
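
As a sketch of how one paragraph can be told apart from the others, the snippet below uses the CSS adjacent-sibling combinator (+) and, as an alternative, a positional index. The HTML is a simplified assumption of the page layout, and the paragraph texts are placeholders:

```python
from bs4 import BeautifulSoup

# Simplified, made-up layout: several <p> elements inside one <article>.
html = """
<article class="product_page">
  <p class="price_color">£47.82</p>
  <div id="product_description"><h2>Product Description</h2></div>
  <p>The description text.</p>
  <p>Another paragraph.</p>
</article>
"""
soup2 = BeautifulSoup(html, 'html.parser')

# "#product_description + p" matches only the <p> that comes
# immediately after the element with id="product_description".
description = soup2.select_one('#product_description + p').get_text(strip=True)

# Alternative: collect every <p> and pick one by position.
all_paragraphs = [p.get_text(strip=True)
                  for p in soup2.select('article.product_page p')]

print(description)
print(all_paragraphs[1])
```

Selecting by a neighbouring id tends to survive layout changes better than a positional index, which breaks as soon as a paragraph is added or removed.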

Note: In new code, avoid the old findAll() syntax and use find_all() instead. For more, take a minute to check the docs.
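
A related caveat about building the absolute URLs: str.strip('../../../') removes a set of characters ('.' and '/') from both ends rather than a literal prefix. It happens to work for these links, but urljoin is the safer choice. A small sketch, where both relative paths are hypothetical examples and not taken from the site:

```python
from urllib.parse import urljoin

base = 'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html'

# Hypothetical relative link of the kind found in the category page.
href = '../../../some-book_1/index.html'

# urljoin resolves the "../" segments against the page URL.
full = urljoin(base, href)
print(full)

# strip() treats its argument as a character set, so a path that
# starts or ends with '.' or '/' gets mangled.
mangled = '../../../.hidden/page/'.strip('../../../')
print(mangled)
```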

Example

Scraped results are stored in books as a list of dictionaries, so you work with a structure that can easily be iterated over or converted into a DataFrame, CSV, ...

import requests
from bs4 import BeautifulSoup
    
url='http://books.toscrape.com/catalogue/category/books/mystery_3/index.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

books = []

for a in soup.select('article h3 a') :

    r=requests.get("http://books.toscrape.com/catalogue/" + a.get('href').strip('../../../'))
    soup2 = BeautifulSoup(r.content,'html.parser')

    d=dict((row.stripped_strings) for row in soup2.select('table tr'))
    d['description'] = soup2.select_one('#product_description + p').get_text()
    ### d['title'] = soup2.h1.get_text()
    ### ... whatever information you want to add
    books.append(d)

books

Output

[{'UPC': 'e00eb4fd7b871a48',
  'Product Type': 'Books',
  'Price (excl. tax)': '£47.82',
  'Price (incl. tax)': '£47.82',
  'Tax': '£0.00',
  'Availability': 'In stock (20 available)',
  'Number of reviews': '0',
  'description': 'WICKED above her hipbone, GIRL across her heart Words are like a road map to reporter Camille Preaker’s troubled past. Fresh from a brief stay at a psych hospital, Camille’s first assignment from the second-rate daily paper where she works brings her reluctantly back to her hometown to cover the murders of two preteen girls. NASTY on her kneecap, BABYDOLL on her leg Since WICKED above her hipbone, GIRL across her heart Words are like a road map to reporter Camille Preaker’s troubled past. Fresh from a brief stay at a psych hospital, Camille’s first assignment from the second-rate daily paper where she works brings her reluctantly back to her hometown to cover the murders of two preteen girls. NASTY on her kneecap, BABYDOLL on her leg Since she left town eight years ago, Camille has hardly spoken to her neurotic, hypochondriac mother or to the half-sister she barely knows: a beautiful thirteen-year-old with an eerie grip on the town. Now, installed again in her family’s Victorian mansion, Camille is haunted by the childhood tragedy she has spent her whole life trying to cut from her memory. HARMFUL on her wrist, WHORE on her ankle As Camille works to uncover the truth about these violent crimes, she finds herself identifying with the young victims—a bit too strongly. Clues keep leading to dead ends, forcing Camille to unravel the psychological puzzle of her own past to get at the story. Dogged by her own demons, Camille will have to confront what happened to her years before if she wants to survive this homecoming.With its taut, crafted writing, Sharp Objects is addictive, haunting, and unforgettable. ...more'},
 {'UPC': '19ed25f4641d5efd',
  'Product Type': 'Books',
  'Price (excl. tax)': '£19.63',
  'Price (incl. tax)': '£19.63',
  'Tax': '£0.00',
  'Availability': 'In stock (18 available)',
  'Number of reviews': '0',
  'description': "In a dark, dark wood Nora hasn't seen Clare for ten years. Not since Nora walked out of school one day and never went back. There was a dark, dark houseUntil, out of the blue, an invitation to Clare’s hen do arrives. Is this a chance for Nora to finally put her past behind her?And in the dark, dark house there was a dark, dark roomBut something goes wrong. Very wrong.And i In a dark, dark wood Nora hasn't seen Clare for ten years. Not since Nora walked out of school one day and never went back. There was a dark, dark houseUntil, out of the blue, an invitation to Clare’s hen do arrives. Is this a chance for Nora to finally put her past behind her?And in the dark, dark house there was a dark, dark roomBut something goes wrong. Very wrong.And in the dark, dark room.... Some things can’t stay secret for ever. ...more"},...]
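
Since books is a list of dictionaries, turning it into CSV needs only the standard library. A minimal sketch, assuming every dictionary has the same keys; the sample below is trimmed to three fields from the output above, and it writes to an in-memory buffer rather than a file:

```python
import csv
import io

# Trimmed sample of the scraped structure (three fields per book).
books = [
    {'UPC': 'e00eb4fd7b871a48', 'Product Type': 'Books', 'Price (incl. tax)': '£47.82'},
    {'UPC': '19ed25f4641d5efd', 'Product Type': 'Books', 'Price (incl. tax)': '£19.63'},
]

buffer = io.StringIO()

# Column order follows the key order of the first dictionary.
writer = csv.DictWriter(buffer, fieldnames=list(books[0].keys()))
writer.writeheader()
writer.writerows(books)

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, replace the buffer with open('books.csv', 'w', newline='', encoding='utf-8'), where the file name is only an example.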