How to scrape the product information from the page

Time:10-10

I'm trying to scrape the technical details table from the product information section, but my code gives me an empty list. The page whose table I'm trying to scrape is https://www.amazon.com/Hammermill-Letter-Bright-Sheets-113640C/dp/B072FVQNWM/ref=sr_1_6?dchild=1&qid=1633771276&s=office-products&sr=1-6

import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin
base_url='https://www.amazon.com'
productlinks=[]
results = [] 
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36','session':'141-2320098-4829807'}
# Note: this dict is defined but never passed to requests.get()
cookies = {'session': '17ab96bd8ffbe8ca58a78657a918558'}
r = requests.get('https://www.amazon.com/s?rh=n:1069242&fs=true&ref=lp_1069242_sar', headers = headers)
soup = BeautifulSoup(r.content, 'lxml')
for link in soup.find_all('a',class_="a-link-normal s-underline-text s-underline-link-text a-text-normal",href=True):
    p=link['href']
    l=urljoin(base_url,p)
    productlinks.append(l)
    
results = []    
for link in productlinks:
        r =requests.get(link,headers=headers)
        soup=BeautifulSoup(r.content, 'html.parser')
        try:
            for tr in soup.find('table', id='productDetails_techSpec_section_1').find_all('tr') :
                print(tr.text.strip())
                results.append(tr.text.strip())
        except:
            continue
print(results)

CodePudding user response:

This is the output I get:

['ManufacturerAmazon Basics', 'BrandAmazon Basics', 'Item Weight41.6 pounds', 'Product Dimensions18 x 11.8 x 9 inches', 'Item model numberAMZN8RM', 'ColorWhite', 
'Material TypePaper', 'Number of Items8', 'Size8 Reams | 4000 Sheets', 'Sheet Size8.5-x-11-inch', 'Brightness Rating92', 'Paper Weight20', 'Paper FinishSmooth', 'Manufacturer Part NumberAMZN8RM', 'ManufacturerZebra Pen Corporation', 'BrandZebra Pen', 'Item Weight0.336 ounces', 'Product Dimensions1.1 x 6.5 x 7.5 inches', 'Item model number22218', 'Is Discontinued By ManufacturerNo', 'ColorBlack', 'ClosureRetractable', 'Grip TypeRubber', 'Material TypePlastic, Metal, Rubber', 'Number of Items18', 'Size18-Pack', 'Point TypeMedium', 'Line Size1.00 Pen', 'Ink ColorBlack', 'Manufacturer Part Number22218', 'Manufacturer3M Office Products', 'BrandScotch', 'Item Weight3.68 ounces', 'Product Dimensions7.8 x 7.1 x 3 inches', 'Item model number142-6', 'Is Discontinued By ManufacturerNo', 'ColorClear', 'Material TypeSynthetic Rubber Resin', 'Number of Items1', 'Size6 Count', 'Manufacturer Part Number142-6', 'National Stock Number6520-01-356-3964, 5970-01-137-7860, 7530-00-598-7711', 'ManufacturerInternational Paper (Office)', 'BrandHammermill', 'Item Weight40 pounds', 'Product Dimensions17.25 x 11.75 x 8.25 inches', 'Item model number113640C', 'Is Discontinued By ManufacturerNo', 'Color8 Ream | 4000 Sheets', 'Cover MaterialPaper', 'Material TypePaper', 'Number of Items8', 'Size8 Ream | 4000 Sheets', 'Sheet Size8.5 x 11', 'Brightness Rating92', 'Paper Weight20', 'Paper FinishSmooth', 'Manufacturer Part Number113640C', 'ManufacturerNewell Rubbermaid Office', 'BrandEXPO', 'Item Weight2.4 ounces', 'Product Dimensions5.5 x 6.25 x 4.02 inches', 'Item model number1884309', 'Is Discontinued By ManufacturerNo', 'ColorAssorted', 'Grip TypeThumb', 'Material TypePlastic', 'Number of Items1', 'Size8-Count', 'Point TypeUltra Fine', 'Line Size0.5mm millimeters', 'Ink ColorMulticolor', 'Tip TypeFine point', 'Manufacturer Part Number1884309', 'Manufacturer3M Office Products', 'BrandScotch', 'Item Weight3.06 pounds', 'Product Dimensions0.75 
x 8.9 x 11.4 inches', 'Item model numberTP3854-100', 'Is Discontinued By ManufacturerNo', 'ColorClear', 'Material TypeLaminate', 'Number of Items1', 'PackagingRetail', 'Size100-Pack', 'Paper FinishGlossy', 'Manufacturer Part NumberTP3854-100', 'ManufacturerScotch', 'BrandScotch', 'Item Weight10.6 ounces', 'Product Dimensions4.2 x 6.4 x 3.05 inches', 'Item model number6122', 'Is Discontinued By ManufacturerNo', 'ColorTransparent', 'Material TypePlastic', 'Number of Items1', 'Size6 Rolls', 'Manufacturer Part Number6122', 'Manufacturer\tGorilla Glue', 'Part Number\t7700104', 'Item Weight1.5 ounces', 'Product Dimensions1.25 x 3.38 x 6.63 inches', 'Item model number7700104', 'Is Discontinued By ManufacturerNo', 'Size1 Pack', 'ColorClear', 'Style1 - Pack', 'PatternSuper Glue', 'Item Package Quantity1', 'Included Components1 bottle glue', 'Batteries Included?No', 'Batteries Required?No', 'Warranty DescriptionNo', 'Manufacturer0', 'BrandSHARPIE', 'Item Weight3.2 ounces', 'Product Dimensions1 x 1 x 1 inches', 'Item model number30001', 'Is Discontinued By ManufacturerNo', 'ColorBlack (Box)', 'Material TypeAluminum', 'Number of Items1', 'Size12-Count', 'Point TypeFine', 'Line Size0.3mm', 'Ink ColorBlack', 'Tip TypeFine', 'Manufacturer Part NumberSAN30001', 'National Stock Number7520-00-904-1265', 'Manufacturer0', 'BrandSHARPIE', 'Item Weight3.2 ounces', 'Product Dimensions1 x 1 x 1 inches', 'Item model number30001', 'Is Discontinued By ManufacturerNo', 'ColorBlack (Box)', 'Material TypeAluminum', 'Number of Items1', 'Size12-Count', 'Point TypeFine', 'Line Size0.3mm', 'Ink ColorBlack', 'Tip TypeFine', 'Manufacturer Part NumberSAN30001', 'National Stock Number7520-00-904-1265', 'ManufacturerAimoh', 'BrandAimoh', 'Item Weight1.4 pounds', 'Product Dimensions9.7 x 4.3 x 2.2 inches', 'Item model number34100', 'Is Discontinued By ManufacturerNo', 'ColorWhite', 'ClosureSelf-Seal', 'Material TypePaper', 'Size100 Ct.', 'Sheet Size4.125-x-9.5-inch', 'Paper Weight24', 'Paper FinishWove', 
'Manufacturer Part Number34100', 'ManufacturerHP Papers', 'BrandHP Papers', 'Item Weight15 pounds', 'Product Dimensions11 x 8.5 x 6.25 inches', 'Item model number112090', 'Is Discontinued By ManufacturerNo', 'Material TypePaper', 'Number of Items1', 'Size3 Ream | 1500 Sheets', 'Brightness Rating92', 'Paper Weight20', 'Paper FinishSmooth', 'Manufacturer Part Number112090', 'ManufactureriBayam', 'BrandIBayam', 'Item Weight3.84 ounces', 'Product Dimensions6.6 x 6.2 x 0.6 inches', 'Item model number18 Pack', 'Is Discontinued By ManufacturerNo', 'ColorBlack, Grey, Red, Blue, Magenta, Pink, 
Purple, Violet, Pale Yellow, Yellow, Orange, Raw Sienna, Sap Green, C Green, O Green, Lake Blue, Burnt Sienna, Crimson', 'ClosurePush Button', 'Grip TypeContoured', 'Material TypePlastic', 'Number of Items18', 'Size18 Unique Colors', 'Point TypeFine', 'Manufacturer Part Number61', 'ManufacturerAmazon Basics', 'BrandAmazon 
Basics', 'Item Weight6.7 ounces', 'Product Dimensions7.4 x 0.3 x 0.3 inches', 'Item model numberPHB-30', 'ColorYellow', 'Pencil Lead Degree (Hardness)HB', 'Material TypeWood', 'Number of Items30', 'Size30 Count (Pack of 1)', 'Point TypeMedium', 'Manufacturer Part NumberPHB-30', 'ManufacturerInternational Paper (Office)', 'BrandHammermill', 'Item Weight15 pounds', 'Product Dimensions11.25 x 8.75 x 6.25 inches', 'Item model number113620', 'Is Discontinued By ManufacturerNo', 'Material TypePaper', 'Number of Items3', 'Size3 Ream | 1500 Sheets', 'Sheet Size8.5 x 11', 'Brightness Rating92', 'Paper Weight20', 'Paper FinishSmooth', 'Manufacturer Part Number113620', 'Manufacturer\tiBayam', 'Part Number\t5234', 'Item Weight1.44 ounces', 'Product Dimensions4 x 3 x 0.6 inches', 'Item model number2 Pack', 'ColorPink & Black', 'MaterialFiberglass', 'Item Package Quantity1', 'Plug ProfileSewing', 'Batteries Included?No', 'Batteries Required?No', 'ManufacturerHewlett Packard SOHO Consumables', 'BrandHP Papers', 'Item Weight6 pounds', 'Product Dimensions11 x 8.5 x 12 inches', 'Item model number203000', 'Is Discontinued By ManufacturerNo', 'ColorWhite', 'Number of Items1', 'Size1 Ream | 500 Sheets', 'Sheet Size8.5 x 11 inch', 'Brightness Rating97', 'Paper Weight24', 'Paper FinishMatte', 'Manufacturer Part Number203000']

I just appended all the data to the results list and printed it. I put the for loop that reads all the tr elements inside a try/except, since some of the pages in productlinks have no such table, in which case soup.find() returns None:

[...]
results = []    
for link in productlinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    try:
        for tr in soup.find('table', id='productDetails_techSpec_section_1').find_all('tr'):
            # Drop the separator between the label and the value
            res = "".join(tr.text.strip().split("\n\n\n\u200e"))
            print(res)
            results.append(res)
    except AttributeError:
        # soup.find() returned None: this page has no technical details table
        continue
        
print(results)
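
The split on "\n\n\n\u200e" above depends on the exact separator the page happens to emit. As a rough alternative, here is a minimal sketch that separates the label from the value and removes the invisible direction marks. The helper names clean_cell and split_row are hypothetical, and it assumes the stripped row text starts with the label on its own line, as the separator above suggests:

```python
import re

# Left-to-right / right-to-left marks (U+200E, U+200F) that are invisible in output
_DIRECTION_MARKS = re.compile("[\u200e\u200f]")


def clean_cell(text):
    """Remove direction marks and collapse runs of whitespace to single spaces."""
    text = _DIRECTION_MARKS.sub("", text)
    return " ".join(text.split())


def split_row(row_text):
    """Split a row's text into (label, value) at the first newline.

    Assumption: the label comes first and is followed by at least one newline.
    """
    label, _, value = row_text.strip().partition("\n")
    return clean_cell(label), clean_cell(value)


# Hypothetical row text, shaped like tr.text for one table row
print(split_row("\nBrand\n\n\n\u200eHammermill\n"))
```

Inside the loop, `results.append(split_row(tr.text))` would then store (label, value) tuples instead of one fused string.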

CodePudding user response:

I've provided a working solution below.

I find the table element by its id attribute (you can use Chrome developer tools to inspect the HTML).

After finding the table, we iterate over all of its rows. The first column's data is in the th tag and the second column's data is in the td tag. We extract the text from each and strip any newlines, then save the results in a dictionary named results.

Finally, we iterate through the results dictionary to list the technical details as requested.

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36','session':'141-2320098-4829807'}
# Provided URL
url =r'https://www.amazon.com/dp/B072FVQNWM'
HTMLpage = requests.get(url, headers=headers)

# Parsing the page
soup = BeautifulSoup(HTMLpage.content, 'html.parser')
# Finding the technical details table
tech_table = soup.find('table', id='productDetails_techSpec_section_1')
# All rows in the table
rows = tech_table.find_all('tr')
results = {"id":[],"val":[]}
for r in rows:
    # Access the th tag and return its text (named key to avoid shadowing the built-in id)
    key = r.th.text.strip('\n')
    # Access the td tag and return its text
    val = r.td.text.strip('\n')
    # Save to results
    results['id'].append(key)
    results['val'].append(val)

# print output
for x in range(len(results['id'])):
    print(f'{results["id"][x]}: {results["val"][x]}')
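
If the table is missing from the response (for example, when the site serves a different page than the one seen in the browser), soup.find() returns None and the code above raises an AttributeError. The following is only a sketch of the same th/td approach with that case guarded and the rows collected into a plain dict. The function name parse_tech_table and the sample markup are made up for illustration; only the table id comes from the code above:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded product page
SAMPLE_HTML = """
<table id="productDetails_techSpec_section_1">
  <tr><th> Brand </th><td> \u200eHammermill </td></tr>
  <tr><th> Paper Weight </th><td> \u200e20 </td></tr>
</table>
"""


def parse_tech_table(html, table_id="productDetails_techSpec_section_1"):
    """Return {label: value} for the table, or an empty dict if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        # No such table in this HTML: report nothing instead of raising
        return {}
    details = {}
    for row in table.find_all("tr"):
        if row.th is None or row.td is None:
            continue  # skip rows that lack a label cell or a value cell
        label = row.th.get_text(strip=True)
        # Remove the invisible left-to-right mark that precedes the value
        value = row.td.get_text(strip=True).replace("\u200e", "").strip()
        details[label] = value
    return details


for label, value in parse_tech_table(SAMPLE_HTML).items():
    print(f"{label}: {value}")
```

With a real page, `parse_tech_table(HTMLpage.content)` could replace the loop above; an empty dict then signals that the table was not in the HTML that was actually returned.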