Problem scraping table on a website with BeautifulSoup and puppeteer

Time:08-29

I am trying to scrape the table contents (class = "technical-table") of this webpage:

https://www.bechtle.com/shop/lenovo-thinkpad-thunderbolt-3-dock-gen2--4310627--p

I have used BeautifulSoup (Python) and also Puppeteer (Node.js), but I am getting blank records with both. If I run JS in the Google Chrome browser, however, I am able to scrape it.

Here is my code:

Python:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.bechtle.com/shop/lenovo-thinkpad-thunderbolt-3-dock-gen2--4310627--p'
    page = requests.get(url)
    #time.sleep(10)
    soup = BeautifulSoup(page.content, "html.parser")
    #time.sleep(10)
    results = soup.find("table", { "class" : "headline-container" })

Node:

const puppeteer = require('puppeteer');

let scrape = async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.bechtle.com/shop/lenovo-thinkpad-thunderbolt-3-dock-gen2--4310627--p');

    page.evaluate(() => {
        let elements = document.querySelector("#menu1 > div > table")
        return elements
    });
};

scrape().then((value) => {
    console.log(value);
});

I read through earlier posts and tried the lxml parser, added waiting times to let the page load fully, etc., but nothing seems to work. Any advice or direction is much appreciated :)

CodePudding user response:

You tagged this question python, so here is one way to obtain that data, by querying the API that the page's JavaScript calls to fill in the information:

import requests
import pandas as pd

url = 'https://www.bechtle.com/shop/lenovo-thinkpad-thunderbolt-3-dock-gen2--4310627--p/getProductDetails'

r = requests.get(url)
df = pd.json_normalize(r.json()['data']['featureGroups'])
print(df)

This will return a dataframe:

code    name    features
0   PRODUCT     Produkt     [{'code': 'PRODUCTNAME', 'name': 'Produktname', 'range': False, 'comparable': False, 'featureValues': [{'value': 'Lenovo ThinkPad Thunderbolt 3 Dock Gen2', 'valuePosition': 1}], 'sortingPosition': 1}, {'code': 'Typ', 'name': 'Produkttyp', 'range': False, 'comparable': False, 'featureValues': [{'value': 'Docking', 'valuePosition': 870}], 'sortingPosition': 870, 'topPriority': 0, 'categoryProperty': False}]
1   KONNEKTIVITAET  Konnektivität   [{'code': 'Anschluss', 'name': 'Anschlüsse', 'range': False, 'comparable': False, 'featureValues': [{'value': '1 x Thunderbolt 3 (Typ C)', 'valuePosition': 1}, {'value': '1 x USB 3.1 Typ A   Charge', 'valuePosition': 1}, {'value': '2 x DisplayPort', 'valuePosition': 2}, {'value': '2 x HDMI', 'valuePosition': 2}, {'value': '4 x USB 3.1 Typ A', 'valuePosition': 4}, {'value': '1 x Combo Mikrofon/Kopfhörer', 'valuePosition': 1}, {'value': '1 x RJ45', 'valuePosition': 1}], 'sortingPosition': 4, 'topPriority': 3, 'categoryProperty': False}]
2   SICHERHEIT  Sicherheit  [{'code': 'SicherheitsFeatures', 'name': 'Sicherheitsfunktionen', 'range': False, 'comparable': False, 'featureValues': [{'value': 'Kensington Standard Slot', 'valuePosition': 390}], 'sortingPosition': 390, 'topPriority': 0, 'categoryProperty': False}]
3   LEISTUNG    Leistung    [{'code': 'Auflösung KVM', 'name': 'Auflösung (bis zu)', 'range': False, 'comparable': False, 'featureValues': [{'value': '3.840 x 2.160 Pixel bei 60 Hz', 'valuePosition': 2500}], 'sortingPosition': 2500, 'topPriority': 2, 'categoryProperty': False}]
4   PHYSISCHEEIGENSCHAFTEN  Physische Eigenschaften     [{'code': 'Abmessungen (BxHxT)', 'name': 'Abmessungen (B x H x T)', 'range': False, 'comparable': False, 'featureValues': [{'value': '220 x 30 x 80 mm', 'valuePosition': 33000}], 'sortingPosition': 33000, 'topPriority': 0, 'categoryProperty': False}, {'code': 'Gewicht', 'name': 'Gewicht', 'range': False, 'comparable': False, 'featureValues': [{'value': '0,525 kg', 'valuePosition': 3690}], 'sortingPosition': 3690, 'topPriority': 0, 'categoryProperty': False}]
5   LIEFERUMFANG    Lieferumfang    [{'code': 'Lieferumfang', 'name': 'Lieferumfang', 'range': False, 'comparable': False, 'featureValues': [{'value': 'Netzteil (135 W)', 'valuePosition': 3920}, {'value': 'Thunderbolt (Typ C)-Kabel', 'valuePosition': 5440}], 'sortingPosition': 5440, 'topPriority': 0, 'categoryProperty': False}]
6   AUSFÜHRUNG  Ausführung  [{'code': 'Ausführung', 'name': 'Ausführung', 'range': False, 'comparable': False, 'featureValues': [{'value': 'Unspezifisch', 'valuePosition': 1}], 'sortingPosition': 1, 'topPriority': 0, 'categoryProperty': False}]
7   NaN     NaN     [{'code': 'Garantie', 'name': 'Herstellergarantie', 'range': False, 'comparable': False, 'featureValues': [{'value': '3 Jahre Bring-In (Details siehe Hersteller-Web-Site)', 'valuePosition': 33001}], 'sortingPosition': 33001, 'topPriority': 0, 'categoryProperty': False}]

You can drill down further into that JSON response. Relevant pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
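
To illustrate the drilling down, here is a minimal sketch that flattens the nested structure into one row per feature value. It assumes the response keeps the shape shown in the output above (groups containing `features`, each with `featureValues`). The helper name `flatten_feature_groups` and the trimmed `sample_groups` payload are made up for illustration; the sample values are copied from the output above.

```python
def flatten_feature_groups(groups):
    """Turn nested feature groups into flat rows: one row per feature value."""
    rows = []
    for group in groups:
        # Some groups have no name (row 7 in the output above), so use .get().
        group_name = group.get("name")
        for feature in group.get("features", []):
            for feature_value in feature.get("featureValues", []):
                rows.append({
                    "group": group_name,
                    "feature": feature.get("name"),
                    "value": feature_value.get("value"),
                })
    return rows


# Hypothetical, trimmed stand-in for r.json()['data']['featureGroups'].
sample_groups = [
    {"code": "SICHERHEIT", "name": "Sicherheit", "features": [
        {"code": "SicherheitsFeatures", "name": "Sicherheitsfunktionen",
         "featureValues": [{"value": "Kensington Standard Slot", "valuePosition": 390}]},
    ]},
    {"code": "LIEFERUMFANG", "name": "Lieferumfang", "features": [
        {"code": "Lieferumfang", "name": "Lieferumfang",
         "featureValues": [
             {"value": "Netzteil (135 W)", "valuePosition": 3920},
             {"value": "Thunderbolt (Typ C)-Kabel", "valuePosition": 5440},
         ]},
    ]},
    {"features": [
        {"code": "Garantie", "name": "Herstellergarantie",
         "featureValues": [{"value": "3 Jahre Bring-In (Details siehe Hersteller-Web-Site)",
                            "valuePosition": 33001}]},
    ]},
]

rows = flatten_feature_groups(sample_groups)
for row in rows:
    print(row["group"], "|", row["feature"], "|", row["value"])
```

With the real response, the same list of rows could be passed to `pd.DataFrame(rows)` to get a table with one line per specification value.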

CodePudding user response:

Your Python script looks correct to me, but I think you're looking for the wrong HTML element. I took a quick look at the HTML and you are right, the class is "technical-table". So I think the Python code might look like this:

table = soup.find("table", { "class" : "technical-table" })
content_cell = table.find_all("td", { "class" : "content-cell" })

The "find_all" function returns all "td" tags in the table that have the "content-cell" class. I'm showing you the approach, not the full answer. Try to think it through. Bye bye! Have a great day! ;)
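
As a rough sketch of how this lookup behaves, the snippet below runs the same `find` / `find_all` calls against a small static HTML string. The markup in `SAMPLE_HTML` (including the `label-cell` class) and the helper name `extract_content_cells` are assumptions for illustration, not the site's actual HTML. Since the other answer indicates the data is filled in by the page's JavaScript, the table may be missing from the HTML that `requests` receives, so the sketch also guards against `find` returning `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical static markup standing in for a fully rendered page.
SAMPLE_HTML = """
<table class="technical-table">
  <tr><td class="label-cell">Gewicht</td><td class="content-cell">0,525 kg</td></tr>
  <tr><td class="label-cell">Produkttyp</td><td class="content-cell">Docking</td></tr>
</table>
"""


def extract_content_cells(html):
    """Return the text of every td.content-cell inside table.technical-table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "technical-table"})
    if table is None:
        # No such table in this HTML, e.g. because it is rendered client-side.
        return []
    cells = table.find_all("td", {"class": "content-cell"})
    return [cell.get_text(strip=True) for cell in cells]


cells = extract_content_cells(SAMPLE_HTML)
empty = extract_content_cells("<html><body></body></html>")
print(cells)
print(empty)
```

If the second call's situation applies to the live page, the result is an empty list rather than an `AttributeError` from calling `find_all` on `None`.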
