I am trying to scrape the table contents (class = "technical-table") of this webpage:
https://www.bechtle.com/shop/lenovo-thinkpad-thunderbolt-3-dock-gen2--4310627--p
I have used BeautifulSoup (Python) and also Puppeteer (Node.js), but I am getting blank records with both. If I run JS in the Google Chrome browser, however, I am able to scrape it.
Here is my code:
Python:
import requests
from bs4 import BeautifulSoup

url = 'https://www.bechtle.com/shop/lenovo-thinkpad-thunderbolt-3-dock-gen2--4310627--p'
page = requests.get(url)
#time.sleep(10)
soup = BeautifulSoup(page.content, "html.parser")
#time.sleep(10)
results = soup.find("table", {"class": "headline-container"})
Node:
const puppeteer = require('puppeteer');

let scrape = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.bechtle.com/shop/lenovo-thinkpad-thunderbolt-3-dock-gen2--4310627--p');
  // Run the selector inside the page and keep whatever comes back
  const result = await page.evaluate(() => {
    let elements = document.querySelector("#menu1 > div > table");
    return elements;
  });
  await browser.close();
  return result;
};

scrape().then((value) => {
  console.log(value);
});
I read through earlier posts and tried the lxml parser, adding wait times to let the page load fully, etc., but nothing seems to work. Any advice or direction is much appreciated :)
CodePudding user response:
You tagged this question python, so here is one way of obtaining that data: query the API that the page's JavaScript calls to fill in the information:
import requests
import pandas as pd
url = 'https://www.bechtle.com/shop/lenovo-thinkpad-thunderbolt-3-dock-gen2--4310627--p/getProductDetails'
r = requests.get(url)
df = pd.json_normalize(r.json()['data']['featureGroups'])
print(df)
This will return a dataframe:
code name features
0 PRODUCT Produkt [{'code': 'PRODUCTNAME', 'name': 'Produktname', 'range': False, 'comparable': False, 'featureValues': [{'value': 'Lenovo ThinkPad Thunderbolt 3 Dock Gen2', 'valuePosition': 1}], 'sortingPosition': 1}, {'code': 'Typ', 'name': 'Produkttyp', 'range': False, 'comparable': False, 'featureValues': [{'value': 'Docking', 'valuePosition': 870}], 'sortingPosition': 870, 'topPriority': 0, 'categoryProperty': False}]
1 KONNEKTIVITAET Konnektivität [{'code': 'Anschluss', 'name': 'Anschlüsse', 'range': False, 'comparable': False, 'featureValues': [{'value': '1 x Thunderbolt 3 (Typ C)', 'valuePosition': 1}, {'value': '1 x USB 3.1 Typ A Charge', 'valuePosition': 1}, {'value': '2 x DisplayPort', 'valuePosition': 2}, {'value': '2 x HDMI', 'valuePosition': 2}, {'value': '4 x USB 3.1 Typ A', 'valuePosition': 4}, {'value': '1 x Combo Mikrofon/Kopfhörer', 'valuePosition': 1}, {'value': '1 x RJ45', 'valuePosition': 1}], 'sortingPosition': 4, 'topPriority': 3, 'categoryProperty': False}]
2 SICHERHEIT Sicherheit [{'code': 'SicherheitsFeatures', 'name': 'Sicherheitsfunktionen', 'range': False, 'comparable': False, 'featureValues': [{'value': 'Kensington Standard Slot', 'valuePosition': 390}], 'sortingPosition': 390, 'topPriority': 0, 'categoryProperty': False}]
3 LEISTUNG Leistung [{'code': 'Auflösung KVM', 'name': 'Auflösung (bis zu)', 'range': False, 'comparable': False, 'featureValues': [{'value': '3.840 x 2.160 Pixel bei 60 Hz', 'valuePosition': 2500}], 'sortingPosition': 2500, 'topPriority': 2, 'categoryProperty': False}]
4 PHYSISCHEEIGENSCHAFTEN Physische Eigenschaften [{'code': 'Abmessungen (BxHxT)', 'name': 'Abmessungen (B x H x T)', 'range': False, 'comparable': False, 'featureValues': [{'value': '220 x 30 x 80 mm', 'valuePosition': 33000}], 'sortingPosition': 33000, 'topPriority': 0, 'categoryProperty': False}, {'code': 'Gewicht', 'name': 'Gewicht', 'range': False, 'comparable': False, 'featureValues': [{'value': '0,525 kg', 'valuePosition': 3690}], 'sortingPosition': 3690, 'topPriority': 0, 'categoryProperty': False}]
5 LIEFERUMFANG Lieferumfang [{'code': 'Lieferumfang', 'name': 'Lieferumfang', 'range': False, 'comparable': False, 'featureValues': [{'value': 'Netzteil (135 W)', 'valuePosition': 3920}, {'value': 'Thunderbolt (Typ C)-Kabel', 'valuePosition': 5440}], 'sortingPosition': 5440, 'topPriority': 0, 'categoryProperty': False}]
6 AUSFÜHRUNG Ausführung [{'code': 'Ausführung', 'name': 'Ausführung', 'range': False, 'comparable': False, 'featureValues': [{'value': 'Unspezifisch', 'valuePosition': 1}], 'sortingPosition': 1, 'topPriority': 0, 'categoryProperty': False}]
7 NaN NaN [{'code': 'Garantie', 'name': 'Herstellergarantie', 'range': False, 'comparable': False, 'featureValues': [{'value': '3 Jahre Bring-In (Details siehe Hersteller-Web-Site)', 'valuePosition': 33001}], 'sortingPosition': 33001, 'topPriority': 0, 'categoryProperty': False}]
You can drill down further into that JSON response. Relevant pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
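As a rough sketch of what drilling down could look like, the helper below turns each feature into one row holding its group name, feature name, and joined values. The function name `flatten_feature_groups` and the `sample_groups` data are made up for illustration. The sample is a trimmed copy of two rows from the output above, and it assumes the live response has the same shape, including groups that lack a `name`.

```python
def flatten_feature_groups(feature_groups):
    """Return one dict per feature: group name, feature name, joined values."""
    rows = []
    for group in feature_groups:
        for feature in group.get("features", []):
            values = [fv.get("value", "") for fv in feature.get("featureValues", [])]
            rows.append({
                "group": group.get("name"),  # None when the group has no name
                "feature": feature.get("name"),
                "values": "; ".join(values),
            })
    return rows


# Hypothetical sample: two groups trimmed from the printed dataframe above.
# The second group has no 'code' or 'name', like row 7 in that output.
sample_groups = [
    {"code": "LIEFERUMFANG", "name": "Lieferumfang", "features": [
        {"code": "Lieferumfang", "name": "Lieferumfang", "featureValues": [
            {"value": "Netzteil (135 W)", "valuePosition": 3920},
            {"value": "Thunderbolt (Typ C)-Kabel", "valuePosition": 5440},
        ]},
    ]},
    {"features": [
        {"code": "Garantie", "name": "Herstellergarantie", "featureValues": [
            {"value": "3 Jahre Bring-In (Details siehe Hersteller-Web-Site)",
             "valuePosition": 33001},
        ]},
    ]},
]

rows = flatten_feature_groups(sample_groups)
for row in rows:
    print(row)
```

With the live response you would pass `r.json()['data']['featureGroups']` instead of the sample, and `pd.DataFrame(rows)` gives a flat table if you prefer to stay in pandas.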
CodePudding user response:
Your Python script looks correct to me, but I think you're searching for the wrong HTML element: your code looks up "headline-container". I took a quick look at the HTML and you are right, the table's class is "technical-table". So I think the Python code might look like this:
table = soup.find("table", { "class" : "technical-table" })
content_cell = table.find_all("td", { "class" : "content-cell" })
The "find_all" function returns all "td" tags in the table that have the "content-cell" class. I'm showing you the approach, not the complete answer, so try to work out the rest yourself. Bye! Have a great day! ;)
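Here is a minimal sketch of that find / find_all pattern, run on a small static HTML string rather than the live page. The markup and the "label-cell" class are invented for illustration. Only "technical-table" and "content-cell" come from this thread. The `None` check matters because `soup.find` returns `None` when the table is absent from the downloaded HTML, which may well be the case here if the page's JavaScript fills the table in, as the other answer suggests.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML standing in for the fully rendered table.
html = """
<table class="technical-table">
  <tr><td class="label-cell">Gewicht</td><td class="content-cell">0,525 kg</td></tr>
  <tr><td class="label-cell">Produkttyp</td><td class="content-cell">Docking</td></tr>
</table>
"""


def extract_content_cells(markup):
    """Return the text of every td.content-cell inside table.technical-table."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table", {"class": "technical-table"})
    if table is None:
        # Table not present in this HTML (for example, rendered later by JS).
        return []
    cells = table.find_all("td", {"class": "content-cell"})
    return [td.get_text(strip=True) for td in cells]


cells = extract_content_cells(html)
print(cells)

# Markup without the table yields an empty list instead of an AttributeError.
empty = extract_content_cells("<div>no table here</div>")
print(empty)
```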