I would like to scrape the following data table from a website:
<body style="background-color:grey;">
<div id="myTable" style="display: table;">
<div style="background-color: #4CAF50; color: white;">
<div >Nickname</div>
<div >Server IP</div>
<div >IP</div>
<div >Region</div>
<div >Country</div>
<div >City</div>
<div >Score <input type="checkbox" onchange="mysrt(this)" id="chkscr"></div>
<div >Update Time <input type="checkbox" onchange="mysrt(this)" id="chkupd" checked="" disabled="">
</div>
<div >Auth Key</div>
<div >Key Owner</div>
<div >Version</div>
<div >Details</div>
</div>
<div >
<div >Player 1</div>
<div >_GAME_MENU_</div>
<div >x.x.226.35</div>
<div >North America</div>
<div >United States</div>
<div >Cleveland</div>
<div >21</div>
<div >2022-12-29 10:17:01 (GMT-8)</div>
<div >SecretauthK3y</div>
<div >CoolName</div>
<div >7.11</div>
<div >FPS: 93 @ 0(0) ms @ 0 K/m</div>
</div>
<div >
<div >PlayerB</div>
<div >_GAME_MENU_</div>
<div >x.x.90.221</div>
<div >North America</div>
<div >United States</div>
<div >Mechanicsville</div>
<div >67991</div>
<div >2022-12-29 10:16:56 (GMT-8)</div>
<div >SecretauthK3y2</div>
<div >PlayerB</div>
<div >7.12</div>
<div >FPS: 50 @ 175(243) ms @ 0 K/m</div>
</div>
<div >
<div >McChicken</div>
<div >_GAME_MENU_</div>
<div >x.x.39.80</div>
<div >North America</div>
<div >United States</div>
<div ></div>
<div >0</div>
<div >2022-12-29 09:41:44 (GMT-8)</div>
<div >SecretauthK3y3</div>
<div >SOLO KEY</div>
<div >7.12</div>
<div >FPS: 63 @ 0(0) ms @ 0 K/m</div>
</div>
</div>
It has a header row under .tr, and each row of data is represented by a div with .tr mytarget. Normally there are hundreds more .tr_mytarget rows, all in the same format as the three shown. My goal is to scrape this data in a way that makes it easy to run calculations and filters on it. It will eventually be reused in a new data table.
I have a small amount of experience with JS, so my idea was to use Puppeteer. My question is twofold: in what format should I scrape the data so that it is convenient to work with, and how do I write the Puppeteer statements to do this?
This is what I have so far:
import puppeteer from 'puppeteer';

(async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('redactedurl.com');
  await page.waitForSelector('#myTable');
  const nicks = await page.$$eval('.table .tr_mytarget .td_tnic', allNicks => allNicks.map(td_tnick => td_tnick.textContent));
  console.log(nicks);
  await browser.close();
})();
I don't fully understand how to write the $$eval statement. Also, I'm thinking I will want one array for the header and one for the data, but I'm not sure. What's recommended? Thanks in advance for any advice.
Reference: https://pptr.dev/api/puppeteer.page._eval/
CodePudding user response:
To extract the data from the table in a structured format, you can use the following approach:
1. Extract the header row from the table and use it to create an array of column names.
2. Iterate over the rows with the mytarget class and extract the data from each cell. Use the column names to create an object for each row, with the column names as the keys and the cell data as the values.
3. Push each row object into an array to create a final array of objects that represents the data in the table.
Here is an example of how you could do this:
const puppeteer = require('puppeteer');

async function scrapeTable() {
  // Launch a new browser instance
  const browser = await puppeteer.launch();
  // Create a new page
  const page = await browser.newPage();
  // Navigate to the page with the table
  await page.goto('http://example.com/table-page');
  // Extract the data from the table (this callback runs inside the page)
  const data = await page.evaluate(() => {
    // Extract the header row; trim() drops the whitespace left around names
    // such as "Score", whose cell also contains a checkbox
    const headerRow = document.querySelector('.table .tr');
    const columnNames = Array.from(headerRow.querySelectorAll('.td')).map(cell => cell.textContent.trim());
    // Extract the data rows
    const dataRows = document.querySelectorAll('.table .tr.mytarget');
    const data = [];
    for (const row of dataRows) {
      // Extract the data from each cell, keyed by its column name
      const cells = row.querySelectorAll('.td');
      const rowData = {};
      for (let i = 0; i < cells.length; i++) {
        rowData[columnNames[i]] = cells[i].textContent.trim();
      }
      data.push(rowData);
    }
    return data;
  });
  console.log(data);
  // Close the browser
  await browser.close();
}

scrapeTable();
This code extracts the table into an array of objects, one per row. Each object has the column names as its keys and the text of the matching cells as its values.
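Since the goal is to run calculations and filters, keep in mind that every scraped value is a string. Below is a minimal sketch of converting the Score column to a number before working with it. It assumes the header text has been trimmed so the key is exactly `Score`; the sample rows are copied from the question, and `withNumericScore` is a hypothetical helper name, not part of Puppeteer.

```javascript
// Sample rows in the shape a scraper like the one above would return.
// Every value is a string because it comes from textContent.
const scraped = [
  { Nickname: 'Player 1', Country: 'United States', Score: '21', Version: '7.11' },
  { Nickname: 'PlayerB', Country: 'United States', Score: '67991', Version: '7.12' },
  { Nickname: 'McChicken', Country: 'United States', Score: '0', Version: '7.12' },
];

// Hypothetical helper: copy a row and turn its Score into a number.
function withNumericScore(row) {
  return { ...row, Score: Number(row.Score) };
}

const rows = scraped.map(withNumericScore);

// With numeric scores, filtering and aggregation are ordinary array operations.
const highScorers = rows.filter(row => row.Score > 100);
const totalScore = rows.reduce((sum, row) => sum + row.Score, 0);

console.log(highScorers.map(row => row.Nickname)); // [ 'PlayerB' ]
console.log(totalScore); // 68012
```

An array of objects like this is also easy to hand back to whatever renders the new table, since each row carries its own column names.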
I hope this helps!
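On the `$$eval` part of the question: `page.$$eval(selector, fn)` runs `document.querySelectorAll(selector)` inside the page and passes the matched elements to `fn` as an array, so the callback can use `map` directly. The following is only a sketch. The `#myTable > div` selector is an assumption based on the markup pasted in the question (the class names are not visible there), and `scrapeCells` and `toObjects` are hypothetical names.

```javascript
// Collects the trimmed text of every cell, row by row, using page.$$eval.
// `page` is a Puppeteer Page; the default selector is an assumption about the markup.
async function scrapeCells(page, rowSelector = '#myTable > div') {
  // The callback runs inside the page, so it may only return serializable
  // values (here: an array of arrays of strings).
  return page.$$eval(rowSelector, rows =>
    rows.map(row => Array.from(row.children, cell => cell.textContent.trim()))
  );
}

// Pairs the first row (the header) with every following row,
// producing one object per data row. Missing cells become ''.
function toObjects([headers, ...rows]) {
  return rows.map(cells =>
    Object.fromEntries(headers.map((name, i) => [name, cells[i] ?? '']))
  );
}
```

Inside the async function from the question, this could be used as `const data = toObjects(await scrapeCells(page));` after the `waitForSelector` call. Returning plain arrays from the page and building the objects in Node keeps the in-page callback small, which makes it easier to debug.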