Home > Back-end >  Puppeteer / Excel VBA: how to scrape table data from website
Puppeteer / Excel VBA: how to scrape table data from website

Time:12-16

I have used Excel VBA & IE to access website financial data tables for 15 years, but this is now obsolete. So I am trying to use Puppeteer on a Raspberry Pi to dump the website data to disk which Excel VBA on Win10 can then pick up and process. Getting the page data is easy enough, but how do I convert the page data, retrieved via page.content(), into a useful format? I'm very new to Puppeteer HTML etc..

 const browser = await puppeteer.launch({headless: false});
 const page = await browser.newPage();
 await page.goto('https://www.morningstar.co.uk/uk');
 await page.setViewport({width: 1000, height: 1000})
 wait page.goto (portfolio URL);
 const fs = require('fs');
 fs.writeFileSync('ms.txt', await page.content());

The relevant data in ms.txt looks like this:

<table cellspacing="0" border="0" id="ctl00_ctl00_MainContent_PM_MainContent_gv_Portfolio" style="width:100%;border-collapse:collapse;">
    <tbody><tr >
        <th  scope="col"><img  src="../../includes/images/arrow_asc_small.gif" align="middle" style="border-width:0px;"> <a href="javascript:__doPostBack('ctl00$ctl00$MainContent$PM_MainContent$gv_Portfolio','Sort$SecurityName')">Holding</a></th><th  scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$MainContent$PM_MainContent$gv_Portfolio','Sort$StarRatingM255')">Morningstar<br>Rating</a></th><th  scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$MainContent$PM_MainContent$gv_Portfolio','Sort$ClosePrice')">Current<br>Price</a></th><th  scope="col">&nbsp;</th><th  scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$MainContent$PM_MainContent$gv_Portfolio','Sort$ReturnD1')">Price<br>Change<br>%</a></th><th  scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$MainContent$PM_MainContent$gv_Portfolio','Sort$Weight')">Weight<br>%</a></th><th  scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$MainContent$PM_MainContent$gv_Portfolio','Sort$ClosePriceDate')">Date</a></th>

And the table elements:

</tr><tr >
<td  style="width:375px;"><a href="/uk/stockquicktake/default.aspx?id=0P0000M5OM">Burford Capital Ltd</a></td><td  style="width:95px;"><span>Not Rated</span></td><td title="09/12/2022"  style="width:75px;">7.1300</td><td  style="width:20px;">GBP</td><td  style="width:60px;">0.56</td><td  style="width:70px;">0.0000</td><td  style="width:80px;">09/12/2022</td>
</tr><tr >
<td  style="width:375px;"><a href="/uk/stockquicktake/default.aspx?id=0P00007YPZ">Centrica PLC</a></td><td  style="width:95px;"><img src="../../includes/images/5stars.gif" style="border-width:0px;"></td><td title="09/12/2022"  style="width:75px;">0.9224</td><td  style="width:20px;">GBP</td><td  style="color:Red;width:60px;">-0.30</td><td  style="width:70px;">0.0000</td><td  style="width:80px;">09/12/2022</td>
</tr><tr >

I have a slight preference for deciphering the web data in Puppeteer, but doing it in VBA is also fine if it is simpler. I do want it to be simple and reliable, and to work for another 15 years so I want to avoid unusual 3rd party add-ins. Using Selenium seems complicated and over the top for my purposes, although I am not wedded to Puppeteer if there is a simpler method.

CodePudding user response:

I agree puppeteer is simple and reliable.

This is demo page

enter image description here

Find selector

if you click F12 in Chrome, you can see the HTML source code.

And select Element tab and click this icon enter image description here

Then hove your mouse on the table, You can see the match and class name, id string and xpath of tag.

enter image description here

I will scrap Interim Result and Income Statement table by selector of querySelectorAll

First table

<div id="FinancialsInterimPrelimResultsHemscott" >
    <table >
        <caption >Interim Results </caption>
        <colgroup>
...
        </colgroup>
        <thead>
            <tr>
                <th  id="MsStockReportFiprhDp" scope="col"></th>
                <th scope="col"  id="MsStockReportFiprhY1">30/06/2021</th>
                <th scope="col"  id="MsStockReportFiprhY2">30/06/2022</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <th headers="MsStockReportFiprhDp" scope="row">Turnover</th>
                <td  headers="MsStockReportFiprhY1">64.23</td>
                <td  headers="MsStockReportFiprhY2">65.41</td>
            </tr>
...
        </tbody>
    </table>
</div>

It's tree of tag is

<div id="FinancialsInterimPrelimResultsHemscott">
  <table>
     <tbody>
        <tr>
          <td>64.23</td>

The tree of tag level and id can find 64.23 data

[id='FinancialsInterimPrelimResultsHemscott'] table tbody tr td

If call

document.querySelectorAll("[id='FinancialsInterimPrelimResultsHemscott'] table tbody tr td")

Will return all of table data. That is main Idea.

This is demo code

From this code, call getData() function with two parameter first is URL, it will what is looking for Web page second is URL, it will what is looking for table by ID.

const puppeteer = require("puppeteer");

async function getData(url, table_id) {
    try {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto(url);

        titles = await page.evaluate((table_id) => {
            const selector = `[id=\'${table_id}\'] table tbody tr td`
            return Array.from(document.querySelectorAll(selector),
                title => title.innerText.trim());
        }, table_id);

        await browser.close();
        return Promise.resolve(titles);
    } catch (error) {
        return Promise.reject(error);
    }
}

url = 'https://tools.morningstar.co.uk/uk/stockreport/default.aspx?tab=10&vw=fs&SecurityToken=0P0000M5OM]3]0]E0WWE$$ALL&Id=0P0000M5OM&ClientFund=0&CurrencyId=BAS'

getData(url,'FinancialsInterimPrelimResultsHemscott')
    .then((titles) => {
        console.log("FinancialsInterimPrelimResultsHemscott");
        console.log(titles);
    })

getData(url,'FinancialsIncomeStatementSummaryHemscott')
    .then((titles) => {
        console.log("FinancialsIncomeStatementSummaryHemscott");
        console.log(titles);
    })

Running and Result

First result is Income Statement Second table result is Interim Result The order is flipped due to async/await not guarantee is order call That is why I log the table ID in code.

$ node get-data.js
FinancialsIncomeStatementSummaryHemscott
[
  '319.11', '386.44', '326.42', '316.68',
  '-12.76', '-',      '-',      '-',
  '-',      '-',      '-',      '-',
  '-',      '-',      '-',      '-',
  '-',      '-',      '-',      '-',
  '249.18', '305.11', '194.21', '201.72',
  '-59.44', '249.30', '317.58', '180.80',
  '164.78', '-56.43', '249.30', '317.58',
  '180.80', '164.78', '-72.07', '229.46',
  '293.00', '152.37', '164.78', '-113.12',
  '1.20',   '1.50',   '0.82',   '0.75',
  '0.00'
]
FinancialsInterimPrelimResultsHemscott
[
  '64.23',  '65.41',
  '-20.41', '11.22',
  '-0.13',  '-0.10',
  '0.05',   '0.06'
]

Next Steps

You may needs those items

#1 What page's table want to scrap - it is hard question, I recommend enter image description here

Real Demo with Questioner HTML data.

Code

const puppeteer = require("puppeteer");

async function getData(url,header_class,  row_class) {
    try {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto(url);

        titles = await page.evaluate((header_class, row_class) => {
            // table header
            // <tr >
            // 'Holding\tMorningstar\nRating\tCurrent\nPrice\t \tPrice\nChange\n%\tWeight\n%\tDate'
            const headerSelector = `tr[class=\'${header_class}\']`
            headers = Array.from(document.querySelectorAll(headerSelector), row => row.innerText.trim());

            // table rows data
            // <tr >
            //   '3i Infrastructure Ord\tNot Rated\t3.2900\tGBP\t-0.30\t0.0000\t14/12/2022',
            const rowSelector = `tr[class=\'${row_class}\']`
            rows = Array.from(document.querySelectorAll(rowSelector), row => row.innerText.trim());

            // add header at the begin from rows data
            rows = headers.concat(rows)
            return rows.map((row) => {
                return row.split('\t');
            });
        }, header_class, row_class);

        await browser.close();
        return Promise.resolve(titles);
    } catch (error) {
        return Promise.reject(error);
    }
}

// modify your target URL
url = 'http://127.0.0.1:5500/MS-Portfolio.html'

getData(url,'gridHeader', 'gridItem')
    .then((rows) => {
        console.log(rows);
        console.log(rows.length);
    })

Result

$ node get-table.js
[
  [
    'Holding',
    'Morningstar\nRating',
    'Current\nPrice',
    ' ',
    'Price\nChange\n%',
    'Weight\n%',
    'Date'
  ],
  [
    '3i Infrastructure Ord',
    'Not Rated',
    '3.2900',
    'GBP',
    '-0.30',
    '0.0000',
    '14/12/2022'
  ],
  [
    'abrdn Asia Pacific Equity I Acc',
    '',
    '3.5291',
    'GBP',
    '0.28',
    '16.6667',
    '14/12/2022'
  ],
  [
    'Amati AIM VCT Ord',
    'Not Rated',
    '1.2650',
    'GBP',
    '0.00',
    '0.0000',
    '14/12/2022'
  ],
  [ 'Aviva PLC', '', '4.4740', 'GBP', '0.16', '0.0000', '14/12/2022' ],
  [
    'Baronsmead Second Venture Trust Ord',
    'Not Rated',
    '0.6300',
    'GBP',
    '0.00',
    '0.0000',
    '14/12/2022'
  ],
  [ 'Basf SE', '', '47.2300', 'EUR', '-0.63', '0.0000', '14/12/2022' ],
  [
    'BH Macro USD Ord',
    '',
    '46.7000',
    'USD',
    '-0.29',
    '0.0000',
    '14/12/2022'
  ],
  [
    'BlackRock World Mining Trust Ord',
    '',
    '6.8100',
    'GBP',
    '-1.16',
    '0.0000',
    '14/12/2022'
  ],
  [
    'British American Tobacco PLC',
    '',
    '32.6800',
    'GBP',
    '0.48',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Burford Capital Ltd',
    'Not Rated',
    '7.0400',
    'GBP',
    '-2.43',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Diageo PLC',
    '',
    '37.6450',
    'GBP',
    '0.13',
    '0.0000',
    '14/12/2022'
  ],
  [
    'E.ON SE',
    'Not Rated',
    '9.2860',
    'EUR',
    '1.38',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Fidelity Special Values Ord',
    'Not Rated',
    '2.7200',
    'GBP',
    '-0.55',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Gaming Realms PLC',
    'Not Rated',
    '0.2580',
    'GBP',
    '1.47',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Gold Bullion Securities',
    'Not Rated',
    '167.8900',
    'USD',
    '-0.09',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Haleon PLC',
    'Not Rated',
    '3.2110',
    'GBP',
    '0.60',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Henderson Diversified Income Ord',
    '',
    '0.6820',
    'GBP',
    '-0.87',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Herald Ord',
    '',
    '17.7600',
    'GBP',
    '-0.78',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Hotel Chocolat Group PLC',
    'Not Rated',
    '1.4000',
    'GBP',
    '-1.41',
    '0.0000',
    '14/12/2022'
  ],
  [
    'HSBC S&P 500 ETF GBP',
    '',
    '32.9555',
    'GBP',
    '-0.67',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Imperial Brands PLC',
    '',
    '20.3400',
    'GBP',
    '0.69',
    '0.0000',
    '14/12/2022'
  ],
  [
    'IP Group PLC',
    'Not Rated',
    '0.5995',
    'GBP',
    '-1.32',
    '0.0000',
    '14/12/2022'
  ],
  [
    'iShares $ Treasury Bd 20 y ETF USD Di...',
    '',
    '3.2425',
    'GBP',
    '0.42',
    '16.6667',
    '14/12/2022'
  ],
  [
    'iShares China Large Cap ETF USD Dist GBP',
    '',
    '66.2450',
    'GBP',
    '0.91',
    '0.0000',
    '14/12/2022'
  ],
  [
    'iShares Core FTSE 100 ETF GBP Dist',
    '',
    '7.3510',
    'GBP',
    '-0.09',
    '0.0000',
    '14/12/2022'
  ],
  [
    'iShares MSCI Brazil ETF',
    '',
    '26.4700',
    'USD',
    '-0.19',
    '0.0000',
    '14/12/2022'
  ],
  [
    'iShares MSCI EM ETF USD Dist GBP',
    '',
    '30.4413',
    'GBP',
    '0.80',
    '0.0000',
    '14/12/2022'
  ],
  [
    'iShares Physical Gold ETC',
    'Not Rated',
    '35.2850',
    'USD',
    '-0.27',
    '0.0000',
    '14/12/2022'
  ],
  [
    'ITV PLC',
    'Not Rated',
    '0.7668',
    'GBP',
    '0.08',
    '0.0000',
    '14/12/2022'
  ],
  [
    'JPMorgan Indian Ord',
    '',
    '8.3600',
    'GBP',
    '-0.24',
    '0.0000',
    '14/12/2022'
  ],
  [
    'MannKind Corp',
    'Not Rated',
    '4.9500',
    'USD',
    '1.15',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Merck & Co Inc',
    '',
    '111.5500',
    'USD',
    '1.16',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Mobeus Income & Growth 4 VCT Ord',
    'Not Rated',
    '0.7650',
    'GBP',
    '0.00',
    '0.0000',
    '14/12/2022'
  ],
  [
    'National Grid PLC',
    '',
    '10.2750',
    'GBP',
    '1.18',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Pantheon Infrastructure Ord',
    'Not Rated',
    '0.9440',
    'GBP',
    '0.85',
    '16.6667',
    '14/12/2022'
  ],
  [
    'PepsiCo Inc',
    '',
    '183.3600',
    'USD',
    '-0.35',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Picton Property Income Ltd',
    'Not Rated',
    '0.8230',
    'GBP',
    '-0.60',
    '0.0000',
    '14/12/2022'
  ],
  [
    'RWE AG Class A',
    '',
    '42.7300',
    'EUR',
    '1.05',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Samsung Electronics Co Ltd GDR',
    '',
    '1,163.0000',
    'USD',
    '-0.29',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Schroder UK Mid Cap Ord',
    '',
    '5.4500',
    'GBP',
    '0.18',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Shell PLC',
    '',
    '22.9600',
    'GBP',
    '-0.95',
    '0.0000',
    '14/12/2022'
  ],
  [ 'SSE PLC', '', '17.2550', 'GBP', '1.00', '0.0000', '14/12/2022' ],
  [
    'Standard Chartered PLC',
    '',
    '6.0660',
    'GBP',
    '-0.23',
    '0.0000',
    '14/12/2022'
  ],
  [
    'The Income & Growth VCT Ord',
    'Not Rated',
    '0.7600',
    'GBP',
    '0.00',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Tritax Big Box Ord',
    'Not Rated',
    '1.5180',
    'GBP',
    '1.81',
    '0.0000',
    '14/12/2022'
  ],
  [
    'UBS(Lux)FS MSCI EMU GBPH Adis',
    'Not Rated',
    '11.1290',
    'GBP',
    '-0.13',
    '0.0000',
    '14/12/2022'
  ],
  [
    'United Utilities Group PLC',
    '',
    '10.3650',
    'GBP',
    '0.24',
    '0.0000',
    '14/12/2022'
  ],
  [
    'VanEck JPM EM LC Bd ETF A USD',
    '',
    '53.7600',
    'USD',
    '-0.18',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Vanguard FTSE 100 UCITS ETF',
    '',
    '32.8900',
    'GBP',
    '-0.09',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Vanguard S&P 500 ETF USD Acc GBP',
    '',
    '60.4600',
    'GBP',
    '-0.67',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Viatris Inc',
    'Not Rated',
    '11.3200',
    'USD',
    '-0.25',
    '0.0000',
    '14/12/2022'
  ],
  [
    'Wells Fargo & Co',
    '',
    '42.1800',
    'USD',
    '-1.11',
    '0.0000',
    '14/12/2022'
  ]
]
53

  • Related