Extract data from HTML string in Javascript

Time:10-31

I have a Node.js script that reads HTML from a file as a string, and I would like to extract some data from it. My string (it is a string, not HTML) is as follows:

<tr><td style="text-align: center;">Initial Filing</td></tr>
                                        
<tr><td>Debtor</td></tr>

    <tr><td class="dName">PO</td></tr>
    <tr><td class="dAddress">CLACKAMAS OR 97015</td></tr>

<tr><td>Secured Party</td></tr>

    <tr><td class="spName">AS</td></tr>
    <tr><td class="spAddress">SPRINGFIELD IL 62708</td></tr>
    
<tr><td>Debtor</td></tr>
    <tr><td class="dName">ONE</td></tr>
    <tr><td class="dAddress">CLACKAMAS OR 97015</td></tr>

<tr><td>Secured Party</td></tr>

    <tr><td class="spName">ANY</td></tr>
    <tr><td class="spAddress">SPRINGFIELD IL 62708</td></tr>

The JavaScript code I'm using is:

fs.readFile('file.txt', 'utf8', function (err, data) {
        if (err) {
            console.log("Error reading file.txt", err);
            process.exit(1);
        }
        var cleanedHtml = /<tr><td>Debtor<\/td><\/tr>(.*?)<tr><td>Secured Party<\/td><\/tr>/g.exec(html);
        console.log(cleanedHtml[1]);
    });

It fails with this error:

 return cleanedHtml[1];
                      ^
TypeError: Cannot read property '1' of null

Is there any issue with my regex? Also, how can I have an end result like this:

PO
CLACKAMAS OR 97015

AS
SPRINGFIELD IL 62708
    
ONE
CLACKAMAS OR 97015

ANY
SPRINGFIELD IL 62708

Thanks.

CodePudding user response:

If you make sure that the tr elements are inside <table></table> then you can parse the string using DOMParser() after reading the file:

Demo:

var strHtml = `
  <table>
    <tr><td style="text-align: center;">Initial Filing</td></tr>

    <tr><td>Debtor</td></tr>

    <tr><td class="dName">PO</td></tr>
    <tr><td class="dAddress">CLACKAMAS OR 97015</td></tr>

    <tr><td>Secured Party</td></tr>

    <tr><td class="spName">AS</td></tr>
    <tr><td class="spAddress">SPRINGFIELD IL 62708</td></tr>

    <tr><td>Debtor</td></tr>
    <tr><td class="dName">ONE</td></tr>
    <tr><td class="dAddress">CLACKAMAS OR 97015</td></tr>

    <tr><td>Secured Party</td></tr>

    <tr><td class="spName">ANY</td></tr>
    <tr><td class="spAddress">SPRINGFIELD IL 62708</td></tr>
  </table>
  `

var doc = new DOMParser().parseFromString(strHtml, 'text/html');
var els = doc.querySelectorAll('.dName,.spName,.dAddress,.spAddress');
els.forEach((el) => {
  console.log(el.textContent);
});
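
Note that DOMParser is a browser API and, as far as I know, not a Node.js global, so in a Node script the demo above needs a DOM library such as jsdom. For markup as flat as the sample, a dependency-free alternative is to match the class-tagged cells directly. The following is only a sketch under that assumption: the names CELL_RE, extractCells and formatCells are made up for illustration, the pattern breaks on nested tags or extra attributes, and matchAll needs Node 12 or later.

```javascript
// Sketch: pull the class-tagged cells out of the raw string with a regex.
// Assumes flat markup exactly like the sample (no nested tags, no extra attributes).
const CELL_RE = /<td class="(dName|dAddress|spName|spAddress)">([^<]*)<\/td>/g;

function extractCells(html) {
  // matchAll yields every match together with its capture groups
  return Array.from(html.matchAll(CELL_RE), (m) => ({ cls: m[1], text: m[2].trim() }));
}

function formatCells(cells) {
  // One value per line, with a blank line after each address
  return cells
    .map((c) => (c.cls.endsWith('Address') ? c.text + '\n' : c.text))
    .join('\n')
    .trim();
}

// Shortened copy of the string from the question
const sample = [
  '<tr><td>Debtor</td></tr>',
  '    <tr><td class="dName">PO</td></tr>',
  '    <tr><td class="dAddress">CLACKAMAS OR 97015</td></tr>',
  '<tr><td>Secured Party</td></tr>',
  '    <tr><td class="spName">AS</td></tr>',
  '    <tr><td class="spAddress">SPRINGFIELD IL 62708</td></tr>',
].join('\n');

console.log(formatCells(extractCells(sample)));
```

In the script from the question, the string passed to extractCells would be the data argument of the fs.readFile callback instead of sample.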

CodePudding user response:

Should there not be brackets after console.log? Is cleanedHtml a list with more than one element? If not, there is no cleanedHtml[1].
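
To expand on why cleanedHtml ends up null: exec returns null when the pattern does not match, and in a JavaScript regex "." does not match line breaks unless the s flag is set, so (.*?) cannot span the lines between the two header rows. Note also that the callback parameter is named data, while the snippet passes html to exec; that only works if html is defined elsewhere. A minimal sketch of the difference, using a made-up sample string and the hypothetical names singleLine and multiLine:

```javascript
// Hypothetical sample with the same shape as the file contents
const text = [
  '<tr><td>Debtor</td></tr>',
  '    <tr><td class="dName">PO</td></tr>',
  '<tr><td>Secured Party</td></tr>',
].join('\n');

// "." does not match line breaks, so this pattern cannot span lines
const singleLine = /<tr><td>Debtor<\/td><\/tr>(.*?)<tr><td>Secured Party<\/td><\/tr>/;
// [\s\S] matches any character, including line breaks
const multiLine = /<tr><td>Debtor<\/td><\/tr>([\s\S]*?)<tr><td>Secured Party<\/td><\/tr>/;

const failed = singleLine.exec(text);  // no match, so exec returns null
const matched = multiLine.exec(text);  // array: [whole match, capture group 1]

console.log(failed);
// Guard against null before indexing into the result
if (matched !== null) {
  console.log(matched[1].trim());
}
```

Checking the result for null before reading index 1 avoids the TypeError even when the input does not contain the expected rows.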
