Get only HTML <head> from URL

Time:08-03

My question is similar to this one about Python, but mine is about JavaScript.

1. The problem

  1. I have a large list of web page URLs (about 10k) in plain text;
  2. For each URL (or at least for most of them) I need to extract some metadata and the title;
  3. I do NOT want to load full pages, only everything before the closing </head> tag.

2. The questions

  1. Is it possible to open a stream, load some bytes and, upon reaching </head>, close the stream and the connection? If so, how?
  2. In Python, the response object returned by urllib.request.urlopen() has a read() method that takes a size argument in bytes, but JS's ReadableStreamDefaultReader.read() does not. What should I use in JS as an alternative?
  3. Will this approach reduce network traffic, bandwidth usage, and CPU and memory usage?

CodePudding user response:

I don't know of a way to fetch only the head element from a response, but you can load the entire HTML document and then extract the head from it, even though that is less efficient than stopping the download early. I made a basic script using axios and cheerio that gets the head element for each URL in an array. I hope it helps someone.

const axios = require("axios")
const cheerio = require("cheerio")

const URLs = ["https://stackoverflow.com/questions/73191546/get-only-html-head-from-url"] 

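for (let i = 0; i < URLs.length; i++) {
    axios.get(URLs[i])
    .then(response => {
        const html = response.data

        // get the start index and the end index of the head;
        // the regex also matches an opening tag with attributes, e.g. <head lang="en">
        const closeTag = "</head>"
        const startHead = html.search(/<head[\s>]/i)
        const endHead = html.indexOf(closeTag)

        // skip pages where no head could be located
        if (startHead === -1 || endHead === -1) {
            console.log(`No <head> found in ${URLs[i]}`)
            return
        }

        // get the head as a string, including the closing tag
        const head = html.slice(startHead, endHead + closeTag.length)

        // load cheerio
        const $ = cheerio.load(head)

        // get the title from the head which is loaded into cheerio
        console.log($("title").html())
    })
    .catch(e => console.log(e))
}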
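
If the goal is to stop downloading as soon as </head> arrives, a streaming variant might look like the sketch below. It is not tested against real sites and makes several assumptions: Node.js 18+ (built-in fetch, AbortController and TextDecoder), UTF-8 pages, and a maxBytes cap that is an arbitrary safeguard for pages that never send </head>. The names createHeadCollector and fetchHead are made up for this example.

```javascript
// Sketch only: assumes Node.js 18+ (global fetch, AbortController, TextDecoder).
// createHeadCollector and fetchHead are hypothetical helper names.

// Accumulates decoded text and reports when the closing </head> tag has arrived.
function createHeadCollector() {
  let text = "";
  return {
    // Appends a chunk; returns true once </head> is present in the accumulated text.
    push(chunk) {
      text += chunk;
      return /<\/head\s*>/i.test(text);
    },
    // Returns everything up to and including </head>, or all text if it never appeared.
    result() {
      const match = /<\/head\s*>/i.exec(text);
      return match ? text.slice(0, match.index + match[0].length) : text;
    },
  };
}

// Streams the response body and aborts the request once </head> has been seen
// or maxBytes have been received (the default cap is an arbitrary assumption).
async function fetchHead(url, maxBytes = 256 * 1024) {
  const controller = new AbortController();
  const response = await fetch(url, { signal: controller.signal });
  if (!response.body) return "";

  const reader = response.body.getReader();
  const decoder = new TextDecoder(); // assumes the page is UTF-8
  const collector = createHeadCollector();
  let received = 0;

  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      received += value.byteLength;
      const found = collector.push(decoder.decode(value, { stream: true }));
      if (found || received >= maxBytes) break;
    }
  } finally {
    // Abort so the rest of the body is not downloaded.
    controller.abort();
  }
  return collector.result();
}
```

The returned string can be passed to cheerio.load() exactly like the head variable above. How much traffic this saves depends on how much data the server and the network have already sent by the time the request is aborted, but it should at least avoid keeping whole pages in memory.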

CodePudding user response:

Answer for question 2:

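Try using node-fetch's fetch(url, {size: 200}). The size option sets the maximum response body size in bytes; as far as I can tell, a body larger than that produces an error rather than a truncated read.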

https://github.com/node-fetch/node-fetch#fetchurl-options
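
With the standard ReadableStreamDefaultReader there is no size argument, but a similar effect can be approximated by counting bytes while reading chunks. The sketch below is only an illustration: readAtMost is a made-up helper name, not part of the Streams API, and it assumes the reader yields Uint8Array chunks (as response.body.getReader() does).

```javascript
// Sketch only: readAtMost is a hypothetical helper, not part of the Streams API.
// It keeps reading chunks until at least maxBytes have arrived or the stream ends,
// then returns a Uint8Array holding at most the first maxBytes bytes.
async function readAtMost(reader, maxBytes) {
  const chunks = [];
  let received = 0;

  while (received < maxBytes) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
    received += value.byteLength;
  }

  // Copy the chunks into one buffer, dropping anything past the limit.
  const out = new Uint8Array(Math.min(received, maxBytes));
  let offset = 0;
  for (const chunk of chunks) {
    if (offset >= out.length) break;
    const part = chunk.subarray(0, out.length - offset);
    out.set(part, offset);
    offset += part.length;
  }
  return out;
}
```

A possible use would be const reader = response.body.getReader(); const bytes = await readAtMost(reader, 4096); followed by await reader.cancel() to tell the stream that no more data is wanted. The last chunk can overshoot the limit on the wire, so the number of bytes actually downloaded may be somewhat larger than maxBytes.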
