My question is similar to this one about Python, but, unlike it, mine is about JavaScript.
1. The problem
- I have a large list of web page URLs (about 10k) in plain text;
- For each URL (or at least for the majority of them) I need to find some metadata and the title;
- I do NOT want to load the full pages, only everything up to the closing </head> tag.
2. The questions
- Is it possible to open a stream, read some bytes and, upon reaching the </head> tag, close the stream and the connection? If so, how?
- Python's urllib.request.Request.read() takes a size argument in bytes, but JS's ReadableStreamDefaultReader.read() does not. What should I use in JS as an alternative?
- Will this approach reduce network traffic, bandwidth, CPU and memory usage?
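The approach described above can be sketched with standard WHATWG streams: read chunks with ReadableStreamDefaultReader.read(), accumulate text until "</head>" appears, then cancel() the reader so the rest of the body is never downloaded. This is a minimal sketch; headOfStream is a hypothetical helper name, and the demo feeds it a synthetic ReadableStream (standing in for fetch(url).body) so it runs without a network.

```javascript
// Read a stream chunk by chunk, stop as soon as "</head>" is seen,
// then cancel the reader so the remaining body is not consumed.
// "headOfStream" is a hypothetical helper, not a standard API.
async function headOfStream(stream) {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  let html = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    html += decoder.decode(value, { stream: true });
    const end = html.indexOf("</head>");
    if (end !== -1) {
      await reader.cancel(); // releases the underlying connection
      return html.slice(0, end + "</head>".length);
    }
  }
  return html; // no </head> found; return whatever arrived
}

// Demo with a synthetic stream standing in for fetch(url).body
const chunks = ["<html><head><tit", "le>Hi</title></he", "ad><body>big body</body></html>"];
const stream = new ReadableStream({
  start(controller) {
    for (const c of chunks) controller.enqueue(new TextEncoder().encode(c));
    controller.close();
  },
});

headOfStream(stream).then(head => console.log(head));
// prints: <html><head><title>Hi</title></head>
```

With a real request you would pass (await fetch(url)).body instead of the synthetic stream; cancelling the reader is what actually saves bandwidth, since the server stops being read from.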
CodePudding user response:
I don't know of a method that fetches only the head element of a response, but you can load the entire HTML document and then parse the head out of it, even though this is less efficient than a streaming approach. I made a basic app using axios and cheerio to get the head element from an array of URLs. I hope this might help someone.
const axios = require("axios")
const cheerio = require("cheerio")
const URLs = ["https://stackoverflow.com/questions/73191546/get-only-html-head-from-url"]
for (let i = 0; i < URLs.length; i++) {
axios.get(URLs[i])
.then(response => {
const document = response.data
// get the start index and the end index of the head
const startHead = document.indexOf("<head>")
const endHead = document.indexOf("</head>") + 7 // include the closing tag itself
//get the head as a string
const head = document.slice(startHead, endHead)
// load cheerio
const $ = cheerio.load(head)
// get the title from the head which is loaded into cheerio
console.log($("title").html())
})
.catch(e => console.log(e))
}
CodePudding user response:
Answer for question 2:
Try node-fetch's fetch(url, {size: 200}). Note that size is an upper limit rather than a read length: if the response body exceeds it, node-fetch rejects with a FetchError (type 'max-size') instead of returning a truncated body, so wrap the call in a try/catch.
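The size option only exists in node-fetch; the standard fetch API has no equivalent, but you can emulate it by counting bytes yourself and cancelling the reader at the limit. A minimal sketch, assuming Node 18+ globals (Response, TextDecoder); readLimited is a hypothetical helper, and the demo uses an in-memory Response standing in for a real fetch(url) so it runs offline.

```javascript
// Emulate a byte-size limit with standard WHATWG streams:
// read at most `limit` bytes from a body stream, then cancel the reader.
// "readLimited" is a hypothetical helper, not part of any library.
async function readLimited(body, limit) {
  const reader = body.getReader();
  const parts = [];
  let received = 0;
  while (received < limit) {
    const { done, value } = await reader.read();
    if (done) break;
    parts.push(value);
    received += value.byteLength;
  }
  await reader.cancel(); // stop downloading the rest of the body
  // Concatenate the chunks and trim to the limit.
  const bytes = new Uint8Array(parts.reduce((n, p) => n + p.byteLength, 0));
  let off = 0;
  for (const p of parts) { bytes.set(p, off); off += p.byteLength; }
  // Note: trimming on a byte boundary can split a multi-byte character.
  return new TextDecoder().decode(bytes.subarray(0, limit));
}

// Demo with an in-memory Response standing in for `await fetch(url)`
const res = new Response("<head><title>Hi</title></head>" + "x".repeat(10000));
readLimited(res.body, 200).then(text => console.log(text.length)); // prints 200
```

Unlike node-fetch's size, this never throws on oversized bodies; it simply returns the first limit bytes, which is usually enough to capture the head of a page.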