I wrote a web crawler in Node.js that sends GET requests to about 300 URLs. Here is the main loop:
for (let i = 1; i <= 300; i++) {
let page= `https://xxxxxxxxx/forum-103-${i}.html`
await getPage(page,(arr)=>{
console.log(`page ${i}`)
})
}
Here is the function getPage(url,callback):
export default async function getPage(url, callback) {
await https.get(url, (res) => {
let html = ""
res.on("data", data => {
html += data
})
res.on("end", () => {
const $ = cheerio.load(html)
let obj = {}
let arr = []
obj = $("#threadlisttableid tbody")
for (let i in obj) {
if (obj[i].attribs?.id?.substr(0, 6) === 'normal') {
arr.push(`https://xxxxxxx/${obj[i].attribs.id.substr(6).split("_").join("-")}-1-1.html`)
}
}
callback(arr)
console.log("success!")
})
})
.on('error', (e) => {
console.log(`Got error: ${e.message}`);
})
}
I use cheerio to parse the HTML and put all the information I need into a variable named 'arr'. After running normally for a while, the program starts reporting errors like this:
...
success!
page 121
success!
page 113
success!
page 115
success!
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
I have two questions:
1. What is the reason for the error? Is it because I am sending too many GET requests? How can I limit the request frequency?
2. As you can see, the order in which the pages are accessed is chaotic. How can I control it?
I have tried using other modules to send the GET requests (such as Axios), but that didn't work either.
CodePudding user response:
The HTTP requests are fired simultaneously because the loop does not wait for the previous request to finish: await is used incorrectly, since getPage never returns a promise that settles when the response has arrived. Controlling the loop properly will limit the request frequency.
for (let i = 1; i <= 300; i++) {
let page= `https://xxxxxxxxx/forum-103-${i}.html`
const arr = await getPage(page);
// use arr in the way you want
console.log(`page ${i}`);
}
export default async function getPage(url) {
// Declare a new promise, wait for the promise to resolve and return its value.
return await new Promise((reso, rej) => {
https.get(url, (res) => {
let html = ""
res.on("data", data => {
html += data
})
res.on("end", () => {
const $ = cheerio.load(html)
let obj = {}
let arr = []
obj = $("#threadlisttableid tbody")
for (let i in obj) {
if (obj[i].attribs?.id?.substr(0, 6) === 'normal') {
arr.push(`https://xxxxxxx/${obj[i].attribs.id.substr(6).split("_").join("-")}-1-1.html`)
}
}
reso(arr) // Resolve with arr (res is the HTTP response here, not the resolver)
console.log("success!")
})
})
.on('error', (e) => {
console.log(`Got error: ${e.message}`);
rej(e); // Reject the promise; a throw inside an event handler would not reach the caller
})
})
}
CodePudding user response:
As you can see, the order in which the pages are accessed is chaotic. How can I control it?
await only pauses execution when there is a promise on its right-hand side. https.get does not deal in promises; it returns a request object and reports its result through callbacks and events.
You could wrap it in a promise, but it would be easier to use an API that supports promises natively, such as node-fetch, axios, or Node.js's native fetch. (They all have APIs that are, IMO, easier to use than http.get in general, not just with regard to flow control.)
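A minimal sketch of the promise-based approach, assuming Node.js 18 or later, where fetch is available as a global. The names fetchHtml and crawlInOrder are hypothetical, and the optional fetchImpl parameter exists only so the functions can be tried without network access. The cheerio parsing from the question would be applied to the returned HTML.

```javascript
// Hypothetical helper: resolves with the page's HTML, rejects on failure.
// fetchImpl defaults to the global fetch (assumes Node.js 18+).
async function fetchHtml(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Request to ${url} failed with status ${response.status}`);
  }
  return response.text();
}

// Because fetchHtml returns a promise, await really pauses the loop,
// so the pages are requested one at a time and in order.
async function crawlInOrder(urls, fetchImpl = globalThis.fetch) {
  const pages = [];
  for (const url of urls) {
    pages.push(await fetchHtml(url, fetchImpl));
  }
  return pages;
}
```

With this shape, the loop from the question could call await fetchHtml(page) directly and log the page number afterwards.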
What is the reason for the error?
It isn't clear.
Is it because I am sending too many get requests?
That is a likely hypothesis.
How can I limit the request frequency?
Once you have your for loop working with promises, so that the requests are sent in series instead of in parallel, you can insert a sleep between requests.
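A sleep between requests could be sketched as follows. The names sleep and crawlWithDelay are hypothetical, getPage is assumed to be any function that returns a promise, and the one-second default is an arbitrary value, not a recommendation for this particular site.

```javascript
// Resolves after the given number of milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical driver: calls getPage for each URL in turn and pauses
// after every call. getPage is assumed to return a promise.
async function crawlWithDelay(urls, getPage, delayMs = 1000) {
  const results = [];
  for (const url of urls) {
    results.push(await getPage(url)); // wait for this page to finish
    await sleep(delayMs);             // then wait before the next request
  }
  return results;
}
```

Since each iteration waits for both the request and the pause, at most one request is in flight at any time.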
CodePudding user response:
ETIMEDOUT means that the operation timed out. Here, the connection to the server was not established within the allowed time, so no response was received.
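Node.js reports such system errors with a code property, so a timeout can be told apart from other failures. As a rough sketch only, one possible reaction is to wait and try again. The helper name retryOnTimeout and its default values are hypothetical, and task is assumed to be a function that returns a promise (for example, a promise-based getPage call).

```javascript
// Hypothetical helper: runs task() and retries only when it fails with ETIMEDOUT.
// maxAttempts and waitMs are arbitrary illustrative values.
async function retryOnTimeout(task, maxAttempts = 3, waitMs = 2000) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      const timedOut = Boolean(err) && err.code === 'ETIMEDOUT';
      // Give up on other errors, or once the attempts are used up.
      if (!timedOut || attempt >= maxAttempts) throw err;
      // Pause so the server is not contacted again immediately.
      await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
  }
}
```

Note that this relies on the error exposing code directly, as the errors emitted by https.get do; other HTTP clients may wrap the underlying error differently.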