Reading a web page with node.js and urllib

Time:10-08

I'm learning programming and have hit a wall: the code from the tutorial isn't working and I can't understand why. It's a Node.js script, run from the shell, that's supposed to retrieve a Wikipedia page, strip out the references, and return just the paragraph text. It uses the urllib library. The only difference between the code below and the tutorial's is that I use fs to write the page content to a text file. The rest is copied and pasted.

#!/usr/local/bin/node

// Returns the paragraphs from a Wikipedia link, stripped of reference numbers.

let urllib = require("urllib");
let url = process.argv[2];
let fs = require("fs");

console.log(url);

const jsdom = require("jsdom");
const { JSDOM } = jsdom;

urllib.request(url, { followRedirect: true }, function(error, data, response) {
  let body = data.toString();
  // Simulate a Document Object Model.
  let { document } = (new JSDOM(body)).window;

  // Grab all the paragraphs and references.
  let paragraphs = document.querySelectorAll("p");
  let references = document.querySelectorAll(".reference");

  // Remove any references.
  references.forEach(function(reference) {
    reference.remove();
  });

  // Print out all of the paragraphs.
  paragraphs.forEach(function(paragraph) {
    console.log(paragraph.textContent);
    fs.appendFileSync("article.txt", `${paragraph}\n`);
  });

});

My first guess was that urllib wasn't working for some reason. Even though I installed it as per the official documentation, typing which urllib at the command line doesn't return a path. On the other hand, node doesn't raise an error about require("urllib") when I run the file.

The actual output is the following:

$ ./wikp https://es.wikipedia.org/wiki/JavaScript
https://es.wikipedia.org/wiki/JavaScript
$ 

Can anybody help please?

Answer:

I think the tutorial you followed might have been a little out of date. This works for me:

let urllib = require("urllib");
let url = process.argv[2];
let fs = require("fs");

console.log(url);

const jsdom = require("jsdom");
const { JSDOM } = jsdom;


urllib.request(url, { followRedirect: true }).then(({data, res}) => {
  let body = data.toString();
  // Simulate a Document Object Model.
  let { document } = (new JSDOM(body)).window;

  // Grab all the paragraphs and references.
  let paragraphs = document.querySelectorAll("p");
  let references = document.querySelectorAll(".reference");

  // Remove any references.
  references.forEach(function(reference) {
    reference.remove();
  });

  // Print out all of the paragraphs.
  paragraphs.forEach(function(paragraph) {
    console.log(paragraph.textContent);
    fs.appendFileSync("article.txt", `${paragraph.textContent}\n`);
  });
});

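The package you are using (urllib) now returns a promise from request(). As far as I can tell, it no longer calls a function passed as the third argument, which would explain why your script prints the URL and then exits with no output and no error. The API was probably callback-based when the tutorial was written. Note that I also changed `${paragraph}` to `${paragraph.textContent}` in the appendFileSync call; otherwise the file gets the string form of the element object rather than the paragraph text.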
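
To illustrate why the original script failed silently, here is a minimal sketch that needs no network access and no packages. fakeRequest is a hypothetical stand-in, not urllib's real implementation; it only assumes that the installed version behaves the same way, that is, it returns a promise and never looks at a third argument.

```javascript
// Hypothetical stand-in for a promise-based request function.
// It declares two parameters only, so any callback passed as a third
// argument is silently ignored (no error, no call).
function fakeRequest(url, options) {
  return Promise.resolve({
    data: Buffer.from(`<p>Hello from ${url}</p>`),
    res: { status: 200 },
  });
}

async function demo() {
  const log = [];

  // Callback style, as in the tutorial: the function is never invoked.
  fakeRequest("https://example.org", {}, function () {
    log.push("callback ran");
  });

  // Promise style, as in the working version above.
  const { data, res } = await fakeRequest("https://example.org", {});
  log.push(`promise resolved with status ${res.status}`);

  return log;
}

// The resulting log holds only the promise entry; the callback never ran.
demo().then((log) => console.log(log));
```

In the real script it is also worth adding a .catch() to the promise chain (or a try/catch around await) so that network errors are reported instead of being lost.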

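About the which urllib check in the question: which only searches your PATH for executable programs, whereas a library installed with npm normally sits in a node_modules folder and is found by require. An empty result from which therefore says nothing about whether the library is installed. A rough way to ask Node itself is sketched below; locateModule is a hypothetical helper name, and "urllib" is assumed to have been installed in the script's folder.

```javascript
// Returns the location Node resolves a module to, or null if it cannot find it.
function locateModule(name) {
  try {
    return require.resolve(name);
  } catch (err) {
    // Node reports a missing module with this error code.
    if (err.code === "MODULE_NOT_FOUND") return null;
    throw err;
  }
}

console.log(locateModule("fs"));     // built-in modules resolve to their own name
console.log(locateModule("urllib")); // a path under node_modules if installed, otherwise null
```

If the second line prints a path, the library is installed and the problem lies elsewhere, as it did here.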