Scraping <meta> content with Cheerio

Time:06-22

Tanaike helped me with this amazing script using Cheerio. The original was for Letterboxd, but this one is to pull in a watchlist from Trakt.tv (sample watchlist).

As it is right now it pulls in the watched date and the title, but I'd also like to pull in the content from the meta tag for each item.

<meta content="8 Million Ways to Die (1986)" itemprop="name">

I tried using $('[itemprop="name"]').attr('content'); but it doesn't accept the .attr piece.

Here is the full script as it is now, returning the watched date in Col1 and the title in Col2.

/**
* Returns Trakt watchlist by username
* @param pages enter the number of pages in the list. Default is 10
* @customfunction
*/
function TRAKT(pages=10) {
    const username = `jerrylaslow`
    const maxPage = pages; 
    const reqs = [...Array(maxPage)].map((_, i) => ({ url: `https://trakt.tv/users/` + username + `/history/all/added?page=${i + 1}`, muteHttpExceptions: true }));
    return UrlFetchApp.fetchAll(reqs).flatMap((r, i) => {
      if (r.getResponseCode() != 200) {
        return [["Values couldn't be retrieved.", reqs[i].url]];
      }
      const $ = Cheerio.load(r.getContentText());
      const ar = $(`a.titles-link > h3.ellipsify, h4 > span.format-date`).toArray();
      return [...Array(Math.ceil(ar.length / 2))].map((_) => {
        const temp = ar.splice(0, 2);
        const watchDate = Utilities.formatDate(new Date($(temp[1]).text().trim().replace(/T|Z/g, " ")),"GMT","yyyy-MM-dd");
        const title = $(temp[0]).text().trim();
        return [watchDate,title];
      });
    });
  }

The values can be pulled with this, so I know there isn't any sort of blocking in play.

=IMPORTXML(
  "https://trakt.tv/users/jerrylaslow/history",
  "//meta[@itemprop='name']/@content")
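
For comparison, here is a rough plain-JavaScript sketch of what that XPath selects: the content attribute of every meta tag whose itemprop is "name". It deliberately uses regular expressions instead of Cheerio so that it runs without any library, and it is not the approach used in the script above. The helper name extractMetaNames and the sample markup are hypothetical, and the sketch assumes double-quoted attribute values.

```javascript
// Hypothetical helper: collect the content of every <meta itemprop="name"> tag.
// Regex-based for illustration only; assumes double-quoted attribute values.
function extractMetaNames(html) {
  const names = [];
  // Match each <meta ...> tag, then read its attributes regardless of order.
  for (const [tag] of html.matchAll(/<meta\b[^>]*>/gi)) {
    const attrs = {};
    for (const [, key, value] of tag.matchAll(/([\w-]+)\s*=\s*"([^"]*)"/g)) {
      attrs[key.toLowerCase()] = value;
    }
    if (attrs.itemprop === "name" && "content" in attrs) {
      names.push(attrs.content);
    }
  }
  return names;
}

// Made-up sample markup modeled on the tag quoted in the question.
const sampleHtml = `
  <div><meta content="8 Million Ways to Die (1986)" itemprop="name"></div>
  <div><meta itemprop="url" content="https://example.com/x"></div>
  <div><meta itemprop="name" content="Example Title (2000)"></div>`;

const metaNames = extractMetaNames(sampleHtml);
console.log(metaNames);
```

Note that this returns one value per matching tag, whereas a single .attr('content') call on a Cheerio selection only reads the first matched element.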

Any help is appreciated.

Answer:

To achieve your goal, how about modifying your script as follows?

Modified script:

function TRAKT(pages = 10) {
  const username = `jerrylaslow`
  const maxPage = pages;
  const reqs = [...Array(maxPage)].map((_, i) => ({ url: `https://trakt.tv/users/` + username + `/history/all/added?page=${i + 1}`, muteHttpExceptions: true }));
  return UrlFetchApp.fetchAll(reqs).flatMap((r, i) => {
    if (r.getResponseCode() != 200) {
      return [["Values couldn't be retrieved.", reqs[i].url]];
    }
    const $ = Cheerio.load(r.getContentText());
    const ar = $(`a.titles-link > h3.ellipsify, h4 > span.format-date`).toArray();
    return [...Array(Math.ceil(ar.length / 2))].map((_) => {
      const temp = ar.splice(0, 2);
      const c = $(temp[0]).parent('a').parent('div').parent('div').find('meta').toArray().find(ff => $(ff).attr("itemprop") == "name"); // Added
      const watchDate = Utilities.formatDate(new Date($(temp[1]).text().trim().replace(/T|Z/g, " ")), "GMT", "yyyy-MM-dd");
      const title = $(temp[0]).text().trim();
      return [watchDate, title, c ? $(c).attr("content") : ""]; // Modified
    });
  });
}
  • When this modified script is run, the value of content is placed in the 3rd column. If you want it in a different column, reorder the array in return [watchDate, title, c ? $(c).attr("content") : ""];.
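
The row-building step of the modified script can be sketched without Cheerio as follows. This is only an illustration under stated assumptions: plain objects stand in for the parsed nodes, the metaName field stands in for the result of the meta lookup, and the dates and second title are made up. As in the script, the flat array is assumed to alternate title, date, title, date.

```javascript
// Hypothetical stand-ins for the parsed nodes: titles and dates interleaved
// in the order the script expects (title, date, title, date, ...).
const nodes = [
  { text: "8 Million Ways to Die", metaName: "8 Million Ways to Die (1986)" },
  { text: "2022-06-20T01:00:00Z" },
  { text: "Example Title", metaName: undefined }, // no matching meta found
  { text: "2022-06-19T01:00:00Z" },
];

function toRows(flat) {
  const rest = [...flat]; // copy so the caller's array is not consumed by splice
  return [...Array(Math.ceil(rest.length / 2))].map(() => {
    // Take the next title/date pair off the front of the array.
    const [titleNode, dateNode] = rest.splice(0, 2);
    const watchDate = dateNode.text.slice(0, 10); // keep only yyyy-MM-dd
    // Same fallback as in the answer: empty string when no meta was found.
    return [watchDate, titleNode.text, titleNode.metaName ? titleNode.metaName : ""];
  });
}

const rows = toRows(nodes);
console.log(rows);
```

Each inner array becomes one spreadsheet row, so changing the order of the three elements in the returned array changes which column each value lands in.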