Home > Back-end >  Solved: Extract date from my Substack webpage to Google Sheets
Solved: Extract date from my Substack webpage to Google Sheets

Time:07-31

longtime lurker, first-time poster. I usually solve my issues & upvote without needing to post, but I've been stumped all weekend!

Edit: Erik solved it: I was looking for an answer to extract the "datePublished" or "dateModified" from a Substack article in a Google Sheet.

Goal: This will tell me when it was the last date/time I updated, for example, my PS5 restock guide, my Walmart PS5 restock guide, etc. If it's too stale, I try to add relevant information. Having it in Google Sheets makes it streamlined as there are dozens of guides.

Test Google Sheet: https://docs.google.com/spreadsheets/d/1hLBFMWCTc2hpC-1C8Sxd5OVREdNHTVTtrJsAAU5Jl94/edit#gid=0

I've done this before for other sites I've worked at, but there appears to be no date in the meta data on Substack :/ (I could be wrong, as I'm no expert at reading XPATH)

I do see this in the body for the linked example:

<time datetime="2022-07-29T11:52:00.000Z">Jul 29</time>

I've been trying things like this (where E17 is where I put the article URL in Google Sheets) to no effect.

=REGEXEXTRACT(IMPORTXML(E17, "//time[@datetime='datePublished']/@content"), "(. )T")

I've been mostly working off of this StackOverflow solution, but I haven't been able to apply the same finding to Substack's formatting.

CodePudding user response:

If you want to grab it directly using a Google Sheets formula, this should work for you:

=ArrayFormula(IFERROR(VLOOKUP("*",FLATTEN(IFERROR(REGEXEXTRACT(IMPORTXML("https://www.theshortcut.com/p/ps5-restock","//div[2]"),"Swider(.?.?.?.?\d\d{1}[hrago\s]*)"))),1,FALSE),"???"))

To set realistic expectations, I usually can't invest this much time into working out such a solution on this forum. But I'm on vacation at the moment and filling time while my guest is otherwise occupied.

One further note: this is specific to the two sites you gave as examples. It will only work for sites where the second <div> holds this information and only where the data exists as strings exactly like those found on these two sites (including the poster's last name as "Swider").

ADDENDUM:

Looking at this further, did you try simply the following?

=IMPORTXML(C2, "//time")

(assuming your URL is in C2, etc.)

This seems to work for me, given that it appears the date/time data you want is contained within the first <time> element on the web page.

  • Related