Home > Software design >  Extract text within divs using Google Sheet
Extract text within divs using Google Sheet

Time:04-08

I am trying to extract headlines from a BostonGlobe website enter image description here

CodePudding user response:

The Boston Globe site's content, including all headlines and articles, is generated with dynamic script, not static HTML. So you will not be able to extract with IMPORTXML.

CodePudding user response:

I can offer a workaround.

Go to the actual webpage and hit Ctrl-A to select the entire webpage contents. Hit Ctrl-C to copy that to clipboard. Then go to Google Sheets, click on A1 of any blank sheet and hit Ctrl-V to copy all of the webpage contents there. It will run down Col A.

Now, in some cell of another empty column (say, C2 for now), place this formula:

=ArrayFormula(REGEXREPLACE(REGEXREPLACE(FILTER(A:A,REGEXMATCH(A:A,"\/ By")),JOIN("|",TEXT(DATE(2022,SEQUENCE(1,12),1),"mmm")),"~"),"~\s\d . $",""))

This should extract your headlines from the content in Col A.

The FILTER uses REGEXMATCH to extract anything with a "By" line.

The JOIN part forms a REGEX "OR listing" of all three-letter month names. (The year is not important; it just has to be "something" in order for DATE to extract the month names from. And the innermost REGEXREPLACE will replace those three-letter month names with a tilde (~) as a marker of what to delete beyond.

The outermost REGEXREPLACE will then delete from there to the end of those strings, leaving only the headers.

No guarantees on other pages, but this process and formula(s) should work on all similar headline-listing pages from this website.

If you also need the hyperlinks to remain active, that gets tricky. You'll have to do it in two columns:

In the above example, we placed the first formula in C2. In D2, you could place this formula (taken from the first formula):

=ArrayFormula(FILTER(A:A,REGEXMATCH(A:A,"\/ By")))

This will extract the entire headline, by-line and opening details as a long string, but it will retain the hyperlink. Just change the Col-D width to something narrow like 50, then select the entire Col D and apply Format > Wrapping > Clip to truncate the overflow.

If you want to break the dependence on the webpage "data drop" in Col A so that you can delete all that:

1.) Select the formula column(s) entirely and hit Ctrl-C to copy to clipboard.

2.) Open a new Word document (or Google Docs, but I suggest Word if you have it, since it's outside of the Google empire and will fully detach more facets). Hit Ctrl-V to paste the formula results.

3.) Then hit Ctrl-A to select what you just dropped in the Word document, hit Ctrl-C to copy it back to clipboard, and then go back to Google Sheets [with the formula column(s) still select] and hit Ctrl-V.

4.) You'll need then apply Format > Wrapping > Clip to truncate the overflow to clip if desired.

Then you can delete the Col-A data.

It may seem like a lot to do. But honestly, in practice, it should only take less than a minute from start to finish.

  • Related