How to scrape URLs that don't appear in the downloaded HTML? JavaScript might be the problem here

Time:03-12

I am trying to scrape some URLs from this homepage (www.globo.com). I can get the headline and other URLs, but some of them aren't in the HTML and can't be scraped with requests and lxml. I don't want to use Selenium or bs4 (BeautifulSoup) because the code will run on a Heroku server, and that would make everything more difficult.

The URLs I want to scrape are inside a div with these two classes: container and false. This is mandatory. I can easily scrape the other URLs, which sit in divs without the class "false".

Does anyone know how to scrape these URLs despite this problem? Or can someone recommend another library for this task (not bs4 or Selenium)?

import requests
import lxml.html

url = 'https://www.globo.com/'
page = requests.get(url)
doc = lxml.html.fromstring(page.content)
urls = doc.xpath('//div[@class="container false"]//a/@href')
print(urls)

This also doesn't work:

import requests
import lxml.html

url = 'https://www.globo.com/'
page = requests.get(url)
doc = lxml.html.fromstring(page.content)
urls = doc.xpath('//div[contains(@class, "container") and contains(@class, "false")]//a/@href')
print(urls)

Thank you

CodePudding user response:

Turns out the "missing" URLs are actually in the source, but you need to do a bit of digging.

Basically, these are loaded by JS from JSON embedded in the page. You can target the divs that hold the JSON and extract all the data for a given column.

Here's how to do that:

import json

import requests
from lxml import html

source = html.fromstring(requests.get('https://www.globo.com/').content)
columns = ["esporte", "jornalismo", "entretenimento"]

for column in columns:
    column_data = (
        json.loads(
            source.xpath(f'//div[@id="column-{column}"]')[0].get(f"data-{column}")
        )
    )
    for item in column_data:
        try:
            print(item["content"]["url"])
            print(f'Item id: {item["id"]}')
            print("-" * 120)
        except KeyError:
            continue

This should produce:

https://ge.globo.com/futebol/times/corinthians/noticia/2022/03/11/junior-moraes-e-aprovado-em-exames-cardiologicos-antes-de-assinar-com-o-corinthians.ghtml
Item id: 527df4d0-2310-4c6c-bda7-7215e2c43ce2
------------------------------------------------------------------------------------------------------------------------
https://ge.globo.com/futebol/times/sao-paulo/noticia/2022/03/11/passo-a-passo-entenda-a-polemica-entre-ceni-diego-costa-e-o-medico-do-sao-paulo-no-classico.ghtml
Item id: 6516b867-c2ca-412b-9a7b-52aca2a58b2d
------------------------------------------------------------------------------------------------------------------------
https://ge.globo.com/pe/futebol/noticia/2022/03/11/joelinton-diz-que-nao-conhece-oasis-e-sugere-alceu-valenca-para-musica-da-torcida-do-newcastle.ghtml
Item id: 3d30a7a6-2e13-44a0-957f-85e9ccd4a389
------------------------------------------------------------------------------------------------------------------------
https://oglobo.globo.com/esportes/futebol/apresentado-no-botafogo-piazon-mostra-empolgacao-com-projeto-da-saf-expectativa-grande-25428998?utm_source=globo.com&utm_medium=oglobo
Item id: f33b3e35-a9b9-4f0d-bb9e-95f145d54046
------------------------------------------------------------------------------------------------------------------------
https://ge.globo.com/futebol/times/corinthians/noticia/2022/03/11/joao-victor-cita-intensidade-maior-nos-treinos-do-corinthians-e-cre-em-evolucao-mais-adaptados.ghtml
Item id: c1306207-e4af-41da-ac65-b4bdc5bc6489
------------------------------------------------------------------------------------------------------------------------
https://extra.globo.com/famosos/jogador-douglas-luiz-da-selecao-brasileira-namora-companheira-de-clube-na-inglaterra-casal-posta-cliques-romanticos-25427740.html
Item id: 44de3874-5143-48a9-89ad-3da8b4c5e0d7
------------------------------------------------------------------------------------------------------------------------
https://ge.globo.com/am/futebol/times/amazonas-fc/noticia/2022/03/11/atacante-walter-ex-santa-cruz-e-anunciado-pelo-amazonas-fc.ghtml
Item id: d98971c4-b220-4c69-bc92-8c877a389951
------------------------------------------------------------------------------------------------------------------------
https://ge.globo.com/programas/verao-espetacular/noticia/2022/03/11/tecnologia-ajuda-surfistas-na-busca-por-ondulacoes-historicas-em-nazare.ghtml
Item id: 2737af2e-ee76-41c2-852c-cc0f0d00e01a
------------------------------------------------------------------------------------------------------------------------
https://revistaquem.globo.com/QUEM-News/noticia/2022/03/popo-tatua-cena-de-luta-com-whindersson-na-pele-e-no-coracao.html
Item id: 1505eee1-18df-4fa8-aa53-77feb10a5129
------------------------------------------------------------------------------------------------------------------------
https://ge.globo.com/combate/noticia/2022/03/11/ufc-marreta-e-ankalaev-batem-peso-rapido-para-luta-no-sabado.ghtml
Item id: c87ceed9-e9c1-47e5-a269-406b0c4a7636
------------------------------------------------------------------------------------------------------------------------
https://g1.globo.com/ce/ceara/noticia/2022/03/11/tres-dias-de-viagem-e-minha-irma-so-chorando-chamando-o-nome-da-minha-mae-diz-garoto-que-viajou-sem-responsavel-de-sao-paulo-ao-ceara.ghtml
Item id: 4379555e-9892-4e43-998a-567f8f4f1eb5
------------------------------------------------------------------------------------------------------------------------
https://g1.globo.com/es/espirito-santo/noticia/2022/03/11/brasileiro-e-preso-na-tailandia-com-cocaina-diluida-em-produtos-de-beleza.ghtml
Item id: e914e223-cb1a-43bb-89f5-eaafc6b475fa
------------------------------------------------------------------------------------------------------------------------
https://revistacrescer.globo.com/Saude/noticia/2022/03/apos-vencer-covid-19-e-um-quadro-de-pneumonia-menina-de-3-anos-sai-do-hospital-e-corre-para-abracar-prima.html
Item id: f60fdd89-01da-44c2-b131-ae132e1345c2
------------------------------------------------------------------------------------------------------------------------
https://g1.globo.com/pr/norte-noroeste/noticia/2022/03/11/justica-nega-posse-de-professora-sem-vacina-contra-covid-para-dar-aulas-na-rede-municipal-de-londrina.ghtml
Item id: 461a0f34-51fa-419b-8c72-80234ce05302
------------------------------------------------------------------------------------------------------------------------
https://g1.globo.com/to/tocantins/noticia/2022/03/11/mauro-carlesse-se-pronuncia-nas-redes-sociais-apos-renuncia-cheguei-no-limite.ghtml
Item id: d1f0e0fb-7ac5-4975-a7ba-ef7746099073
------------------------------------------------------------------------------------------------------------------------
https://revistagalileu.globo.com/Um-So-Planeta/noticia/2022/03/novas-observacoes-mostram-que-gelo-do-artico-afinou-nos-ultimos-3-anos.html
Item id: 5cd0e336-970f-48b6-a2cd-db59cab98964
------------------------------------------------------------------------------------------------------------------------
https://g1.globo.com/sp/ribeirao-preto-franca/noticia/2022/03/11/homem-fotografa-partes-intimas-de-mulher-de-saia-em-loja-de-sertaozinho-sp-video.ghtml
Item id: 6521e28a-fd0a-4666-864d-658d119ff31f
------------------------------------------------------------------------------------------------------------------------
https://g1.globo.com/ba/bahia/noticia/2022/03/11/passageiros-relatam-problema-nas-duas-linhas-do-metro-de-salvador.ghtml
Item id: 29428049-e314-41d0-a693-f2b71b259c79
------------------------------------------------------------------------------------------------------------------------
https://autoesporte.globo.com/curiosidades/noticia/2022/03/maior-carro-do-mundo-tem-26-rodas-heliponto-e-pode-levar-ate-75-pessoas.ghtml
Item id: 79fbc28c-4629-4403-bfc8-7e4511c33d8b
------------------------------------------------------------------------------------------------------------------------
https://revistacrescer.globo.com/Gravidez/noticia/2022/03/coercao-reprodutiva-em-documentario-mulheres-contam-que-parceiros-esconderam-suas-pilulas-anticoncepcionais-e-furaram-preservativos.html
Item id: 471eb31e-b5ae-4d9d-ae12-e9a6e74d1cd3
------------------------------------------------------------------------------------------------------------------------
https://g1.globo.com/fantastico/noticia/2022/03/11/uma-tarde-com-jade-apos-deixar-o-bbb-22-influencer-curtiu-praia-no-rio-e-atendeu-fas.ghtml
Item id: 5d4f5867-5001-498e-a1fe-d189a7adaed2
------------------------------------------------------------------------------------------------------------------------
https://revistaquem.globo.com/QUEM-News/noticia/2022/03/felipe-roque-curte-praia-com-atriz-sofia-starling-ex-de-andre-marques.html
Item id: fa1ead5a-0f95-43d0-bca8-ebd754421104
------------------------------------------------------------------------------------------------------------------------
https://revistaquem.globo.com/QUEM-Inspira/noticia/2022/03/entenda-frontoplastia-procedimento-para-diminuir-testa-feito-pela-ex-bbb-thais-braz.html
Item id: 0a4cebdf-0093-46b0-a192-b34c56e31e44
------------------------------------------------------------------------------------------------------------------------
https://vogue.globo.com/celebridade/noticia/2022/03/gabi-martins-confirma-que-ficou-felipe-neto-mas-descarta-relacionamento-estamos-solteiros.html
Item id: 60b7edde-0ef1-4280-a804-dcccb70c8197
------------------------------------------------------------------------------------------------------------------------
https://revistaquem.globo.com/QUEM-News/noticia/2022/03/jamie-lee-curtis-mostra-corpo-real-em-novo-papel-chupava-barriga-desde-os-11-anos.html
Item id: 07b8250e-4098-44a9-8bac-f70913867aa8
------------------------------------------------------------------------------------------------------------------------
https://gshow.globo.com/tudo-mais/tv-e-famosos/noticia/pergunta-de-susana-vieira-no-encontro-bomba-na-web-posso-falar-mal.ghtml
Item id: 08aacc1c-abd3-477a-a8b9-ffd5d1ab174f
------------------------------------------------------------------------------------------------------------------------
https://revistaquem.globo.com/Entrevista/noticia/2022/03/titi-muller-sobre-relacao-com-o-ex-marido-gente-quer-ver-o-outro-feliz.html
Item id: 05c58354-7b1b-4bd7-b8b6-48459cbfeec0
------------------------------------------------------------------------------------------------------------------------
https://gshow.globo.com/novelas/um-lugar-ao-sol/vem-por-ai/noticia/um-lugar-ao-sol-christianrenato-fica-entre-os-ciumes-de-barbara-e-as-exigencias-de-stephany.ghtml
Item id: f39cce20-143b-4ba7-a728-86bacedde3e0
------------------------------------------------------------------------------------------------------------------------
https://glamour.globo.com/lifestyle/noticia/2022/03/deborah-secco-exibe-marquinha-de-biquini-na-praia-e-ganha-elogio-do-marido.ghtml
Item id: f3d60eb1-3329-48f3-8eb1-7647ca353558
------------------------------------------------------------------------------------------------------------------------
https://glamour.globo.com/lifestyle/noticia/2022/03/kim-kardashian-compartilha-primeira-foto-no-instagram-ao-lado-de-pete-davidson.ghtml
Item id: 284d4a38-12cd-45ce-850a-ad436512444a
------------------------------------------------------------------------------------------------------------------------

NOTE: Some items have an ID but no URL; these are usually widgets. Hence the try-except.
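
If you'd rather collect the URLs than print them, the same idea can be wrapped in a function. What follows is only a sketch: the function name extract_column_urls and the sample HTML are made up for illustration, and it assumes the page keeps the layout described above (a div with id "column-<name>" whose "data-<name>" attribute holds a JSON array). It runs against an inline sample, so no network access is needed to try it.

```python
import json
from html import escape

from lxml import html

# Hypothetical stand-in for the real page: one column div whose
# data-esporte attribute holds a JSON array. The second item mimics
# a widget, which has an id but no URL.
sample_items = [
    {"id": "a1", "content": {"url": "https://example.com/one"}},
    {"id": "w1"},
    {"id": "a2", "content": {"url": "https://example.com/two"}},
]
SAMPLE_HTML = (
    "<html><body>"
    f'<div id="column-esporte" data-esporte="{escape(json.dumps(sample_items))}"></div>'
    "</body></html>"
)


def extract_column_urls(page_source, columns):
    """Return {column: [urls]} read from JSON embedded in data-<column> attributes."""
    doc = html.fromstring(page_source)
    result = {}
    for column in columns:
        divs = doc.xpath(f'//div[@id="column-{column}"]')
        raw = divs[0].get(f"data-{column}") if divs else None
        if raw is None:
            # Column div (or its data attribute) is missing from this page.
            result[column] = []
            continue
        urls = []
        for item in json.loads(raw):
            url = (item.get("content") or {}).get("url")
            if url:  # skip widgets: they have an id but no URL
                urls.append(url)
        result[column] = urls
    return result


urls_by_column = extract_column_urls(SAMPLE_HTML, ["esporte", "jornalismo"])
print(urls_by_column)
```

For the live site you would pass requests.get('https://www.globo.com/').content instead of SAMPLE_HTML. A column that isn't on the page comes back as an empty list rather than raising IndexError, which may be handy if the site changes its layout.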
