Home > other >  BeautifulSoup: how to get everything inside <a> tag?
BeautifulSoup: how to get everything inside <a> tag?

Time:10-03

SOLVED BY @Barry the Platipus (see solution at the end)

I am trying to get "Apertura" from the following HTML:

HTML

I can't seem to get anything. I have been looking for a while and understand that a tag has no attribute "cmp-ltrk". So I have tried to get the above and then try to access the "cmp-ltrk" from there but I am just not succeeding.

What should I do?

This is my code when I try to get the "cmp-ltrk" directly from the "a" tag:

url = "https://www.eldiario.es/"

response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

a_tags = soup.find_all("a")

for i in a_tags:
    print(i)

Which gives me:

<a  href="https://www.eldiario.es/sociedad/educacion-no-recuperarse-recortes-pp-impuso-decreto-decada_1_9559396.html">
            Educación
        </a>

And this is the code when trying from the "h2":

url = "https://www.eldiario.es/"

response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

a_tags = soup.find_all("h2", {"class": "ni-title"})

for i in a_tags:
    print(i)

I get something like:

<h2 ><a href="/sociedad/educacion-no-recuperarse-recortes-pp-impuso-decreto-decada_1_9559396.html">La Educación no consigue recuperarse de los recortes que el PP impuso por decreto hace una década</a></h2>

But then I never see the "cmp-ltrk" I am looking for and don't know how to access it.

Note that in both loops I have tried literally every single method for i (i.text, i.next, etc)

Thank you in advance.


SOLUTION

"Attributes like cmp-ltrk are added dynamically by javascript, once page loads. Requests cannot see them. You need playwright/selenium for this task (less complex solution), or you can go for a js2py solution (more complex). – Barry the Platipus"

This code worked:

df = pd.DataFrame(columns=["seccion","titulo","link","cmp"])

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(chrome_options=options)

url = "https://www.eldiario.es/"
driver.get(url)

page_source = driver.page_source

soup = BeautifulSoup(page_source, "lxml")

tags = soup.find_all("h2", {"class": "ni-title"})    

for i in tags:
    l=[]

    i.a["href"]
    cmp = i.a["cmp-ltrk"]
    href = i.a["href"]

    if href.startswith("http"):
        link = href
    else:
        link = "https://www.eldiario.es{0}".format(i.a["href"])

    titulo = i.text
    seccion = i.a["href"].split("/")[1]   
    l.append([seccion,titulo,link,cmp])
    df_ = pd.DataFrame(l,columns=["seccion","titulo","link","cmp"])
    df = pd.concat([df,df_],axis=0)

CodePudding user response:

The following is a workaround to get all links with their respective cmp-ltrk attribute:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
big_list = []
url = "https://www.eldiario.es/"

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
    }

def has_specific_attr(tag):
    return tag.has_attr('data-mrf-recirculation')

r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
all_elems_with_attrs = soup.find_all(has_specific_attr)
for x in all_elems_with_attrs:
    cmp_ltrk_attribute = x.get('data-mrf-recirculation')
    urls = x.select('a')
    for url in urls:
        big_list.append((cmp_ltrk_attribute, url.get_text(strip=True), url.get('href')))
df = pd.DataFrame(big_list, columns = ['CMP-LTRK', 'Title', 'Link'])
print(df)

Result in terminal:

    CMP-LTRK    Title   Link
0   hoy-hablamos-de Educación   https://www.eldiario.es/sociedad/educacion-no-recuperarse-recortes-pp-impuso-decreto-decada_1_9559396.html
1   hoy-hablamos-de Rebajas fiscales    https://www.eldiario.es/economia/rebajas-fiscales-evidencian-ideologia-impuestos-derecha-favorece-ricos-izquierda-trabajadores_1_9583778.html
2   hoy-hablamos-de Feijóo  https://www.eldiario.es/internacional/ultima-hora-invasion-rusa-https://www.eldiario.es/politica/feijoo-cobija-bipartidismo-elite-empresarial-ofensiva-fiscal-sanchez_129_9587317.html-directo-1-de-octubre_6_9586346_1094209.html
3   hoy-hablamos-de Lula da Silva   https://www.eldiario.es/internacional/lula-enfrenta-bolsonaro-duelo-historico-brasil_1_9567743.html
4   hoy-hablamos-de Liz Truss   https://www.eldiario.es/politica/ansia-reducir-impuestos-liz-truss-hunda-economia-britanica-dias_129_9583720.html
... ... ... ...
677 Navegamos-con       https://www.yorokobu.es/monotasking/
678 Navegamos-con   Monotasking, la revolución de hacer una sola cosa a la vez  https://www.yorokobu.es/monotasking/
679 Navegamos-con       /responsables/
680 Navegamos-con       https://www.eldiario.es/responsables/ventajas-darse-alta-autonomo-discapacidad_1_9160670.html
681 Navegamos-con   Estas son las ventajas de darse de alta de autónomo con discapacidad    https://www.eldiario.es/responsables/ventajas-darse-alta-autonomo-discapacidad_1_9160670.html
682 rows × 3 columns

​The solution is based on inspecting the page loaded without Javascript, and searching for term Apertura: the only element with such attribute value is a div - attribute being data-mrf-recirculation. The assumption was that all links in that div will be enriched with cmp-ltrk="<the data-mrf-recirculation value of the parent element>".

  • Related