SOLVED BY @Barry the Platipus (see solution at the end)
I am trying to get "Apertura" from the following HTML:
I can't seem to get anything. I have been looking for a while and understand that a tag has no attribute "cmp-ltrk". So I have tried to get the above and then try to access the "cmp-ltrk" from there but I am just not succeeding.
What should I do?
This is my code when I try to get the "cmp-ltrk" directly from the "a" tag:
url = "https://www.eldiario.es/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
a_tags = soup.find_all("a")
for i in a_tags:
print(i)
Which gives me:
<a href="https://www.eldiario.es/sociedad/educacion-no-recuperarse-recortes-pp-impuso-decreto-decada_1_9559396.html">
Educación
</a>
And this is the code when trying from the "h2":
url = "https://www.eldiario.es/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
a_tags = soup.find_all("h2", {"class": "ni-title"})
for i in a_tags:
print(i)
I get something like:
<h2 ><a href="/sociedad/educacion-no-recuperarse-recortes-pp-impuso-decreto-decada_1_9559396.html">La Educación no consigue recuperarse de los recortes que el PP impuso por decreto hace una década</a></h2>
But then I never see the "cmp-ltrk" I am looking for and don't know how to access it.
Note that in both loops I have tried literally every single method for i (i.text, i.next, etc)
Thank you in advance.
SOLUTION
"Attributes like cmp-ltrk are added dynamically by javascript, once page loads. Requests cannot see them. You need playwright/selenium for this task (less complex solution), or you can go for a js2py solution (more complex). – Barry the Platipus"
This code worked:
df = pd.DataFrame(columns=["seccion","titulo","link","cmp"])
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(chrome_options=options)
url = "https://www.eldiario.es/"
driver.get(url)
page_source = driver.page_source
soup = BeautifulSoup(page_source, "lxml")
tags = soup.find_all("h2", {"class": "ni-title"})
for i in tags:
l=[]
i.a["href"]
cmp = i.a["cmp-ltrk"]
href = i.a["href"]
if href.startswith("http"):
link = href
else:
link = "https://www.eldiario.es{0}".format(i.a["href"])
titulo = i.text
seccion = i.a["href"].split("/")[1]
l.append([seccion,titulo,link,cmp])
df_ = pd.DataFrame(l,columns=["seccion","titulo","link","cmp"])
df = pd.concat([df,df_],axis=0)
CodePudding user response:
The following is a workaround to get all links with their respective cmp-ltrk
attribute:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
big_list = []
url = "https://www.eldiario.es/"
headers = {
'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
def has_specific_attr(tag):
return tag.has_attr('data-mrf-recirculation')
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
all_elems_with_attrs = soup.find_all(has_specific_attr)
for x in all_elems_with_attrs:
cmp_ltrk_attribute = x.get('data-mrf-recirculation')
urls = x.select('a')
for url in urls:
big_list.append((cmp_ltrk_attribute, url.get_text(strip=True), url.get('href')))
df = pd.DataFrame(big_list, columns = ['CMP-LTRK', 'Title', 'Link'])
print(df)
Result in terminal:
CMP-LTRK Title Link
0 hoy-hablamos-de Educación https://www.eldiario.es/sociedad/educacion-no-recuperarse-recortes-pp-impuso-decreto-decada_1_9559396.html
1 hoy-hablamos-de Rebajas fiscales https://www.eldiario.es/economia/rebajas-fiscales-evidencian-ideologia-impuestos-derecha-favorece-ricos-izquierda-trabajadores_1_9583778.html
2 hoy-hablamos-de Feijóo https://www.eldiario.es/internacional/ultima-hora-invasion-rusa-https://www.eldiario.es/politica/feijoo-cobija-bipartidismo-elite-empresarial-ofensiva-fiscal-sanchez_129_9587317.html-directo-1-de-octubre_6_9586346_1094209.html
3 hoy-hablamos-de Lula da Silva https://www.eldiario.es/internacional/lula-enfrenta-bolsonaro-duelo-historico-brasil_1_9567743.html
4 hoy-hablamos-de Liz Truss https://www.eldiario.es/politica/ansia-reducir-impuestos-liz-truss-hunda-economia-britanica-dias_129_9583720.html
... ... ... ...
677 Navegamos-con https://www.yorokobu.es/monotasking/
678 Navegamos-con Monotasking, la revolución de hacer una sola cosa a la vez https://www.yorokobu.es/monotasking/
679 Navegamos-con /responsables/
680 Navegamos-con https://www.eldiario.es/responsables/ventajas-darse-alta-autonomo-discapacidad_1_9160670.html
681 Navegamos-con Estas son las ventajas de darse de alta de autónomo con discapacidad https://www.eldiario.es/responsables/ventajas-darse-alta-autonomo-discapacidad_1_9160670.html
682 rows × 3 columns
The solution is based on inspecting the page loaded without Javascript, and searching for term Apertura
: the only element with such attribute value is a div - attribute being data-mrf-recirculation
. The assumption was that all links in that div will be enriched with cmp-ltrk="<the data-mrf-recirculation value of the parent element>"
.