I have this website that I want to get the element "datePublished", but this information is inside a Json. Is there anyway to get this information through BeautifulSoup?
This is the section of html I'm talking about:
<script type="application/ld json">{ "@context": "http://schema.org", "@type": "NewsArticle", "mainEntityOfPage": "http://cdn.ampproject.org/article-metadata.html", "headline": "Startup Ebanx demite 340 e amplia 'crise dos unicórnios'", "datePublished": "2022-06-21T14:31:57-03:00", "dateModified": "2022-06-21T18:44:45-03:00", "description": "Empresa de Curitiba cortou 20% do quadro de funcionários diante de mudanças no cenário macroeconômico", "author": { "@type": "Person", "name": "Guilherme Guerra" }, "image": { "@type": "ImageObject", "url": [ "https://img.estadao.com.br/fotos/crop/1200x1200/resources/jpg/6/4/1655828329546.jpg", "https://img.estadao.com.br/fotos/crop/1200x900/resources/jpg/6/4/1655828329546.jpg", "https://img.estadao.com.br/fotos/crop/1200x675/resources/jpg/6/4/1655828329546.jpg" ]}, "publisher": { "@type": "NewsMediaOrganization", "name": "Estadão", "foundingDate" : "1875-01-05", "ethicsPolicy" : "https://www.estadao.com.br/codigo-etica/codigo-de-etica.pdf", "missionCoveragePrioritiesPolicy" : "https://www.estadao.com.br/codigo-etica/codigo-de-etica.pdf", "diversityPolicy" : "https://www.estadao.com.br/codigo-etica/codigo-de-etica.pdf", "correctionsPolicy" : "https://www.estadao.com.br/codigo-etica/codigo-de-etica.pdf", "verificationFactCheckingPolicy" : "https://www.estadao.com.br/codigo-etica/codigo-de-etica.pdf", "unnamedSourcesPolicy" : "https://www.estadao.com.br/codigo-etica/codigo-de-etica.pdf", "sameAs":["https://twitter.com/estadao","https://www.facebook.com/estadao/","https://www.instagram.com/estadao/","https://www.youtube.com/channel/UCrtOL8bJsh-csozGS2aV77Q", "https://plus.google.com/ Estadão"], "logo": { "@type": "ImageObject", "url": "https://statics.estadao.com.br/s2016/portal/logos/logo-estadao-272x59.png", "width": 272, "height": 59 } }, "isAccessibleForFree":"False","hasPart":{"@type":"WebPageElement","isAccessibleForFree":"False","cssSelector":".pw-container"},"isPartOf":{"@type":["CreativeWork","Product"],"name":"Estad\u00e3o","productID":"estadao.com.br:dig_basic"}}</script>
This is a working code to get to that information:
import requests
from bs4 import BeautifulSoup
link = "https://link.estadao.com.br/noticias/inovacao,startup-ebanx-demite-340-e-amplia-crise-dos-unicornios,70004097585"
soup = BeautifulSoup(requests.get(link).text, 'html.parser')
url = soup.find("script", type="application/ld json")
url
CodePudding user response:
Try this:
import json
import requests
from bs4 import BeautifulSoup
link = "https://link.estadao.com.br/noticias/inovacao,startup-ebanx-demite-340-e-amplia-crise-dos-unicornios,70004097585"
soup = (
BeautifulSoup(requests.get(link).text, 'html.parser')
.find("script", type="application/ld json").string
)
print(json.loads(soup)["datePublished"])
Output:
2022-06-21T14:31:57-03:00