I'm trying to scrape this site to retrieve the years of each paper thats been published. I've managed to get titles to work but when it comes to scraping the years it returns none.
I've broken it down and the results of 'none' occur when its going into the for loop but I can't figure out why this happens when its worked with titles.
import requests
from bs4 import BeautifulSoup
URL = "https://dblp.org/search?q=web scraping"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(class_="publ-list")
paperResults = results.find_all(class_="data tts-content")
for singlepaper in paperResults:
paperyear = singlepaper.find(class_="datePublished")
print(paperyear)
When it goes to paperResults it gives the breakdown of the section I've selected within the results on the line above that. Any suggestions on how to retrieve the years would be greatly appreciated
CodePudding user response:
Try this:
import requests
from bs4 import BeautifulSoup
URL = "https://dblp.org/search?q=web scraping"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(class_="publ-list")
paperResults = results.find_all(class_="data tts-content")
for singlepaper in paperResults:
paperyear = singlepaper.find(attrs={"itemprop": "datePublished"})
print(paperyear)
It worked for me.
CodePudding user response:
Change this
for singlepaper in paperResults:
paperyear = singlepaper.find(class_="datePublished")
print(paperyear)
To this
for singlepaper in paperResults:
paperyear = singlepaper.find('span', itemprop="datePublished")
print(paperyear.string)
You were looking for a class when you needed to be parsing span... if you print paperResults
you will see that your datePublished
is an itemprop
in a span
element.