BeautifulSoup returns None when using find() for an element

Time:02-19

I'm trying to scrape this site to retrieve the year each paper was published. I've managed to get the titles, but when I scrape the years, find() returns None.

I've narrowed it down: the None results appear inside the for loop, but I can't figure out why, since the same approach worked for the titles.

import requests
from bs4 import BeautifulSoup

URL = "https://dblp.org/search?q=web scraping"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(class_="publ-list")
paperResults = results.find_all(class_="data tts-content")
for singlepaper in paperResults:
    paperyear = singlepaper.find(class_="datePublished")
    print(paperyear)

Printing paperResults shows the markup of the section I selected from results on the line above it. Any suggestions on how to retrieve the years would be greatly appreciated.

CodePudding user response:

Try this, which matches on the itemprop attribute instead of a class:

import requests
from bs4 import BeautifulSoup

URL = "https://dblp.org/search?q=web scraping"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(class_="publ-list")
paperResults = results.find_all(class_="data tts-content")
for singlepaper in paperResults:
    paperyear = singlepaper.find(attrs={"itemprop": "datePublished"})
    print(paperyear)

It worked for me.
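
The difference between the two lookups can be reproduced offline. This is a minimal sketch: the HTML below is a hypothetical fragment written for illustration (a span carrying itemprop="datePublished"), not the actual dblp markup.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the year sits in a span whose *itemprop* attribute
# is "datePublished"; no element has a class of that name.
html = '''
<cite class="data tts-content">
  <span class="title" itemprop="name">Some paper title.</span>
  <span itemprop="datePublished">2021</span>
</cite>
'''

soup = BeautifulSoup(html, "html.parser")
entry = soup.find(class_="data tts-content")

# Searching by class finds nothing, so find() returns None.
by_class = entry.find(class_="datePublished")

# Searching by the itemprop attribute finds the span.
by_attr = entry.find(attrs={"itemprop": "datePublished"})

print(by_class)
print(by_attr)
```

The first print shows None and the second shows the matching span tag, which is the same behaviour described in the question and in this answer.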

CodePudding user response:

Change this:

for singlepaper in paperResults:
    paperyear = singlepaper.find(class_="datePublished")
    print(paperyear)

To this:

for singlepaper in paperResults:
    paperyear = singlepaper.find('span', itemprop="datePublished")
    print(paperyear.string)

You were searching for a class, but datePublished is not a class. If you print paperResults, you will see that datePublished is the value of the itemprop attribute on a span element, so you need to match on that attribute.
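
Because find() returns None when nothing matches, calling .string on the result raises AttributeError for any entry that has no year. A minimal sketch that guards against this follows; the helper name extract_years and the sample HTML are hypothetical and only illustrate the pattern.

```python
from bs4 import BeautifulSoup


def extract_years(html):
    """Return the datePublished text per entry, or None where it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    years = []
    for entry in soup.find_all(class_="data tts-content"):
        tag = entry.find("span", itemprop="datePublished")
        # find() returns None when there is no match, so check before reading text.
        years.append(tag.get_text(strip=True) if tag is not None else None)
    return years


# Hypothetical sample: the second entry has no datePublished span.
sample = '''
<ul class="publ-list">
  <li><cite class="data tts-content"><span itemprop="datePublished">2021</span></cite></li>
  <li><cite class="data tts-content"><span class="title">No year here</span></cite></li>
</ul>
'''

years = extract_years(sample)
print(years)
```

Keeping a None placeholder for entries without a year keeps the list aligned with the list of entries; dropping those entries instead would also be reasonable if only the years matter.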
