How can I extract book names (strings) from scraped html?

I'm fairly new to web scraping and I'm trying to extract the list of books available at https://www.goodreads.com/choiceawards/best-fiction-books-2020. I'm trying to get all the titles from attributes such as title="The Midnight Library by Matt Haig", but so far I only get back empty output and no result. Can anyone advise? Thanks.

What I've got so far:

from bs4 import BeautifulSoup as bs
import requests

url = "https://www.goodreads.com/choiceawards/best-fiction-books-2020"
page = requests.get(url)
soup = bs(page.content, 'html.parser')

soup.find_all('a', class_='pollAnswer__bookLink')

for book in soup.find_all('a', {'class': 'pollAnswer__bookLink'}):
    print(book.get_text())

CodePudding user response：

First, iterate over the tags you found and access the <img> tag nested in each one; then use get() to read the value of its title attribute.

from bs4 import BeautifulSoup as bs
import requests

url = "https://www.goodreads.com/choiceawards/best-fiction-books-2020"
page = requests.get(url)
soup = bs(page.content, "html.parser")

result = soup.find_all("a", class_="pollAnswer__bookLink")


for book in result:
    print(book.img.get("title"))

Output:

The Midnight Library by Matt Haig
Anxious People by Fredrik Backman
American Dirt by Jeanine Cummins
Such a Fun Age by Kiley Reid
My Dark Vanessa by Kate Elizabeth Russell
The Glass Hotel by Emily St. John Mandel
Transcendent Kingdom by Yaa Gyasi
The Girl with the Louding Voice by Abi Daré
Dear Edward by Ann Napolitano
Big Summer by Jennifer Weiner
Writers & Lovers by Lily King
If I Had Your Face by Frances Cha
A Burning by Megha Majumdar
Luster by Raven Leilani
In an Instant by Suzanne Redfearn
Oona Out of Order by Margarita Montimore
The Death of Vivek Oji by Akwaeke Emezi
Homeland Elegies by Ayad Akhtar
Real Life by Brandon  Taylor
Migrations by Charlotte McConaghy
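
If only the book name is wanted, each "Title by Author" string can be split afterwards. The following is a minimal sketch, not part of the original answer: split_title_author is a hypothetical helper, and it assumes the last " by " in the string separates the title from the author, which would fail for an author name that itself contains " by ".

```python
def split_title_author(label):
    """Split 'Title by Author' on the last ' by ' (hypothetical helper)."""
    title, sep, author = label.rpartition(" by ")
    if not sep:
        # No ' by ' found: treat the whole string as the title.
        return label.strip(), None
    # Collapse repeated spaces such as the one in 'Brandon  Taylor'.
    return title.strip(), " ".join(author.split())


labels = [
    "The Midnight Library by Matt Haig",
    "Real Life by Brandon  Taylor",
]
for label in labels:
    title, author = split_title_author(label)
    print(title, "|", author)
```

Using rpartition rather than split keeps titles that contain the word "by" intact, since only the last occurrence is treated as the separator.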

CodePudding user response：

Web scraping means extracting data directly from websites, which are typically served over protocols such as HTTP or HTTPS. In Python, the BeautifulSoup library makes it easy to extract that data from a page's HTML.

First, iterate over the tags you found and access the <img> tag in each one; then use get() to read the values of the tag's attributes.

CodePudding user response：

What happens?

You're very close to the solution. The main issue is that the text you are looking for is not provided as human-readable text; it is stored as the value of the title attribute of the <img>.
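
A small offline sketch of that behaviour, for illustration only. The HTML snippet is an assumption that mimics the structure described in the question (an <a> with class pollAnswer__bookLink wrapping an <img> with a title attribute); the real page markup may differ in detail.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the structure described in the question.
html = """
<a class="pollAnswer__bookLink" href="#">
  <img title="The Midnight Library by Matt Haig">
</a>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", class_="pollAnswer__bookLink")

# The <a> holds nothing but whitespace between its tags,
# so get_text() looks empty once stripped.
visible_text = link.get_text()
print(repr(visible_text.strip()))  # ''

# The book name is stored in an attribute of the nested <img>.
title = link.img["title"]
print(title)
```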

How to fix?

Simply select the <img> in your iteration and get its title by changing:

print(book.get_text())

to

print(book.img['title'])

or

print(book.img.get('title'))

Note: The difference between the two is that .get() returns None if the attribute is not available, while direct selection via [attr] raises a KeyError.
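
The difference can be checked on a small hand-written snippet. This is only a sketch: the markup is made up, with the second image deliberately lacking a title attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <img> has no title attribute.
html = (
    '<a class="pollAnswer__bookLink"><img title="Luster by Raven Leilani"></a>'
    '<a class="pollAnswer__bookLink"><img src="cover.jpg"></a>'
)
soup = BeautifulSoup(html, "html.parser")
images = [a.img for a in soup.find_all("a", class_="pollAnswer__bookLink")]

# .get() returns None for the missing attribute instead of failing.
with_get = [img.get("title") for img in images]
print(with_get)  # ['Luster by Raven Leilani', None]

# Direct selection raises KeyError for the missing attribute.
try:
    images[1]["title"]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```

In a loop over a real page, .get() is usually the safer choice if some links might not contain a titled image, since a single missing attribute will not stop the whole run.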

Example

from bs4 import BeautifulSoup as bs
import requests

url = "https://www.goodreads.com/choiceawards/best-fiction-books-2020"
page = requests.get(url)
soup = bs(page.content, 'html.parser')

# Each matching <a> wraps an <img> whose title attribute holds the book name
for book in soup.find_all('a', {'class': 'pollAnswer__bookLink'}):
    print(book.img['title'])

Output

The Midnight Library by Matt Haig
Anxious People by Fredrik Backman
American Dirt by Jeanine Cummins
Such a Fun Age by Kiley Reid
My Dark Vanessa by Kate Elizabeth Russell
The Glass Hotel by Emily St. John Mandel
Transcendent Kingdom by Yaa Gyasi
The Girl with the Louding Voice by Abi Daré
Dear Edward by Ann Napolitano
Big Summer by Jennifer Weiner
...