I wrote this code to extract multiple pages of data from this site (base URL: "https://www.goodreads.com/shelf/show/fiction").
import requests
from bs4 import BeautifulSoup
import pandas as pd

page = 1
book_title = []
while page != 5:
    url = f'https://www.goodreads.com/shelf/show/fiction?page={page}'
    response = requests.get(url)
    page_content = response.text
    doc = BeautifulSoup(page_content, 'html.parser')
    a_tags = doc.find_all('a', {'class': 'bookTitle'})
    for tag in a_tags:
        book_title.append(tag.text)
    page = page + 1
But it's only showing the first 50 books' data. How can I extract the names of all fiction books across all pages using BeautifulSoup?
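For reference, the URL-building step can be checked offline: the page number has to be substituted into the query string, otherwise every request fetches the same page. A minimal sketch (no network access needed):

```python
# Minimal offline check of the URL-building step: the page number must be
# substituted into the query string, or every request fetches the same page.
base = 'https://www.goodreads.com/shelf/show/fiction?page={page}'

urls = [base.format(page=page) for page in range(1, 5)]
print(urls[0])   # https://www.goodreads.com/shelf/show/fiction?page=1
print(urls[-1])  # https://www.goodreads.com/shelf/show/fiction?page=4
```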
CodePudding user response:
Paginating the fiction shelf from your base URL won't get you past the first page. Instead, type the fiction keyword into the search box and click the search button; that gives you this URL: https://www.goodreads.com/search?q=fiction&qid=ydDLZMCwDJ. From there you can collect the data and build the next-page URLs.
import requests
from bs4 import BeautifulSoup
import pandas as pd

book_title = []
url = 'https://www.goodreads.com/search?page={page}&q=fiction&qid=ydDLZMCwDJ&tab=books'

for page in range(1, 11):
    response = requests.get(url.format(page=page))
    page_content = response.text
    doc = BeautifulSoup(page_content, 'html.parser')
    a_tags = doc.find_all('a', {'class': 'bookTitle'})
    for tag in a_tags:
        book_title.append(tag.get_text(strip=True))

df = pd.DataFrame(book_title, columns=['Title'])
print(df)
Output:
Title
0 Trigger Warning: Short Fictions and Disturbances
1 You Are Not So Smart: Why You Have Too Many Fr...
2 Smoke and Mirrors: Short Fiction and Illusions
3 Fragile Things: Short Fictions and Wonders
4 Collected Fictions
.. ...
195 The Science Fiction Hall of Fame, Volume One, ...
196 The Art of Fiction: Notes on Craft for Young W...
197 Invisible Planets: Contemporary Chinese Scienc...
198 How Fiction Works
199 Monster, She Wrote: The Women Who Pioneered Ho...
[200 rows x 1 columns]
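The parsing step can also be verified offline against a small snippet of the markup the answer assumes (one `<a class="bookTitle">` per result row). The HTML below is a made-up stand-in, not Goodreads' real page; since search pages can repeat books, the sketch also deduplicates before building the DataFrame:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up stand-in for the search-results markup the scraper assumes:
# one <a class="bookTitle"> anchor per result row.
html = '''
<tr><td><a class="bookTitle" href="/book/show/1"><span>Collected Fictions</span></a></td></tr>
<tr><td><a class="bookTitle" href="/book/show/2"><span>How Fiction Works</span></a></td></tr>
<tr><td><a class="bookTitle" href="/book/show/1"><span>Collected Fictions</span></a></td></tr>
'''

doc = BeautifulSoup(html, 'html.parser')
titles = [a.get_text(strip=True) for a in doc.find_all('a', {'class': 'bookTitle'})]

# Search results can repeat titles across pages, so deduplicate before saving.
df = pd.DataFrame(titles, columns=['Title']).drop_duplicates().reset_index(drop=True)
print(df)
```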