I am brand new to coding, and was given a web scraping tutorial (found here) to help build my skills as I learn. I've already had to make several adjustments to the code in this tutorial, but I digress. I'm scraping off of http://books.toscrape.com/ and, when I try to export a Dataframe of just the book categories into Excel, I get a couple of issues. Note that, when exporting to csv (and then opening the file in Notepad), these issues are not present. I am working in a Jupyter Notebook in Azure Data Studio.
First, the row with all the data appears to not exist, even though it is displayed, making it so that I have to tab over to each column to go past the data that is shown in the default windowsize of Excel. Second, it only displays the first 9 results (the first being "Books," and the other 8 being the first 8 categories).
Image of desired scrape section
Here is my code:
s = Service('C:/Users/.../.../chromedriver.exe')
browser = webdriver.Chrome(service=s)
url = 'http://books.toscrape.com/'
browser.get(url)
results = []
content = browser.page_source
soup = BeautifulSoup(content)
# changes from the tutorial due to recommendations
# from StackOverflow, based on similar questions
# from error messages popping up when using original
# formatting; tutorial is outdated.
for li in soup.find_all(attrs={'class': 'side_categories'}):
name = element.find('li')
if name not in results:
results.append(name.text)
df = pd.DataFrame({'Categories': results})
df.to_excel('categories.xlsx', index=False)
# per https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.formats.style.Styler.to_excel.html
# encoding is Deprecated and apparently wasn't
# needed for any excel writer other than xlwt,
# which is no longer maintained.
Images of results:
View before tabbing further in the columns
What can I do to fix this?
Edit: Many apologies, I didn't realize I have copied an older, incorrect version of my code blocks. Should be correct now.
CodePudding user response:
The code in question will not create any dataframe. However, you should select your elements more specific for example with css selectors
:
for a in soup.select('ul.nav-list a'):
if a.get_text(strip=True) not in results:
results.append(a.get_text(strip=True))
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
results = []
soup = BeautifulSoup(requests.get('http://books.toscrape.com/').content)
for a in soup.select('ul.nav-list a'):
if a.get_text(strip=True) not in results:
results.append(a.get_text(strip=True))
pd.DataFrame({'Categories': results})
Output
Categories | |
---|---|
0 | Books |
1 | Travel |
2 | Mystery |
3 | Historical Fiction |
4 | Sequential Art |
5 | Classics |
6 | Philosophy |
...