Home > OS >  WebScraping: Pandas to_excel Not Displaying full DataFrame
WebScraping: Pandas to_excel Not Displaying full DataFrame

Time:01-20

I am brand new to coding, and was given a web scraping tutorial (found here) to help build my skills as I learn. I've already had to make several adjustments to the code in this tutorial, but I digress. I'm scraping off of http://books.toscrape.com/ and, when I try to export a Dataframe of just the book categories into Excel, I get a couple of issues. Note that, when exporting to csv (and then opening the file in Notepad), these issues are not present. I am working in a Jupyter Notebook in Azure Data Studio.

First, the row with all the data appears to not exist, even though it is displayed, making it so that I have to tab over to each column to go past the data that is shown in the default windowsize of Excel. Second, it only displays the first 9 results (the first being "Books," and the other 8 being the first 8 categories).

Image of desired scrape section

Here is my code:

s = Service('C:/Users/.../.../chromedriver.exe')
browser = webdriver.Chrome(service=s)
url = 'http://books.toscrape.com/'
browser.get(url)
results = []
content = browser.page_source
soup = BeautifulSoup(content)
# changes from the tutorial due to recommendations
# from StackOverflow, based on similar questions
# from error messages popping up when using original
# formatting; tutorial is outdated. 

for li in soup.find_all(attrs={'class': 'side_categories'}):
    name = element.find('li')
    if name not in results:
        results.append(name.text)

df = pd.DataFrame({'Categories': results})
df.to_excel('categories.xlsx', index=False)
# per https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.formats.style.Styler.to_excel.html
# encoding is Deprecated and apparently wasn't
# needed for any excel writer other than xlwt,
# which is no longer maintained.

Images of results:

View before tabbing further in the columns

End of the displayed results

What can I do to fix this?

Edit: Many apologies, I didn't realize I have copied an older, incorrect version of my code blocks. Should be correct now.

CodePudding user response:

The code in question will not create any dataframe. However, you should select your elements more specific for example with css selectors:

for a in soup.select('ul.nav-list a'):
    if a.get_text(strip=True) not in results:
        results.append(a.get_text(strip=True))

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

results = []

soup = BeautifulSoup(requests.get('http://books.toscrape.com/').content)

for a in soup.select('ul.nav-list a'):
    if a.get_text(strip=True) not in results:
        results.append(a.get_text(strip=True))

pd.DataFrame({'Categories': results})

Output

Categories
0 Books
1 Travel
2 Mystery
3 Historical Fiction
4 Sequential Art
5 Classics
6 Philosophy

...

  • Related