BeautifulSoup scraping returns no data

Time:09-26

I want to scrape a page from a website that includes the following HTML:

<div>
<div class="item odd" ng-bind="1">First item</div>
<div class="item" ng-bind="2">Second item</div>
<div class="item-alt" ng-bind="3">Third item</div>
</div>

Here is my code:

from bs4 import BeautifulSoup
from urllib import request

my_url = 'https://some.site/some/file.html?param=value'

with request.urlopen(my_url) as r:
    soup = BeautifulSoup(r.read(), "html.parser")

result = soup.findAll('item')
print(result)

But I get an empty list as a result ([]). I also tried:

result = soup.find('item')
print(result)

But that prints None.

Why doesn't my code find the items? I can see the items on the page in my browser, so I know they are there.

CodePudding user response:

The above is a very common type of question about web-scraping in general and BeautifulSoup in particular. The problem is usually one of the following (explained below):

  • trying to match a class, but using the syntax to match an element / tag
  • trying to match part of the class name the element actually has
  • trying to match a single class, when the elements needed have multiple
  • trying to match elements that don't get loaded in the script (but do get loaded in the browser)

Another common problem is the page not actually loading, i.e. an HTTP response status other than 200 being returned. A status code of 403 indicates that access is not allowed and may be resolved by adding headers or cookies. A status code of 500 indicates a server-side problem, which may be caused by the request itself.

It's also possible that a response is only correct after previous pages have been visited; again, providing the correct headers or cookies may resolve that.
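As a minimal sketch of checking the status and supplying headers with urllib (the User-Agent and cookie values are placeholders for illustration, not values from the question):

```python
from urllib import request
from urllib.error import HTTPError

def fetch(url, headers=None):
    """Return (status, body) for a URL; HTTP errors are returned, not raised."""
    req = request.Request(url, headers=headers or {})
    try:
        with request.urlopen(req) as r:
            return r.status, r.read()
    except HTTPError as e:
        # urlopen raises on 4xx/5xx responses; catch to inspect the status
        return e.code, e.read()

# A browser-like User-Agent often resolves a 403; the cookie value is a placeholder
headers = {'User-Agent': 'Mozilla/5.0', 'Cookie': 'session=...'}
```

Printing the status before parsing makes it obvious when an empty result is caused by the request rather than by the matching code.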

Matching a tag instead of a class

Where the code above reads:

result = soup.findAll('item')

If it instead read:

result = soup.findAll('div')

There would be at least 4 matches: the first being the outer div with all its contents, then each of the 3 inner divs separately.

To actually match divs with the item class, the code would have to be:

result = soup.findAll('div', {"class": "item"})

To match multiple tag types with that class, for example both div and td:

result = soup.findAll(['div', 'td'], {"class": "item"})
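Putting those together as a self-contained sketch (the sample markup mirrors the question's structure, with the class names item, odd and item-alt used throughout this answer):

```python
from bs4 import BeautifulSoup

# Sample markup mirroring the question's structure
html = '''
<div>
  <div class="item odd" ng-bind="1">First item</div>
  <div class="item" ng-bind="2">Second item</div>
  <div class="item-alt" ng-bind="3">Third item</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.findAll('item')))                    # 0: there is no <item> tag
print(len(soup.findAll('div')))                     # 4: outer div plus 3 inner divs
print(len(soup.findAll('div', {'class': 'item'})))  # 2: exact class matches only
```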

Matching partial class name

In the previous example, one div still would not be matched with:

result = soup.findAll('div', {"class": "item"})  # only 2 results

That's because the third div actually has the class item-alt, which starts with item but does not match it exactly.

To match partial classnames, you should make use of the fact that .findAll() accepts regular expressions and functions as values to compare attributes (like class) to:

import re

result = soup.findAll('div', {"class": re.compile('item.*')})  # finds all 3
result = soup.findAll('div', {"class": lambda c: c and c.startswith('item')})  # also finds all 3; the 'c and' guards against tags with no class

Regular expressions and lambdas are very powerful, and there are plenty of tutorials and documentation on how to use them. Neither requires installing third-party packages: re is part of the Python standard library, and lambdas are part of the language itself.
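As a runnable sketch of both approaches (the class names match the ones the question's example implies):

```python
import re
from bs4 import BeautifulSoup

html = ('<div class="item odd">First item</div>'
        '<div class="item">Second item</div>'
        '<div class="item-alt">Third item</div>')
soup = BeautifulSoup(html, 'html.parser')

# The regex is tried against each class value separately; the lambda's 'c and'
# guards against tags that have no class attribute at all (c would be None)
by_regex = soup.findAll('div', {'class': re.compile('item.*')})
by_func = soup.findAll('div', {'class': lambda c: c and c.startswith('item')})
print(len(by_regex), len(by_func))  # 3 3
```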

Matching multiple classes

If only the odd classes need to be matched, the following does not work:

result = soup.findAll('div', {"class": ["item", "odd"]})

This instead matches any element that has either the item or the odd class, so in the question's example it would match both of the first 2 inner divs. Think of it as selecting one class or the other.

To only match the first (which has both classes) using .select() is a good option:

result = soup.select('div.item.odd')  # .odd.item would also work

This selects only div elements that have the one class and the other.

Using .select() could also work for some of the problems above, but it lacks some of the options of .findAll(), and it may perform differently. It's mainly useful if you can express the search as a CSS selector, and you need to keep in mind that support for pseudo classes is very limited.
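A short sketch contrasting the two behaviours (class names as in the question's example):

```python
from bs4 import BeautifulSoup

html = ('<div class="item odd">First item</div>'
        '<div class="item">Second item</div>')
soup = BeautifulSoup(html, 'html.parser')

both = soup.select('div.item.odd')                        # AND: both classes required
either = soup.findAll('div', {'class': ['item', 'odd']})  # OR: either class matches
print(len(both), len(either))  # 1 2
```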

Matching elements that don't get loaded

Even if your BeautifulSoup code is perfect, you may still not get the results you expect, even though you can see the elements when viewing the page in a browser.

This is because most users will try to load the html using urllib (like the example above) or a third party library like requests. Both will load the html just fine, but neither will execute any scripts that would be loaded with the page and executed by a browser.

If the elements you are trying to scrape are generated from JavaScript, or loaded after the page has loaded and updated in the document with JavaScript, they won't be available in the loaded html itself.

The ng-bind attributes in the example above are a clear indication that this may be the case here, since they indicate that the page uses Angular. Pages built with other web frameworks may show similar attributes. In general, you should load the html and save or print it, to check whether the elements you are trying to match are actually present in the html itself.
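A minimal sketch of that inspection step (my_url is the question's URL; the marker text is whatever you expect to see on the page):

```python
from urllib import request

def save_for_inspection(html, path='page.html'):
    """Save the HTML the script actually received so it can be inspected."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)
    return 'First item' in html  # marker text visible in the browser

# Usage:
# with request.urlopen(my_url) as r:
#     print(save_for_inspection(r.read().decode('utf-8', errors='replace')))
# False here suggests the content is added by JavaScript after the page loads
```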

If not, a solution using a third party library like selenium may be required. Selenium allows a Python script to 'puppeteer' a browser, either while you can see it, or invisibly in the background. It will load your page, and you can even use Selenium to interact with the page (click buttons and links, fill out values, etc.)

A simple example matching the code above:

from bs4 import BeautifulSoup
from selenium import webdriver
from os import environ, pathsep

environ["PATH"] += pathsep + './bin'  # directory containing the browser driver
browser = webdriver.Firefox()

browser.get(my_url)
soup = BeautifulSoup(browser.page_source, "html.parser")

This example uses Firefox, but you can use most common browsers, as long as you download the matching selenium browser driver for it, and put it in a location you add to the path (./bin in this example).
