Web Scraping Unordered List Issue


I'm working through a book called "Learn Python by Building Data Science Applications", and there's a chapter on web scraping, which I admit I haven't played with before. I've reached a section that discusses unordered lists and how to work with them, and my code is generating an error that doesn't make sense to me:

Traceback (most recent call last):
  File "/Users/gillian/100-days-of-code/Learn-Python-by-Building-Data-Science-Applications/Chapter07/wiki2.py", line 77, in <module>
    list_element = front.find_next_siblings("div", "div-col columns column-width")[0].ul
IndexError: list index out of range

My first thought was that there simply wasn't an unordered list on the page anymore, but I checked, and... there is. My interpretation of the error is that the search isn't returning the list, but I'm having trouble figuring out how to test that, and I'll admit recursion makes me dizzy and isn't my best area.

My full code is attached below (including the notes I took, hence the giant blocks of comments).

'''scrapes list of WWII battles'''
import requests as rq

base_url = 'https://en.wikipedia.org/wiki/List_of_World_War_II_battles'
response = rq.get(base_url)

'''access the raw content of a page with response.content'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

def get_dom(url):
    response = rq.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.content, 'html.parser')
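# (get_dom isn't actually called below, but note raise_for_status(): it makes
#  a failed request raise an HTTPError instead of silently parsing an error page)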

'''3 ways to search for an element:
    1. find
    2. find_all
    3. select

for 1 and 2 you pass an object type and, optionally, attributes

a recursive argument defines if the search should be recursive
First method retrieves first occurrence
Second method will always return a list with all elements
select will return a list and expects you to pass a single CSS selector string

this makes select easier to use, sometimes
'''
content = soup.select('div#mw-content-text > div.mw-parser-output', limit=1)[0]
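# for comparison, the same lookup with find instead of select (a sketch; both
# return the same div - select just takes a single CSS selector string):
# content = soup.find('div', id='mw-content-text').find('div', 'mw-parser-output')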

'''
collect corresponding elements for each front, which are all h2 headers

all fronts are sections - each with a title in h2 but hierarchically the titles are not nested within the sections

last title is citations and notes

one way is to just drop the last element or we can use a CSS Selector trick, which is to specify :not(:last-of-type) but that is less readable

'''
fronts = content.select('div.mw-parser-output>h2')[:-1]
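# the :not() trick from the note above, as a pure-selector alternative:
# fronts = content.select('div.mw-parser-output > h2:not(:last-of-type)')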

for el in fronts:
    print(el.text[:-6])
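    # ([:-6] trims the '[edit]' suffix Wikipedia appends to every section title)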

'''getting the corresponding ul lists for each header

bs4 has a find_next_siblings method that works like find_all except that it will look in the document after each element

to get this all simultaneously, we'll need to use recursion
'''

def dictify(ul, level=0):
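    # builds a nested dict from one ul: each li becomes
    # {name: {'url', 'time', 'level', 'children'?}}; a nested ul inside an li
    # triggers the recursive call below with level + 1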
    result = dict()
    for li in ul.find_all("li", recursive=False):
        text = li.stripped_strings
        key = next(text)
        try:
            time = next(text).replace(':', '').strip()
        except StopIteration:
            time = None
        ul, link = li.find("ul"), li.find('a')
        if link:
            link = _abs_link(link.get('href'))
        r = {'url': link,
             'time': time,
             'level': level}
        if ul:
            r['children'] = dictify(ul, level=(level + 1))
        result[key] = r
    return result

theaters = {}

for front in fronts:
    list_element = front.find_next_siblings("div", "div-col columns column-width")[0].ul
    theaters[front.text[:-6]] = dictify(list_element)

If anyone has any input about how I can proceed to troubleshoot this, I'd really appreciate it. Thanks.

CodePudding user response:

The error means that .find_next_siblings didn't find anything: Wikipedia's markup has evidently changed since the book was written, and no sibling div on the current page carries the exact class string "div-col columns column-width", so the search returns an empty list and indexing it with [0] raises the IndexError. Filtering on the single class "div-col" does match: front.find_next_siblings("div", "div-col").
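You can confirm that before changing anything; a quick diagnostic sketch (the names match your script):

for front in fronts:
    # the exact multi-class string from the book no longer matches any div...
    old = front.find_next_siblings("div", "div-col columns column-width")
    # ...while filtering on the single class "div-col" does
    new = front.find_next_siblings("div", "div-col")
    print(front.text[:-6], "|", len(old), "vs", len(new), "siblings found")

Also, _abs_link() isn't defined anywhere in the code you posted, so I removed it. Full working version: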

"""scrapes list of WWII battles"""
import requests as rq

base_url = "https://en.wikipedia.org/wiki/List_of_World_War_II_battles"
response = rq.get(base_url)

"""access the raw content of a page with response.content"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, "html.parser")


def get_dom(url):
    response = rq.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")


"""3 ways to search for an element:
    1. find
    2. find_all
    3. select

for 1 and 2 you pass an object type and, optionally, attributes

a recursive argument defines if the search should be recursive
First method retrieves first occurrence
Second method will always return a list with all elements
select will return a list and expects you to pass a single CSS selector string

this makes select easier to use, sometimes
"""
content = soup.select("div#mw-content-text > div.mw-parser-output", limit=1)[0]

"""
collect corresponding elements for each front, which are all h2 headers

all fronts are sections - each with a title in h2 but hierarchically the titles are not nested within the sections

last title is citations and notes

one way is to just drop the last element or we can use a CSS Selector trick, which is to specify :not(:last-of-type) but that is less readable

"""
fronts = content.select("div.mw-parser-output>h2")[:-1]

for el in fronts:
    print(el.text[:-6])

"""getting the corresponding ul lists for each header

bs4 has a find_next_siblings method that works like find_all except that it will look in the document after each element

to get this all simultaneously, we'll need to use recursion
"""


def dictify(ul, level=0):
    result = dict()
    for li in ul.find_all("li", recursive=False):
        text = li.stripped_strings
        key = next(text)
        try:
            time = next(text).replace(":", "").strip()
        except StopIteration:
            time = None
        ul, link = li.find("ul"), li.find("a")
        if link:
            link = link.get("href")
        r = {"url": link, "time": time, "level": level}
        if ul:
            r["children"] = dictify(ul, level=(level   1))
        result[key] = r
    return result


theaters = {}

for front in fronts:
    list_element = front.find_next_siblings("div", "div-col")[0].ul
    theaters[front.text[:-6]] = dictify(list_element)

print(theaters)

Prints:

{
    "African Front": {
        "North African campaign": {
            "url": "/wiki/North_African_campaign",
            "time": "June 1940 - May 1943",
            "level": 0,
            "children": {
                "Western Desert campaign": {
                    "url": "/wiki/Western_Desert_campaign",
                    "time": "June 1940 – February 1943",
                    "level": 1,
                    "children": {
                        "Italian invasion of Egypt": {
                            "url": "/wiki/Italian_invasion_of_Egypt",
                            "time": "September 1940",
                            "level": 2,
                        },
                        "Operation Compass": {
                            "url": "/wiki/Operation_Compass",
                            "time": "December 1940 – February 1941",
                            "level": 2,
                            "children": {
                                "Battle of Nibeiwa": {
                                    "url": "/wiki/Battle_of_Nibeiwa",
                                    "time": "December 1940",
                                    "level": 3,
                                },

...and so on.
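Side note: print(theaters) itself emits one long line; for readable output like the above, pretty-printing helps. A minimal sketch using the standard library (the values here are all strings, ints, and None, so they serialize cleanly):

import json
print(json.dumps(theaters, indent=4))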