Index error: list index out of range - How to skip a broken URL?

How can I tell my program to skip broken or non-existent URLs and continue with the task? Every time I run this, it stops as soon as it encounters a URL that doesn't exist and raises the error: IndexError: list index out of range.

The range covers URLs from 1 to 450, but some of the pages in the mix are broken (for example, URL 133 doesn't exist).

import requests
import pandas as pd
import json
from pandas.io.json import json_normalize
from bs4 import BeautifulSoup

df = pd.DataFrame()

for id in range (1, 450):

      url = f"https://liiga.fi/api/v1/shotmap/2022/{id}"
      res = requests.get(url)
      soup = BeautifulSoup(res.content, "lxml")
      s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
      s = s.replace('null','"placeholder"')
      data = json.loads(s)
      data = json_normalize(data)
      matsit = pd.DataFrame(data)
      df = pd.concat([df, matsit], axis=0)


df.to_csv("matsit.csv", index=False)

CodePudding user response：

I would assume your index error comes from the line of code with the following statement:

s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')

You could solve it like this:

try:
    s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
except IndexError as IE:
    print(f"IndexError: {IE}")
    continue

If the error does not occur on the line above, just catch the exception on the line where the index error is occurring. Alternatively, you can catch all exceptions with:


try:
    code_where_exception_occurs
except Exception as e:
    print(f"Exception: {e}")
    continue

However, I would recommend being as specific as possible, so that each expected error is handled in the appropriate way. In the example above, replace code_where_exception_occurs with your own code. You could also put the try/except clause around the whole block of code inside the for loop, although it is better to catch exceptions individually. This should also work:

try:
    url = f"https://liiga.fi/api/v1/shotmap/2022/{id}"
    res = requests.get(url)
    soup = BeautifulSoup(res.content, "lxml")
    s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
    s = s.replace('null','"placeholder"')
    data = json.loads(s)
    data = json_normalize(data)
    matsit = pd.DataFrame(data)
    df = pd.concat([df, matsit], axis=0)
except Exception as e:
    print(f"Exception: {e}")
    continue
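
To illustrate the advice about catching a specific exception rather than everything, the parsing step can be isolated in a small helper. This is only a sketch under assumptions, not the asker's code: parse_payload and _JSONP are hypothetical names, and it assumes the response body is either plain JSON or JSON wrapped in a single callback call such as jQuery...(...). It uses a regular expression instead of str.strip because str.strip treats its argument as a set of characters to remove from both ends, not as a literal prefix.

```python
import json
import re

# Hypothetical pattern: an optional JSONP wrapper such as callbackName( ... )
_JSONP = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_payload(text):
    """Return the decoded JSON payload, or None if the text cannot be parsed."""
    match = _JSONP.match(text)
    # Fall back to the raw text when there is no callback wrapper.
    body = match.group(1) if match else text
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        # Only the expected parsing failure is caught; other errors still surface.
        print(f"Skipping unparsable response: {exc}")
        return None
```

Inside the loop this could be used as data = parse_payload(res.text), followed by "if data is None: continue". Since json.loads already maps null to None, the replace('null', ...) step would not be needed for parsing.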

CodePudding user response：

The main issue is that some of the URLs return a 204 (No Content) status instead of data (e.g. https://liiga.fi/api/v1/shotmap/2022/405), so simply use an if statement to check for this and handle it:

data = []
for i in range(400, 420):
    url = f"https://liiga.fi/api/v1/shotmap/2022/{i}"
    r = requests.get(url)

    if r.status_code != 200:
        print(f'Error occured: {r.status_code} on url: {url}')
        # log or do whatever you like in case of an error
    else:
        data.append(pd.json_normalize(r.json()))

Note: As already mentioned in https://stackoverflow.com/a/73584487/14460824, there is no need to use BeautifulSoup; use pandas directly instead to keep your code clean.

Example

import requests
import pandas as pd

data = []
for i in range (400, 420):
    url = f"https://liiga.fi/api/v1/shotmap/2022/{i}"
    r=requests.get(url)
    
    if r.status_code != 200:
        print(f'Error occured: {r.status_code} on url: {url}')
    else:
        data.append(pd.json_normalize(r.json()))

pd.concat(data, ignore_index=True)#.to_csv("matsit", index=False)

Output

Error occured: 204 on url: https://liiga.fi/api/v1/shotmap/2022/405
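
If you also want a record of which IDs were skipped, the status check can be wrapped in a function. The following is a minimal sketch, not part of the original answer: fetch_shotmaps and its get parameter are hypothetical names, and get is assumed to be any callable that behaves like requests.get (it returns an object with status_code and json()). Passing the callable in keeps the function easy to try out without hitting the network.

```python
def fetch_shotmaps(ids, get):
    """Collect JSON payloads for the given IDs and note every skipped ID.

    `get` is assumed to behave like requests.get: it takes a URL and returns
    an object exposing `status_code` and `json()`.
    """
    payloads = []
    skipped = []
    for i in ids:
        url = f"https://liiga.fi/api/v1/shotmap/2022/{i}"
        r = get(url)
        if r.status_code != 200:
            # e.g. 204 (No Content) for IDs that have no data
            skipped.append((i, r.status_code))
            continue
        payloads.append(r.json())
    return payloads, skipped
```

With requests and pandas imported, this could be called as payloads, skipped = fetch_shotmaps(range(1, 450), requests.get), and the result combined with pd.concat([pd.json_normalize(p) for p in payloads], ignore_index=True).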