How to avoid "No Tables Found" while web-scraping a list of URLs in a loop?


I know this has to do with looping the URLs, but I thought my for loop was correct.

The code that some amazing person helped me with to scrape specific tables from ONE site is here:

import io
import re
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://magicseaweed.com/Belmar-Surf-Report/3683/"
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")

# table 1
regex = re.compile("^table table-primary.*")
table1 = soup.find("table", {"class": regex})
df1 = pd.read_html(io.StringIO(str(table1)))[0].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]]
res1 = [df1.columns.values.tolist(), *df1.values.tolist()]
# print(res1)

# table 2
tables = soup.findAll("table")
table2 = tables[0]
df2 = pd.read_html(io.StringIO(str(table2)))[0]
res2 = df2.values.tolist()
# print(res2)

# table 3
table3 = tables[1]
df3 = pd.read_html(io.StringIO(str(table3)))[0]
res3 = df3.values.tolist()
print(res3)

It's amazing. I wanted to build off it, though, and scrape these three tables from MULTIPLE URLs. I added a for loop with the ID list, but I can't understand why I'm getting a "No Tables Found" result. I'm trying to learn this stuff better - can someone explain why this is happening / what I'm doing wrong? I feel like I'm so close, but I'm stuck.

import io
import re
import pandas as pd
import requests
from bs4 import BeautifulSoup


id_list = [
    '/Belmar-Surf-Report/3683',
    '/Manasquan-Surf-Report/386/',
    '/Ocean-Grove-Surf-Report/7945/',
    '/Asbury-Park-Surf-Report/857/',
    '/Avon-Surf-Report/4050/',
    '/Bay-Head-Surf-Report/4951/',
    '/Belmar-Surf-Report/3683/',
    '/Boardwalk-Surf-Report/9183/',
]


for x in id_list:

    url = 'https://magicseaweed.com' + x
    html = requests.get(url).content
    soup = BeautifulSoup(html, "html.parser")

    # table 1
    regex = re.compile("^table table-primary.*")
    table1 = soup.find("table", {"class": regex})
    df1 = pd.read_html(io.StringIO(str(table1)))[0].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]]
    res1 = [df1.columns.values.tolist(), *df1.values.tolist()]
    print(res1)

    # table 2
    tables = soup.findAll("table")
    table2 = tables[0]
    df2 = pd.read_html(io.StringIO(str(table2)))[0]
    res2 = df2.values.tolist()
    print(res2)

    # table 3
    table3 = tables[1]
    df3 = pd.read_html(io.StringIO(str(table3)))[0]
    res3 = df3.values.tolist()
    print(res3)

CodePudding user response:

Check your URLs - they need to be syntactically valid to get the right response back. The first one in your list is missing its trailing slash, so it returns a 404, and a 404 page has no tables for pandas to find.
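
To see which entry is the problem before pandas runs at all, you can print the response status inside the loop first - a minimal sketch using requests' status_code (the two-entry list here is just for illustration):

import requests

# one id without and one with the trailing slash, for illustration
id_list = [
    '/Belmar-Surf-Report/3683',   # no trailing slash -> 404
    '/Belmar-Surf-Report/3683/',  # valid
]

for x in id_list:
    resp = requests.get('http://magicseaweed.com' + x)
    # a 404 page contains no <table> elements, which is why
    # pd.read_html later fails with "No tables found"
    print(resp.status_code, resp.url)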

Also try to keep your script simpler - there is no need to use BeautifulSoup separately here, since pandas handles the HTML parsing under the hood.

In newer code, avoid the old findAll() syntax; use find_all() or select() with CSS selectors instead - for more, take a minute to check the docs.
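
For instance, the table lookups from the question can be written with the newer API like this - a stand-alone sketch on a tiny HTML string so the selector behaviour is easy to verify:

from bs4 import BeautifulSoup

html = '<table class="table table-primary forecast"></table><table></table>'
soup = BeautifulSoup(html, 'html.parser')

tables = soup.find_all('table')  # snake_case replacement for findAll()
primary = soup.select('table[class^="table table-primary"]')  # css selector instead of a class regex
print(len(tables), len(primary))  # 2 1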

Example

import pandas as pd
import requests

id_list = [
    '/Belmar-Surf-Report/3683/',
    '/Manasquan-Surf-Report/386/',
    '/Ocean-Grove-Surf-Report/7945/',
    '/Asbury-Park-Surf-Report/857/',
    '/Avon-Surf-Report/4050/',
    '/Bay-Head-Surf-Report/4951/',
    '/Belmar-Surf-Report/3683/',
    '/Boardwalk-Surf-Report/9183/',
]

for x in id_list:
    print(pd.read_html(requests.get('http://magicseaweed.com' + x).text)[2].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]])
    print(pd.read_html(requests.get('http://magicseaweed.com' + x).text)[0])
    print(pd.read_html(requests.get('http://magicseaweed.com' + x).text)[1])
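
If some IDs in a longer list might still be wrong, a variant that fetches each page once and catches the ValueError pandas raises on table-less pages keeps the loop alive - a sketch reusing the imports and id_list from the example above:

for x in id_list:
    try:
        # fetch and parse the page once instead of three times
        tables = pd.read_html(requests.get('http://magicseaweed.com' + x).text)
    except ValueError as err:  # pd.read_html raises ValueError("No tables found")
        print(f'skipped {x}: {err}')
        continue
    print(tables[2].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]])
    print(tables[0])
    print(tables[1])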