I know this has to do with looping the URLs, but I thought my for loop was correct.
The code that some amazing person helped me with to scrape specific tables from ONE site is here:
import io
import re
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://magicseaweed.com/Belmar-Surf-Report/3683/"
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
# table 1
regex = re.compile("^table table-primary.*")
table1 = soup.find("table", {"class": regex})
df1 = pd.read_html(io.StringIO(str(table1)))[
    0].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]]
res1 = [df1.columns.values.tolist(), *df1.values.tolist()]
# print(res1)
# table 2
tables = soup.findAll("table")
table2 = tables[0]
df2 = pd.read_html(io.StringIO(str(table2)))[0]
res2 = df2.values.tolist()
# print(res2)
# table 3
table3 = tables[1]
df3 = pd.read_html(io.StringIO(str(table3)))[0]
res3 = df3.values.tolist()
print(res3)
It's amazing. I wanted to build off it, though, and scrape these three tables from MULTIPLE URLs. I added a for loop with the ID list, but I can't understand why I'm getting a "No tables found" result. I'm trying to learn this stuff better - can someone explain why this is happening / what I'm doing wrong? I feel I'm so close, but stuck.
import io
import re
import pandas as pd
import requests
from bs4 import BeautifulSoup
id_list = [
    '/Belmar-Surf-Report/3683',
    '/Manasquan-Surf-Report/386/',
    '/Ocean-Grove-Surf-Report/7945/',
    '/Asbury-Park-Surf-Report/857/',
    '/Avon-Surf-Report/4050/',
    '/Bay-Head-Surf-Report/4951/',
    '/Belmar-Surf-Report/3683/',
    '/Boardwalk-Surf-Report/9183/',
]
for x in id_list:
    url = 'https://magicseaweed.com' + x
    html = requests.get(url).content
    soup = BeautifulSoup(html, "html.parser")
    # table 1
    regex = re.compile("^table table-primary.*")
    table1 = soup.find("table", {"class": regex})
    df1 = pd.read_html(io.StringIO(str(table1)))[
        0].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]]
    res1 = [df1.columns.values.tolist(), *df1.values.tolist()]
    print(res1)
    # table 2
    tables = soup.findAll("table")
    table2 = tables[0]
    df2 = pd.read_html(io.StringIO(str(table2)))[0]
    res2 = df2.values.tolist()
    print(res2)
    # table 3
    table3 = tables[1]
    df3 = pd.read_html(io.StringIO(str(table3)))[0]
    res3 = df3.values.tolist()
    print(res3)
CodePudding user response:
Check your URLs - they need to be syntactically valid to get the right response back. The first one in your list is missing its trailing slash and goes to a 404 page, which contains no tables, so pandas raises "No tables found".
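A quick way to spot the broken one is to print the status code for each page before parsing anything - a minimal sketch using requests (id_list shortened here):

import requests

id_list = [
    '/Belmar-Surf-Report/3683',    # missing trailing slash -> 404
    '/Manasquan-Surf-Report/386/',
]

for x in id_list:
    resp = requests.get('https://magicseaweed.com' + x)
    # a 404 here means pd.read_html gets an error page with no tables in it
    print(resp.status_code, resp.url)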
Also try to keep your script simpler - there is no need for a separate pass with BeautifulSoup, since the table parsing is handled by pandas under the hood anyway. In newer code, avoid the old syntax findAll() and use find_all() or select() with CSS selectors instead - for more, take a minute to check the docs.
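For illustration, here is a minimal sketch (the HTML snippet is made up for the example) showing the legacy call and its two modern equivalents side by side:

from bs4 import BeautifulSoup

# hypothetical two-table document, just to demonstrate the APIs
html = '<table class="table table-primary"></table><table class="table"></table>'
soup = BeautifulSoup(html, "html.parser")

tables_old = soup.findAll("table")    # legacy camelCase alias
tables = soup.find_all("table")       # preferred snake_case method

# the CSS selector equivalent; classes match directly, so no regex
# like "^table table-primary.*" is needed to target the first table
primary = soup.select("table.table-primary")

print(len(tables_old), len(tables), len(primary))  # 2 2 1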
Example
import pandas as pd
import requests
id_list = [
    '/Belmar-Surf-Report/3683/',
    '/Manasquan-Surf-Report/386/',
    '/Ocean-Grove-Surf-Report/7945/',
    '/Asbury-Park-Surf-Report/857/',
    '/Avon-Surf-Report/4050/',
    '/Bay-Head-Surf-Report/4951/',
    '/Belmar-Surf-Report/3683/',
    '/Boardwalk-Surf-Report/9183/',
]
for x in id_list:
    print(pd.read_html(requests.get('http://magicseaweed.com' + x).text)[2].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]])
    print(pd.read_html(requests.get('http://magicseaweed.com' + x).text)[0])
    print(pd.read_html(requests.get('http://magicseaweed.com' + x).text)[1])
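Note that this requests each page three times, once per print. If you want to avoid the repeated downloads, one option is to fetch the HTML once per URL and index into the list of parsed tables - the same logic, just restructured (io.StringIO wraps the literal HTML, as in your original code):

import io

import pandas as pd
import requests

id_list = ['/Manasquan-Surf-Report/386/']  # same list as above, shortened here

for x in id_list:
    html = requests.get('http://magicseaweed.com' + x).text
    tables = pd.read_html(io.StringIO(html))  # parse every table in one pass

    print(tables[2].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]])
    print(tables[0])
    print(tables[1])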