I'm essentially trying to mimic the way IMPORTHTML works in Google Sheets.
I successfully scraped the tables I needed in Sheets with very basic functions; you can see them in the green cells on the "Wave Data" and "Tide Data" sheets.
I'm looking to do the same here. Really, I just want to figure out how to scrape those two data sets. I want to do it in Python because I ultimately plan on looping through several links daily, and IMPORTHTML won't be able to handle that.
I thought this would give me the first 8 rows of the main table, but no luck.
import requests
from bs4 import BeautifulSoup
url = "https://magicseaweed.com/Belmar-Surf-Report/3683/"
html = requests.get(url).content
soup = BeautifulSoup(html)
table = soup.select_one(
    "table table-primary table-forecast allSwellsActive msw-js-table ")
for row in table.find_all("tr")[:8]:
    td1 = row.find_all("td")
    print(td1)
Any thoughts on how I can get the same output I'm getting in sheets, via Python?
CodePudding user response:
If you want to select by multiple classes, you need to chain them with '.', the CSS class selector, instead of separating them with spaces (a space is the descendant combinator, so your original selector looked for a <table> nested inside other elements):
table = soup.select_one("table.table-primary.table-forecast.allSwellsActive.msw-js-table")
CodePudding user response:
If you want to achieve your goal without using Google Spreadsheet, how about the following sample script?
Sample script:
import io
import re
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://magicseaweed.com/Belmar-Surf-Report/3683/"
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
# table 1
regex = re.compile("^table table-primary.*")
table1 = soup.find("table", {"class": regex})
df1 = pd.read_html(io.StringIO(str(table1)))[0].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]]
res1 = [df1.columns.values.tolist(), *df1.values.tolist()]
print(res1)
# table 2
tables = soup.find_all("table")
table2 = tables[0]
df2 = pd.read_html(io.StringIO(str(table2)))[0]
res2 = df2.values.tolist()
print(res2)
# table 3
table3 = tables[1]
df3 = pd.read_html(io.StringIO(str(table3)))[0]
res3 = df3.values.tolist()
print(res3)
- In this sample script, the retrieved HTML tables are parsed using pandas.
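- As a variation, pandas can also locate the forecast table on its own via read_html's match argument, which keeps only tables whose text matches the given string or regex. A sketch, assuming "Swell" occurs only in the forecast table's text (the main script's output is shown under "Result" below):
import io
import requests
import pandas as pd

url = "https://magicseaweed.com/Belmar-Surf-Report/3683/"
html = requests.get(url).text
# match filters the returned tables by their text content
df1 = pd.read_html(io.StringIO(html), match="Swell")[0]
print(df1.head())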
Result:
print(res1)
This is for "Wave Data" sheet.
[
['Unnamed: 0', 'Surf', 'Swell Rating', 'Primary Swell', 'Primary Swell.1', 'Secondary Swell', 'Secondary Swell.1', 'Wind', 'Weather.1'],
['Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12'],
['12am', '0.8-1.2m', nan, '1m', '9s', '0.4m', '5s', '18 35 kph', '11°c'],
['3am', '0.7-1.1m', nan, '0.9m', '9s', '0.4m', '5s', '24 45 kph', '11°c'],
['6am', '0.7-1.1m', nan, '0.8m', '9s', '0.3m', '5s', '26 42 kph', '9°c'],
['9am', '0.6-1m', nan, '0.7m', '9s', '0.3m', '11s', '22 41 kph', '10°c'],
['Noon', '0.6-1m', nan, '0.7m', '9s', '0.3m', '11s', '24 34 kph', '11°c'],
['3pm', '0.7-1m', nan, '0.8m', '9s', '0.2m', '5s', '22 33 kph', '11°c'],
['6pm', '0.7-1.1m', nan, '0.8m', '11s', '0.1m', '5s', '21 32 kph', '10°c'],
['9pm', '0.8-1.2m', nan, '0.9m', '11s', '0.1m', '5s', '25 31 kph', '8°c']
]
print(res2)
This is for cell "A1" of the "Tide Data" sheet.
[
['Low', '12:37AM', '-0.01m'],
['High', '6:59AM', '1.56m'],
['Low', '1:27PM', '-0.06m'],
['High', '7:28PM', '1.25m']
]
print(res3)
This is for cell "D1" of the "Tide Data" sheet.
[
['First Light', '6:34AM'],
['Sunrise', '7:04AM'],
['Sunset', '4:31PM'],
['Last Light', '5:02PM']
]
Note:
- When I tested this script, the units were m, kph, and °c instead of ft, mph, and °f, respectively. I thought this might depend on the region; in my case, it's Japan. So, if your retrieved units are different from your sample Spreadsheet, please modify the script accordingly.
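- If you would rather normalize the units in Python, here is a minimal sketch that converts the metric surf strings to feet. It assumes the "0.8-1.2m" / "1m" formats shown in res1 above; adapt the same idea for kph and °c:
import re

def m_to_ft(value):
    # Convert strings like "0.8-1.2m" or "1m" to feet; pass anything else through.
    if not isinstance(value, str) or not value.endswith("m"):
        return value
    nums = re.findall(r"\d+(?:\.\d+)?", value)
    if not nums:
        return value
    return "-".join(f"{float(n) * 3.28084:.1f}" for n in nums) + "ft"

df1["Surf"] = df1["Surf"].map(m_to_ft)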
CodePudding user response:
Pandas can do the extraction for you; it just needs a bit of post-processing. The request needs some headers, as the page forbids direct scraping.
import io
import requests
import pandas as pd
url = 'https://magicseaweed.com/Belmar-Surf-Report/3683/'
header = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
response = requests.get(url, headers=header)
dfs = pd.read_html(io.StringIO(response.text))
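That returns one DataFrame per <table> on the page. As a minimal sketch of the post-processing step (the indices below are an assumption; print the shapes first to confirm which DataFrame holds which table):
# inspect what read_html found
for i, df in enumerate(dfs):
    print(i, df.shape)

# e.g. turn the tide table into plain rows, as in the previous answer
tide = dfs[0].values.tolist()
print(tide)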