I'm essentially trying to mimic the way IMPORTHTML works in Google Sheets.
I successfully scraped the tables I needed in Sheets with very basic functions; you can see them in the green cells on the "Wave Data" and "Tide Data" sheets.
I'm looking to do the same here. Really, I just want to figure out how to scrape those two data sets. I want to do it in Python because I ultimately plan on looping through several links daily, and IMPORTHTML won't be able to handle that.
I thought this would give me the first 8 rows of the main table, but no luck.
import requests
from bs4 import BeautifulSoup
url = "https://magicseaweed.com/Belmar-Surf-Report/3683/"
html = requests.get(url).content
soup = BeautifulSoup(html)
table = soup.select_one(
    "table table-primary table-forecast allSwellsActive msw-js-table ")
for row in table.find_all("tr")[:8]:
    td1 = row.find_all("td")
    print(td1)
Any thoughts on how I can get the same output I'm getting in sheets, via Python?
CodePudding user response:
If you want to select by multiple classes, you need to chain them with '.', the CSS class selector, instead of separating them with spaces (a space is the descendant combinator, so your original selector looked for a <table> nested inside other elements):
table = soup.select_one("table.table-primary.table-forecast.allSwellsActive.msw-js-table")
CodePudding user response:
If you want to achieve your goal without using Google Spreadsheet, how about the following sample script?
Sample script:
import io
import re
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://magicseaweed.com/Belmar-Surf-Report/3683/"
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
# table 1
regex = re.compile("^table table-primary.*")
table1 = soup.find("table", {"class": regex})
df1 = pd.read_html(io.StringIO(str(table1)))[0].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]]
res1 = [df1.columns.values.tolist(), *df1.values.tolist()]
print(res1)
# table 2
tables = soup.find_all("table")
table2 = tables[0]
df2 = pd.read_html(io.StringIO(str(table2)))[0]
res2 = df2.values.tolist()
print(res2)
# table 3
table3 = tables[1]
df3 = pd.read_html(io.StringIO(str(table3)))[0]
res3 = df3.values.tolist()
print(res3)
- In this sample script, the retrieved HTML tables are parsed using pandas.
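- As a variation, pandas can also locate the forecast table on its own via read_html's match argument, which keeps only tables whose text matches the given string or regex. A sketch, assuming "Swell" occurs only in the forecast table's text (the main script's output is shown under "Result" below):
import io
import requests
import pandas as pd

url = "https://magicseaweed.com/Belmar-Surf-Report/3683/"
html = requests.get(url).text
# match filters the returned tables by their text content
df1 = pd.read_html(io.StringIO(html), match="Swell")[0]
print(df1.head())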
Result:
print(res1)
This is for "Wave Data" sheet.
[
['Unnamed: 0', 'Surf', 'Swell Rating', 'Primary Swell', 'Primary Swell.1', 'Secondary Swell', 'Secondary Swell.1', 'Wind', 'Weather.1'],
['Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12'],
['12am', '0.8-1.2m', nan, '1m', '9s', '0.4m', '5s', '18 35 kph', '11°c'],
['3am', '0.7-1.1m', nan, '0.9m', '9s', '0.4m', '5s', '24 45 kph', '11°c'],
['6am', '0.7-1.1m', nan, '0.8m', '9s', '0.3m', '5s', '26 42 kph', '9°c'],
['9am', '0.6-1m', nan, '0.7m', '9s', '0.3m', '11s', '22 41 kph', '10°c'],
['Noon', '0.6-1m', nan, '0.7m', '9s', '0.3m', '11s', '24 34 kph', '11°c'],
['3pm', '0.7-1m', nan, '0.8m', '9s', '0.2m', '5s', '22 33 kph', '11°c'],
['6pm', '0.7-1.1m', nan, '0.8m', '11s', '0.1m', '5s', '21 32 kph', '10°c'],
['9pm', '0.8-1.2m', nan, '0.9m', '11s', '0.1m', '5s', '25 31 kph', '8°c']
]
print(res2)
This is for cell "A1" of the "Tide Data" sheet.
[
['Low', '12:37AM', '-0.01m'],
['High', '6:59AM', '1.56m'],
['Low', '1:27PM', '-0.06m'],
['High', '7:28PM', '1.25m']
]
print(res3)
This is for cell "D1" of the "Tide Data" sheet.
[
['First Light', '6:34AM'],
['Sunrise', '7:04AM'],
['Sunset', '4:31PM'],
['Last Light', '5:02PM']
]
Note:
- When I tested this script, the units were m, kph, and °c instead of ft, mph, and °f, respectively. I thought this might depend on the region; in my case, it's Japan. So, if your retrieved units are different from your sample Spreadsheet, please modify the script accordingly.
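- If you would rather normalize the units in Python, here is a minimal sketch that converts the metric surf strings to feet. It assumes the "0.8-1.2m" / "1m" formats shown in res1 above; adapt the same idea for kph and °c:
import re

def m_to_ft(value):
    # Convert strings like "0.8-1.2m" or "1m" to feet; pass anything else through.
    if not isinstance(value, str) or not value.endswith("m"):
        return value
    nums = re.findall(r"\d+(?:\.\d+)?", value)
    if not nums:
        return value
    return "-".join(f"{float(n) * 3.28084:.1f}" for n in nums) + "ft"

df1["Surf"] = df1["Surf"].map(m_to_ft)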
CodePudding user response:
Pandas can do the extraction for you; it just needs a bit of post-processing. The request needs some headers, as the page forbids direct scraping.
import io
import requests
import pandas as pd
url = 'https://magicseaweed.com/Belmar-Surf-Report/3683/'
header = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
response = requests.get(url, headers=header)
dfs = pd.read_html(io.StringIO(response.text))
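That returns one DataFrame per <table> on the page. As a minimal sketch of the post-processing step (the indices below are an assumption; print the shapes first to confirm which DataFrame holds which table):
# inspect what read_html found
for i, df in enumerate(dfs):
    print(i, df.shape)

# e.g. turn the tide table into plain rows, as in the previous answer
tide = dfs[0].values.tolist()
print(tide)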