How to print a limited number of rows from a specific HTML table using Python


I'm essentially trying to mimic the way IMPORTHTML works in Google Sheets.

I successfully scraped the tables I needed in Sheets with very basic functions; you can see them in the green cells on the "Wave Data" and "Tide Data" sheets here:

https://docs.google.com/spreadsheets/d/1mbst-uaRGHWG5ReoFfIsazx0kpY7kXKIBqsRswy1y1Q/edit#gid=1611362673

I'm looking to do the same here - really, I just want to figure out how to scrape those two data sets. I want to do it in Python because I ultimately plan on looping through several links daily and IMPORTHTML won't be able to handle it.

I thought this would give me the first 8 rows of the main table, but no luck.

import requests
from bs4 import BeautifulSoup
url = "https://magicseaweed.com/Belmar-Surf-Report/3683/"
html = requests.get(url).content

soup = BeautifulSoup(html)
table = soup.select_one(
    "table table-primary table-forecast allSwellsActive msw-js-table ")

for row in table.find_all("tr")[:8]:
    td1 = row.find_all("td")

    print(td1)

Any thoughts on how I can get the same output I'm getting in sheets, via Python?

CodePudding user response:

If you want to select an element by multiple classes, you need to chain them with '.' in the CSS selector:

table = soup.select_one("table.table-primary.table-forecast.allSwellsActive.msw-js-table")
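For completeness, the fixed selector dropped back into the original script might look like the sketch below (it assumes the class names on the live page are unchanged; note that the trailing space in the original selector also broke the match, since a space in CSS means "descendant"):

import requests
from bs4 import BeautifulSoup

url = "https://magicseaweed.com/Belmar-Surf-Report/3683/"
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")

# '.' chains classes on one element; a space would start a descendant selector
table = soup.select_one(
    "table.table-primary.table-forecast.allSwellsActive.msw-js-table")

for row in table.find_all("tr")[:8]:
    print([td.get_text(strip=True) for td in row.find_all("td")])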

CodePudding user response:

If you want to achieve your goal without using Google Spreadsheet, how about the following sample script?

Sample script:

import io
import re
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://magicseaweed.com/Belmar-Surf-Report/3683/"
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")

# table 1: the main forecast table, matched by a regex on its class names
regex = re.compile("^table table-primary.*")
table1 = soup.find("table", {"class": regex})
df1 = pd.read_html(io.StringIO(str(table1)))[0].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]]
res1 = [df1.columns.values.tolist(), *df1.values.tolist()]
print(res1)

# table 2: tide times (see res2 below)
tables = soup.find_all("table")
table2 = tables[0]
df2 = pd.read_html(io.StringIO(str(table2)))[0]
res2 = df2.values.tolist()
print(res2)

# table 3: first/last light and sunrise/sunset times (see res3 below)
table3 = tables[1]
df3 = pd.read_html(io.StringIO(str(table3)))[0]
res3 = df3.values.tolist()
print(res3)

  • In this sample script, the retrieved HTML tables are parsed with pandas; a minimal illustration of that round trip follows below.
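As a minimal illustration of that round trip (toy HTML, not from the live page):

import io
import pandas as pd

# pd.read_html parses every <table> in the given HTML and returns a list of DataFrames
html = "<table><tr><th>Tide</th><th>Time</th></tr><tr><td>Low</td><td>12:37AM</td></tr></table>"
df = pd.read_html(io.StringIO(html))[0]
print(df.values.tolist())  # [['Low', '12:37AM']]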

Result:

print(res1)

This is for the "Wave Data" sheet.

[
  ['Unnamed: 0', 'Surf', 'Swell Rating', 'Primary Swell', 'Primary Swell.1', 'Secondary Swell', 'Secondary Swell.1', 'Wind', 'Weather.1'],
  ['Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12', 'Thursday 08/12'],
  ['12am', '0.8-1.2m', nan, '1m', '9s', '0.4m', '5s', '18 35 kph', '11°c'],
  ['3am', '0.7-1.1m', nan, '0.9m', '9s', '0.4m', '5s', '24 45 kph', '11°c'],
  ['6am', '0.7-1.1m', nan, '0.8m', '9s', '0.3m', '5s', '26 42 kph', '9°c'],
  ['9am', '0.6-1m', nan, '0.7m', '9s', '0.3m', '11s', '22 41 kph', '10°c'],
  ['Noon', '0.6-1m', nan, '0.7m', '9s', '0.3m', '11s', '24 34 kph', '11°c'],
  ['3pm', '0.7-1m', nan, '0.8m', '9s', '0.2m', '5s', '22 33 kph', '11°c'],
  ['6pm', '0.7-1.1m', nan, '0.8m', '11s', '0.1m', '5s', '21 32 kph', '10°c'],
  ['9pm', '0.8-1.2m', nan, '0.9m', '11s', '0.1m', '5s', '25 31 kph', '8°c']
]

print(res2)

This is for cell "A1" of the "Tide Data" sheet.

[
  ['Low', '12:37AM', '-0.01m'],
  ['High', '6:59AM', '1.56m'],
  ['Low', '1:27PM', '-0.06m'],
  ['High', '7:28PM', '1.25m']
]

print(res3)

This is for cell "D1" of the "Tide Data" sheet.

[
  ['First Light', '6:34AM'],
  ['Sunrise', '7:04AM'],
  ['Sunset', '4:31PM'],
  ['Last Light', '5:02PM']
]

Note:

  • When I tested this script, the units were m, kph, and °c instead of ft, mph, and °f. This seems to depend on the region the request is made from (in my case, Japan). If the units you retrieve differ from your sample Spreadsheet, adjust accordingly; a small conversion sketch follows below.
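If you end up with metric values but want the imperial units shown in the Spreadsheet, a conversion pass could look like this sketch (standard conversion factors; parsing strings such as '0.8-1.2m' into numbers is left out):

# assumed helper functions, not part of the scraping script above
def m_to_ft(m):
    return m * 3.28084

def kph_to_mph(kph):
    return kph * 0.621371

def c_to_f(c):
    return c * 9 / 5 + 32

print(round(m_to_ft(1.0), 2))    # 3.28
print(round(kph_to_mph(35), 1))  # 21.7
print(round(c_to_f(11), 1))      # 51.8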

CodePudding user response:

Pandas can do the extraction for you; it just needs a bit of post-processing. The request needs browser-like headers, since the page blocks plain scraping requests.

import requests
import pandas as pd

url = 'https://magicseaweed.com/Belmar-Surf-Report/3683/'

header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}

response = requests.get(url, headers=header)

dfs = pd.read_html(response.text)
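The snippet stops at the extraction step; the post-processing might continue like the sketch below (the table indices are assumptions based on the outputs shown in the previous answer, so check len(dfs) and the frames' contents on your end):

# continue from the dfs list returned by pd.read_html above
print(len(dfs))           # how many tables pandas found on the page

forecast = dfs[0]         # assumed index of the main forecast table
print(forecast.iloc[:8])  # first 8 rows, mimicking IMPORTHTML's row limit

tide = dfs[1]             # assumed index of the tide table
print(tide.to_string(index=False))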