I have the following code:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
# Get the content for tab_Co id
temp_table = soup.find('table', id='tab_Co')
# Create Headers
headers = []
for i in temp_table.find_all('th'):
    title = i.text
    headers.append(title)
# Create DataFrame with the headers as columns
mydata = pd.DataFrame(columns = headers)
# This is where the script goes wrong
# Create loop that retrieves information and appends it to the DataFrame
for j in temp_table.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row
What am I doing wrong? The end goal is a DataFrame from which I can extract the top 4 values for each of these columns:
['Temperatura Max (ºC)',
'Temperatura Min (ºC)',
'Prec. acumulada (mm)',
'Rajada máxima (km/h)',
'Humidade Max (%)',
'Humidade Min (%)',
'Pressão atm. (hPa)']
and then use those to generate a daily image. Any ideas? Thank you in advance!
Disclaimer: This is for a not-for-profit project, and no commercial use will be made of the solution.
CodePudding user response:
So this worked, based on help from @SirGattto on Twitter:
# Import libraries
import requests
import pandas as pd
import json
import re
# Define target URL
url = 'https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp'
# Get URL information
page = requests.get(url)
# After inspecting the page, apply a regex search for the embedded observations variable
search = re.search(r'var observations = (.*?);', page.text, re.DOTALL)
# Create dict by loading the json information
json_data = json.loads(search.group(1))
# Create Dataframe from json result
df1 = pd.concat({k: pd.DataFrame(v).T for k, v in json_data.items()}, axis=0)
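The regex step above can be tried offline on a small stand-in for the page. This is only a minimal sketch, assuming the page embeds its data as a JavaScript assignment of the form var observations = {...}; as the search pattern suggests. The sample HTML, the date key, the station IDs, and the field names below are all made up for illustration and are not the real IPMA payload.

```python
import json
import re

# Hypothetical stand-in for page.text; the real page's payload will differ.
sample_html = """
<script>
var observations = {"2022-01-01": {"1200535": {"temp_max": 18.2, "temp_min": 9.1},
                                   "1200545": {"temp_max": 16.7, "temp_min": 7.4}}};
var other = 1;
</script>
"""

# Non-greedy match up to the first semicolon; DOTALL lets '.' span newlines.
match = re.search(r"var observations = (.*?);", sample_html, re.DOTALL)
if match is None:
    raise ValueError("'var observations' not found; the page layout may have changed")

# The captured group is plain JSON text, so json.loads turns it into nested dicts.
observations = json.loads(match.group(1))
print(sorted(observations["2022-01-01"]))
```

Note that the non-greedy pattern stops at the first semicolon, so it would cut the JSON short if any string value contained one. Checking that the match is not None before calling group(1) also gives a clearer error if the site changes its markup.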
CodePudding user response:
From the page source (view-source:https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp), it is clear that the data is in the th elements, so try scraping with row_data = j.find_all('th')
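For the last step mentioned in the question, taking the top 4 values per column, something like the following may work once the DataFrame is populated. It is a sketch under assumptions: the station labels and numbers are invented, only two of the question's columns are shown, and the real DataFrame may need reshaping first.

```python
import pandas as pd

# Hypothetical sample; station names and values are invented for illustration.
df = pd.DataFrame(
    {
        "Temperatura Max (ºC)": [18.2, 21.5, 16.7, 19.9, 20.3, 17.1],
        "Prec. acumulada (mm)": [0.0, 2.4, 11.8, 5.1, 0.3, 7.6],
    },
    index=["Station A", "Station B", "Station C", "Station D", "Station E", "Station F"],
)

# Scraped values often arrive as strings; coerce to numbers (bad cells become NaN).
df = df.apply(pd.to_numeric, errors="coerce")

# For each column, keep the 4 largest values together with their station labels.
top4 = {col: df[col].nlargest(4) for col in df.columns}

for col, series in top4.items():
    print(col)
    print(series.to_string())
```

For the "Min" columns, nsmallest(4) may be closer to what is wanted than nlargest(4), depending on whether "top" means the most extreme readings.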