Need help getting tr values when scraping

Time:03-28

I have the following code

# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp'

page = requests.get(url)

soup = BeautifulSoup(page.text, 'lxml')
# Get the content for tab_Co id 
temp_table = soup.find('table', id='tab_Co')
# Create Headers
headers = []
for i in temp_table.find_all('th'):
 title = i.text
 headers.append(title)
# Create DataFrame with the headers as columns 
mydata = pd.DataFrame(columns = headers)

# This is where the script goes wrong
# Create loop that retrieves information and appends it to the DataFrame
for j in table1.find_all('tr')[1:]:
 row_data = j.find_all('td')
 row = [i.text for i in row_data]
 length = len(mydata)
 mydata.loc[length] = row

What am I doing wrong? The end goal is a DataFrame from which I can extract the top 4 values for each of these columns:

['Temperatura Max (ºC)',
 'Temperatura Min (ºC)',
 'Prec. acumulada (mm)',
 'Rajada máxima (km/h)',
 'Humidade Max (%)',
 'Humidade Min (%)',
 'Pressão atm. (hPa)']

and then use those to generate a daily image. Any ideas? Thank you in advance!

Disclaimer: This is for a not-for-profit project and no commercial use will be made of the solution.

CodePudding user response:

This worked, based on help from @SirGattto on Twitter:

# Import libraries (re and json are used below)
import json
import re

import requests
import pandas as pd
# Define target URL 
url = 'https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp'

# Get URL information 
page = requests.get(url)

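# Inspecting the page source shows the data assigned to a JavaScript
# variable, so capture everything between "var observations = " and the next ";"
search = re.search(r'var observations = (.*?);', page.text, re.DOTALL)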

# Create dict by loading the json information
json_data = json.loads(search.group(1))

# Create Dataframe from json result 
df1 = pd.concat({k: pd.DataFrame(v).T for k, v in json_data.items()}, axis=0)
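
To get from there to the top 4 values per column, something like the following might work. It is only a sketch: `sample_page` is a made-up stand-in for `page.text`, and the field names (`temp_max`, `temp_min`) and station ids are hypothetical, so the real payload may be shaped differently. It applies the same regex and `pd.concat` steps as above, converts the columns to numbers, and uses `Series.nlargest` for the top 4.

```python
import json
import re

import pandas as pd

# Made-up stand-in for page.text; keys and values are illustrative only.
sample_page = """
<script>
var observations = {
  "2022-03-28": {
    "st1": {"temp_max": 21.4, "temp_min": 9.8},
    "st2": {"temp_max": 18.2, "temp_min": 11.0},
    "st3": {"temp_max": 25.1, "temp_min": 7.5},
    "st4": {"temp_max": 19.9, "temp_min": 12.3},
    "st5": {"temp_max": 23.0, "temp_min": 8.1}
  }
};
</script>
"""

# Capture the JSON literal assigned to the JavaScript variable.
match = re.search(r"var observations = (.*?);", sample_page, re.DOTALL)
if match is None:
    raise ValueError("'var observations' not found in page source")
json_data = json.loads(match.group(1))

# One frame per day (stations as rows), stacked under a (day, station) index.
df = pd.concat(
    {day: pd.DataFrame(stations).T for day, stations in json_data.items()},
    axis=0,
)

# Scraped values may arrive as strings; coerce anything non-numeric to NaN.
df = df.apply(pd.to_numeric, errors="coerce")

# Top 4 values for each column, keeping the (day, station) labels.
top4 = {col: df[col].nlargest(4) for col in df.columns}

for col, series in top4.items():
    print(col)
    print(series)
```

One caveat: the non-greedy `(.*?);` stops at the first semicolon, so it would cut the match short if any string inside the JSON contained a `;`.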

CodePudding user response:

From the page source (view-source:https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp), the data appears to be in th elements rather than td elements, so try scraping with row_data = j.find_all('th')
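
A minimal sketch of that idea, run against a made-up table rather than the live page (the station names and values in `sample_html` are invented). Passing a list to `find_all` collects both `th` and `td` cells, which covers either layout. Note also that the loop in the question iterates over `table1`, which is never defined; the table was assigned to `temp_table`.

```python
from bs4 import BeautifulSoup

# Made-up table for illustration; the real page's markup may differ.
sample_html = """
<table id="tab_Co">
  <tr><th>Station</th><th>Temperatura Max (ºC)</th></tr>
  <tr><th>Station A</th><td>21.4</td></tr>
  <tr><th>Station B</th><td>18.2</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", id="tab_Co")

# Column headers come from the first row only.
header = [th.get_text(strip=True) for th in table.find("tr").find_all("th")]

# For the remaining rows, accept both <th> and <td> cells.
rows = []
for tr in table.find_all("tr")[1:]:
    cells = tr.find_all(["th", "td"])
    rows.append([cell.get_text(strip=True) for cell in cells])

print(header)
print(rows)
```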
