Web-Scraping with BS4 (Python) - Dataframe returning blanks

Time:10-09

I am trying to write Python code that exports a weather dataset to a DataFrame. The page at the link in the code contains a map, and I need to extract all of the data behind it into a table.

Below is the code I wrote:

import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "http://meteo.gov.lk/index.php?option=com_content&view=article&id=102&Itemid=360&lang=en"
data  = requests.get(url).text
soup = BeautifulSoup(data, 'html5lib')
df = pd.DataFrame(columns=["Location", "Status", "Temperature", "Rainfall",  "Reported_Time"])
for row in soup.find_all('tr'):
    col = row.find_all("td")
    Location = col[0].text
    Status = col[1].text
    Temperature = col[2].text
    Rainfall = col[3].text
    Reported_Time = col[4].text
    df = df.append({"Location":Location,"Status":Status,"Temperature":Temperature,"Rainfall":Rainfall,"Reported_Time":Reported_Time,}, ignore_index=True)
print(df)

When I run the code above, the DataFrame comes back empty. Please see the output below:

Empty DataFrame
Columns: [Location, Status, Temperature, Rainfall,RH, Reported_Time]
Index: []

Can you please help me solve this? I am really new to Python.

Answer:

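The data is not in an HTML table, which is why the loop over 'tr' elements finds nothing. It is stored in a JavaScript object inside one of the page's script tags, which you can load as a dict with json5 (a third-party package, installed with pip install json5):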

import pandas as pd
import requests
from bs4 import BeautifulSoup
import json5
import re

url = "http://meteo.gov.lk/index.php?option=com_content&view=article&id=102&Itemid=360&lang=en"
data  = requests.get(url).text
soup = BeautifulSoup(data, 'html5lib')
scpt = soup.select('script')

# extract the javascript object as a python dict
# note: index 22 matched the page when this was written and may shift if the page changes
data = scpt[22].text.split('mapdiv",', 1)[1].split(');', 1)[0].strip()
data = json5.loads(data)

# load dict to pandas
df = pd.DataFrame(data['dataProvider']['images'])

# extract data from description column
def extract_data(row):
    r = [i.strip() for i in re.findall('(?<=:).*?(?=<)', row['description'])]
    return float(r[0].split('&#8451;')[0]), r[1][:-2], int(r[2][:-1]), pd.to_datetime(r[3], format='%d/%m/%Y %H%M')

df[['Temperature', 'Rainfall', 'RH', 'Reported_Time']] = df.apply(extract_data, axis=1, result_type='expand')
df['Status'] = df['title'].str.split(': ').str[-1]

final_df = df[['label', 'Status', 'Temperature', 'Rainfall', 'RH', 'Reported_Time']]
print(final_df)
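
The regex inside extract_data is the least obvious step, so here is a minimal standalone sketch of it using only the standard library. The sample description string is hypothetical: it is shaped after what extract_data expects (four "name: value" pairs, each followed by a tag), and the real strings on the page may differ. The function name parse_description is also just for illustration.

```python
import re
from datetime import datetime

# Hypothetical sample, shaped after what extract_data expects.
# The real description strings on the page may differ.
SAMPLE_DESCRIPTION = (
    "Temperature: 31.2&#8451;<br>Rainfall: 0mm<br>"
    "RH: 65%<br>Reported Time: 08/10/2022 1130<br>"
)


def parse_description(description):
    # Take every run of text that follows a colon and stops before the next tag.
    parts = [p.strip() for p in re.findall(r"(?<=:).*?(?=<)", description)]
    if len(parts) < 4:
        raise ValueError(f"expected 4 fields, found {len(parts)}")
    temperature = float(parts[0].split("&#8451;")[0])  # drop the degree-Celsius entity
    rainfall = parts[1][:-2]                           # drop a 2-character unit, keep as string
    humidity = int(parts[2][:-1])                      # drop the trailing percent sign
    reported = datetime.strptime(parts[3], "%d/%m/%Y %H%M")
    return temperature, rainfall, humidity, reported


parsed = parse_description(SAMPLE_DESCRIPTION)
print(parsed)
```

The length check is an addition: the original function would raise an IndexError on a description with fewer fields, and a ValueError with a message is easier to trace.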

Output (final_df):

    label             Status         Temperature  Rainfall  RH  Reported_Time
 0  Jaffna            cloudy                31.2         0  65  2022-10-08 11:30:00
 1  Mannar            Partly Cloudy         29.6         0  75  2022-10-08 11:30:00
 2  Vavuniya          Partly Cloudy         33.2         0  70  2022-10-08 11:30:00
 3  Trincomalee       Partly Cloudy           32         0  60  2022-10-08 11:30:00
 4  Anuradhapura      Partly Cloudy           33         0  60  2022-10-08 11:30:00
 5  Maha Iluppallama  Partly Cloudy         31.4         0  55  2022-10-08 11:30:00
 6  Polonnaruwa       Partly Cloudy         33.8         0  45  2022-10-08 11:30:00
 7  Puttalam          cloudy                31.6         0  60  2022-10-08 11:30:00
 8  Kurunegala        Partly Cloudy         32.9         0  60  2022-10-08 11:30:00
 9  Batticaloa        fair                  31.4         0  80  2022-10-08 11:30:00
10  Badulla           cloudy                28.6         0  55  2022-10-08 11:30:00
11  Katugastota       cloudy                27.4         0  70  2022-10-08 11:30:00
12  Katunayake        cloudy                29.2         0  70  2022-10-08 11:30:00
13  Colombo           cloudy                30.1         0  70  2022-10-08 11:30:00
14  Rathmalana        cloudy                29.8         0  70  2022-10-08 11:30:00
15  Nuwara Eliya      cloudy                  19         0  75  2022-10-08 11:30:00
16  Bandarawela       Haze                  24.8         0  55  2022-10-08 11:30:00
17  Ratnapura         Haze                  28.9         0  70  2022-10-08 11:30:00
18  Monaragala        cloudy                  31         0  65  2022-10-08 11:30:00
19  Mattala           Partly Cloudy         34.1         0  50  2022-10-08 11:30:00
20  Hambanthota       Partly Cloudy         34.6         0  45  2022-10-08 11:30:00
21  Galle             Partly Cloudy           28         0  80  2022-10-08 11:30:00
22  Pottuvil          Partly Cloudy           30         0  75  2022-10-08 11:30:00
23  Mullaitivu        Partly Cloudy         31.3         0  65  2022-10-08 11:30:00
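
One fragile point in the code above is the hard-coded scpt[22]: if the site adds or removes a script tag, the index will point at the wrong script. A possible workaround, sketched below, is to search the script bodies for the same 'mapdiv",' marker that the split already relies on. The helper name find_chart_config and the sample script strings (including the drawMap call) are hypothetical stand-ins, not taken from the page.

```python
def find_chart_config(script_texts, marker='mapdiv",'):
    """Return the raw object literal that follows `marker` in the first matching script."""
    for text in script_texts:
        if marker in text:
            # Same slicing as in the answer: keep what follows the marker,
            # up to the first ');', which is assumed to close the call.
            return text.split(marker, 1)[1].split(");", 1)[0].strip()
    raise ValueError(f"no script contains {marker!r}")


# Hypothetical script bodies standing in for [s.text for s in soup.select('script')].
scripts = [
    "var unrelated = 1;",
    'drawMap("mapdiv", {dataProvider: {images: [{label: "Jaffna"}]}});',
]

raw = find_chart_config(scripts)
print(raw)
```

With the real page, the returned string would be passed to json5.loads exactly as before. Note that this keeps the answer's assumption that ');' does not occur inside the data itself; if it did, the split would cut the object short.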