I am trying to write Python code to export a weather dataset to a DataFrame. The page at the URL in the code below contains a map, and I need to extract all the data behind that map into a table.
Below is the code I wrote:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "http://meteo.gov.lk/index.php?option=com_content&view=article&id=102&Itemid=360&lang=en"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html5lib')
df = pd.DataFrame(columns=["Location", "Status", "Temperature", "Rainfall", "Reported_Time"])
for row in soup.find_all('tr'):
    col = row.find_all("td")
    Location = col[0].text
    Status = col[1].text
    Temperature = col[2].text
    Rainfall = col[3].text
    Reported_Time = col[4].text
    df = df.append({"Location":Location,"Status":Status,"Temperature":Temperature,"Rainfall":Rainfall,"Reported_Time":Reported_Time,}, ignore_index=True)
print(df)
When I run the above, the output is empty, i.e. the DataFrame has no rows. Please see below:
Empty DataFrame
Columns: [Location, Status, Temperature, Rainfall,RH, Reported_Time]
Index: []
Can you please help me solve this error? I am really new to Python.
CodePudding user response:
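The data is not in an HTML table, so looping over 'tr' elements finds nothing useful. It is stored in a JavaScript object inside one of the page's script tags, which you can load as a dict with json5: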
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json5
import re
url = "http://meteo.gov.lk/index.php?option=com_content&view=article&id=102&Itemid=360&lang=en"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html5lib')
scpt = soup.select('script')
# extract javascript array as python dict
data = scpt[22].text.split('mapdiv",', 1)[1].split(');', 1)[0].strip()
data = json5.loads(data)
# load dict to pandas
df = pd.DataFrame(data['dataProvider']['images'])
# extract data from description column
def extract_data(row):
    r = [i.strip() for i in re.findall('(?<=:).*?(?=<)', row['description'])]
    return float(r[0].split('℃')[0]), r[1][:-2], int(r[2][:-1]), pd.to_datetime(r[3], format='%d/%m/%Y %H%M')
df[['Temperature', 'Rainfall', 'RH', 'Reported_Time']] = df.apply(extract_data, axis=1, result_type='expand')
df['Status'] = df['title'].str.split(': ').str[-1]
final_df = df[['label', 'Status', 'Temperature', 'Rainfall', 'RH', 'Reported_Time']]
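One fragile spot above is the hard-coded scpt[22]: if the site adds or removes a script tag, that index points at the wrong script. A possible alternative, shown here only as a sketch, is to search every script for the same 'mapdiv",' marker. The helper name find_chart_config, the drawMap call, and the sample strings are hypothetical stand-ins for the real page, and the sample uses strict JSON so that the standard json module can parse it; for the real page you would still need json5.loads as above.

```python
import json


def find_chart_config(script_texts, marker='mapdiv",'):
    """Return the object literal that follows `marker` in the first matching script."""
    for text in script_texts:
        if marker in text:
            # Same slicing as in the answer: keep what follows the marker,
            # then cut at the first ');' that closes the call.
            return text.split(marker, 1)[1].split(');', 1)[0].strip()
    raise ValueError(f"no script contains {marker!r}")


# Hypothetical stand-in for [s.text for s in soup.select('script')].
sample_scripts = [
    'console.log("unrelated");',
    'var map = drawMap("mapdiv", {"dataProvider": {"images": [{"label": "Jaffna"}]}});',
]

# json.loads works here only because the sample is strict JSON.
config = json.loads(find_chart_config(sample_scripts))
print(config["dataProvider"]["images"][0]["label"])
```

With the real page you would pass [s.text for s in scpt] instead of sample_scripts and swap json.loads for json5.loads.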
Output:
 | label | Status | Temperature | Rainfall | RH | Reported_Time |
---|---|---|---|---|---|---|
0 | Jaffna | cloudy | 31.2 | 0 | 65 | 2022-10-08 11:30:00 |
1 | Mannar | Partly Cloudy | 29.6 | 0 | 75 | 2022-10-08 11:30:00 |
2 | Vavuniya | Partly Cloudy | 33.2 | 0 | 70 | 2022-10-08 11:30:00 |
3 | Trincomalee | Partly Cloudy | 32 | 0 | 60 | 2022-10-08 11:30:00 |
4 | Anuradhapura | Partly Cloudy | 33 | 0 | 60 | 2022-10-08 11:30:00 |
5 | Maha Iluppallama | Partly Cloudy | 31.4 | 0 | 55 | 2022-10-08 11:30:00 |
6 | Polonnaruwa | Partly Cloudy | 33.8 | 0 | 45 | 2022-10-08 11:30:00 |
7 | Puttalam | cloudy | 31.6 | 0 | 60 | 2022-10-08 11:30:00 |
8 | Kurunegala | Partly Cloudy | 32.9 | 0 | 60 | 2022-10-08 11:30:00 |
9 | Batticaloa | fair | 31.4 | 0 | 80 | 2022-10-08 11:30:00 |
10 | Badulla | cloudy | 28.6 | 0 | 55 | 2022-10-08 11:30:00 |
11 | Katugastota | cloudy | 27.4 | 0 | 70 | 2022-10-08 11:30:00 |
12 | Katunayake | cloudy | 29.2 | 0 | 70 | 2022-10-08 11:30:00 |
13 | Colombo | cloudy | 30.1 | 0 | 70 | 2022-10-08 11:30:00 |
14 | Rathmalana | cloudy | 29.8 | 0 | 70 | 2022-10-08 11:30:00 |
15 | Nuwara Eliya | cloudy | 19 | 0 | 75 | 2022-10-08 11:30:00 |
16 | Bandarawela | Haze | 24.8 | 0 | 55 | 2022-10-08 11:30:00 |
17 | Ratnapura | Haze | 28.9 | 0 | 70 | 2022-10-08 11:30:00 |
18 | Monaragala | cloudy | 31 | 0 | 65 | 2022-10-08 11:30:00 |
19 | Mattala | Partly Cloudy | 34.1 | 0 | 50 | 2022-10-08 11:30:00 |
20 | Hambanthota | Partly Cloudy | 34.6 | 0 | 45 | 2022-10-08 11:30:00 |
21 | Galle | Partly Cloudy | 28 | 0 | 80 | 2022-10-08 11:30:00 |
22 | Pottuvil | Partly Cloudy | 30 | 0 | 75 | 2022-10-08 11:30:00 |
23 | Mullaitivu | Partly Cloudy | 31.3 | 0 | 65 | 2022-10-08 11:30:00 |
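If you want to check the description parsing from extract_data on its own, here is a minimal sketch that uses only the standard library. The sample description string is an assumption reconstructed from what the regex and the slicing expect (a "℃" after the temperature, a two-character unit after the rainfall, a "%" after the humidity); the real markup on the page may differ. The function name parse_description is also hypothetical.

```python
import re
from datetime import datetime


def parse_description(description):
    """Split a 'Key: value<br>' style description into typed fields."""
    # Everything between a ':' and the next '<' is treated as one value.
    fields = [m.strip() for m in re.findall(r'(?<=:).*?(?=<)', description)]
    if len(fields) < 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    temperature = float(fields[0].split('℃')[0])
    rainfall = fields[1][:-2]        # drop a two-character unit, keep as string
    humidity = int(fields[2][:-1])   # drop the trailing '%'
    reported = datetime.strptime(fields[3], '%d/%m/%Y %H%M')
    return temperature, rainfall, humidity, reported


# Hypothetical description value; the real page's markup may differ.
sample = ('Temperature: 31.2 ℃<br>Rainfall: 0mm<br>'
          'RH: 65%<br>Reported Time: 08/10/2022 1130<br>')

print(parse_description(sample))
```

Raising an error when fewer than four fields are found makes a change in the page's markup show up immediately instead of as an IndexError deep inside df.apply.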