I am starting a small web-scraping project with this website: https://waqi.info/#/c/10.017/93.166/3.7z
I am trying to get the contents of this element:
<div class='ranking-horizontal-list ranking-list ranking-countries'>
Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://waqi.info/#/c/13.068/93.22/3.7z'
html_text = requests.get(url).text
doc = BeautifulSoup(html_text, 'html.parser')
tbody = doc.find('div', class_='ranking-horizontal-list ranking-list ranking-countries').contents
print(tbody)
I also tried changing the lookup to:
tbody = doc.find('div', {'class': 'ranking-horizontal-list ranking-list ranking-countries'})
but it still doesn't work. The result of find is always None, and I don't know where the problem is.
Thanks for your help!
CodePudding user response:
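That page loads its data from an API: after the page loads, JavaScript calls the API and fetches the data as JSON. requests only downloads the initial HTML and does not run JavaScript, so the ranking div is not in the text you parse and find returns None. You can see the API calls in the Network tab of your browser's Dev Tools. You are therefore better off requesting that API directly: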
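import requests
import pandas as pd
# URL copied from the Network tab; the number in it looks like a timestamp,
# so it may change over time (copy the current one if this stops working)
url = 'https://waqi.info/rtdata/markers-1659088619/level1.json'
r = requests.get(url)
r.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
df = pd.DataFrame(r.json())
print(df)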
This prints a DataFrame with 812 rows × 6 columns:
     g                                    n                                            u                    a    t       x
0    [-40.584478665958, -73.11871982209]  Osorno, Chile                                2022-07-29 05:00:00  160  -04:00  S000428
1    [-22.44283906304, -68.932546346863]  Colegio Pedro Vergara Keller, Chile          2022-07-29 05:00:00  20   -04:00  S000417
2    [57.2591972, -111.0385833]           Wapasu, Alberta, Canada                      2022-07-29 03:00:00  110  -06:00  S009298
3    [36.0464, 103.831]                   Railway Design Institute, Lanzhou (兰州铁路设计院)  2022-07-29 17:00:00  98   08:00   S001407
4    [44.849647300541, -0.544994768749]   Bastide, Bordeaux, Aquitaine, France         2022-07-29 10:00:00  62   02:00   S005056
...  ...                                  ...                                          ...                  ...  ...     ...
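The single-letter column names are not very readable, so here is a rough sketch of how the records could be tidied up before further use. It is written in plain Python on three sample rows copied from the output above, so it runs without a network call. The meanings I give the fields are guesses from the values (g looks like [latitude, longitude], n the station name, a the AQI reading, x a station id), and the tidy function and its output keys are hypothetical names of my own, not anything the site documents.

```python
# Sample records copied from the output above. The shape (a list of dicts)
# and the value types are assumptions about what r.json() returns.
records = [
    {"g": [-40.584478665958, -73.11871982209], "n": "Osorno, Chile",
     "u": "2022-07-29 05:00:00", "a": "160", "t": "-04:00", "x": "S000428"},
    {"g": [-22.44283906304, -68.932546346863],
     "n": "Colegio Pedro Vergara Keller, Chile",
     "u": "2022-07-29 05:00:00", "a": "20", "t": "-04:00", "x": "S000417"},
    {"g": [57.2591972, -111.0385833], "n": "Wapasu, Alberta, Canada",
     "u": "2022-07-29 03:00:00", "a": "110", "t": "-06:00", "x": "S009298"},
]


def tidy(record):
    """Return a flat dict with readable keys (hypothetical names)."""
    lat, lon = record["g"]  # assumed order: latitude, longitude
    try:
        aqi = int(record["a"])  # works for both "160" and 160
    except (TypeError, ValueError):
        aqi = None  # reading missing or not a number
    return {
        "station": record["n"],
        "lat": lat,
        "lon": lon,
        "aqi": aqi,
        "station_id": record["x"],
    }


rows = [tidy(r) for r in records]

# Keep only stations with a numeric reading, highest reading first.
worst_first = sorted(
    (row for row in rows if row["aqi"] is not None),
    key=lambda row: row["aqi"],
    reverse=True,
)

for row in worst_first:
    print(row["aqi"], row["station"])
```

With the real response you would presumably pass r.json() in place of the sample list, after checking that it really has this shape.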