Why does 'div' return None when I'm trying to get the contents below it?

Time:07-29

I am starting a small web-scraping project with this website: https://waqi.info/#/c/10.017/93.166/3.7z

I am trying to get the contents under this element:

<div class='ranking-horizontal-list  ranking-list ranking-countries'>

Here is my code:

from bs4 import BeautifulSoup
import requests


url = 'https://waqi.info/#/c/13.068/93.22/3.7z'
html_text = requests.get(url).text
doc = BeautifulSoup(html_text, 'html.parser')

tbody = doc.find('div', class_ = 'ranking-horizontal-list  ranking-list ranking-countries').contents

print(tbody)

I've also changed the code to:

tbody = doc.find('div', {'class': 'ranking-horizontal-list ranking-list ranking-countries'})

but it still doesn't work. I always get None as the result, and I don't know where the problem is.

Thanks for your help!

CodePudding user response:

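That page loads its data from an API: after the initial page load, JavaScript calls that API and fetches the data as JSON. requests only downloads the initial HTML and does not run JavaScript, so the div you are looking for is most likely not in html_text at all, which is why find() returns None. Note also that everything after the # in your URL is a fragment, which is handled by the browser and never sent to the server. You can see the API requests in the Network tab of your browser's Dev Tools. Therefore, you would be better off scraping that API directly: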

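import requests
import pandas as pd

# JSON endpoint that the page itself calls (visible in the Dev Tools Network tab)
url = 'https://waqi.info/rtdata/markers-1659088619/level1.json'
r = requests.get(url)
r.raise_for_status()  # stop on an HTTP error instead of parsing an error page
df = pd.DataFrame(r.json())  # load the JSON payload into a DataFrame
print(df)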

This prints a DataFrame with 812 rows × 6 columns:

     g                                    n                                            u                    a    t       x
0    [-40.584478665958, -73.11871982209]  Osorno, Chile                                2022-07-29 05:00:00  160  -04:00  S000428
1    [-22.44283906304, -68.932546346863]  Colegio Pedro Vergara Keller, Chile          2022-07-29 05:00:00  20   -04:00  S000417
2    [57.2591972, -111.0385833]           Wapasu, Alberta, Canada                      2022-07-29 03:00:00  110  -06:00  S009298
3    [36.0464, 103.831]                   Railway Design Institute, Lanzhou (兰州铁路设计院)  2022-07-29 17:00:00  98   +08:00  S001407
4    [44.849647300541, -0.544994768749]   Bastide, Bordeaux, Aquitaine, France         2022-07-29 10:00:00  62   +02:00  S005056
...  ...                                  ...                                          ...                  ...  ...     ...
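
The single-letter column names are not very readable, so you may want to tidy the frame before working with it. Here is a rough sketch of that step. The column meanings are my guess from the sample rows above, not something taken from any documentation: g looks like [latitude, longitude], n the station name, u the last update time, a the air quality reading, t the UTC offset and x a station id. The function name tidy_markers and the new column names are made up for this example, and the sample rows are copied from the output above so the snippet runs without a network request.

```python
import pandas as pd

# Three rows copied from the output above; in practice use pd.DataFrame(r.json()).
sample = [
    {"g": [-40.584478665958, -73.11871982209], "n": "Osorno, Chile",
     "u": "2022-07-29 05:00:00", "a": 160, "t": "-04:00", "x": "S000428"},
    {"g": [-22.44283906304, -68.932546346863],
     "n": "Colegio Pedro Vergara Keller, Chile",
     "u": "2022-07-29 05:00:00", "a": 20, "t": "-04:00", "x": "S000417"},
    {"g": [57.2591972, -111.0385833], "n": "Wapasu, Alberta, Canada",
     "u": "2022-07-29 03:00:00", "a": 110, "t": "-06:00", "x": "S009298"},
]


def tidy_markers(df):
    """Return a copy with readable names and numeric types (assumed meanings)."""
    out = df.rename(columns={"n": "station", "u": "updated", "a": "aqi",
                             "t": "utc_offset", "x": "station_id"})
    # Split the [lat, lon] pair into two separate columns.
    out["lat"] = out["g"].apply(lambda pair: pair[0])
    out["lon"] = out["g"].apply(lambda pair: pair[1])
    out = out.drop(columns="g")
    # Coerce readings to numbers; anything non-numeric becomes NaN.
    out["aqi"] = pd.to_numeric(out["aqi"], errors="coerce")
    # Parse the timestamp strings; unparseable values become NaT.
    out["updated"] = pd.to_datetime(out["updated"], errors="coerce")
    # Highest reading first.
    return out.sort_values("aqi", ascending=False).reset_index(drop=True)


tidy = tidy_markers(pd.DataFrame(sample))
print(tidy[["station", "aqi", "lat", "lon"]])
```

If the guessed meanings turn out to be wrong, only the rename mapping needs to change; the rest of the function does not depend on the names.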