I want to scrape table data from this link: https://ourworldindata.org/grapher/pollution-deaths-from-fossil-fuels?tab=table
I use this code:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

url = "https://ourworldindata.org/grapher/pollution-deaths-from-fossil-fuels?tab=table"
soup = bs(requests.get(url).content, 'html.parser')
soup.select("body[class='StandaloneGrapherOrExplorerPage'] table[class='data-table']")
But BeautifulSoup can't see the table: the select() call returns an empty list, because the table isn't in the static HTML.
CodePudding user response:
The fastest and most practical solution is to take all the data directly from the underlying API requests:
import pandas as pd
import requests

# The grapher page loads its values and its entity names from two JSON endpoints:
data_url = 'https://ourworldindata.org/grapher/data/variables/data/145500.json'
meta_url = 'https://ourworldindata.org/grapher/data/variables/metadata/145500.json'
data = requests.get(data_url).json()['values']
meta = requests.get(meta_url).json()['dimensions']['entities']['values']

# The i-th entity in the metadata corresponds to the i-th value in the data
df = pd.DataFrame([[x['name'], data[i]] for i, x in enumerate(meta)])
print(df)
OUTPUT:
0 United Kingdom 27905
1 Ireland 970
2 France 30802
3 Belgium 7305
4 Netherlands 9412
.. ... ...
185 East Asia 1698062
186 West Asia 183533
187 Micronesia 2
188 South and Southeast Asia 942806
189 Sudan and South Sudan 2937
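If the hardcoded variable ID (145500) ever changes, the current data and metadata URLs can be discovered from the page's <link rel="preload"> tags instead, as a later answer does. A minimal sketch, assuming the grapher page still preloads the variable JSON files that way:
import requests
from bs4 import BeautifulSoup

page_url = "https://ourworldindata.org/grapher/pollution-deaths-from-fossil-fuels?tab=table"
soup = BeautifulSoup(requests.get(page_url).content, "html.parser")

# Collect the preloaded JSON endpoints (data first, then metadata, at the time of writing)
json_urls = ["https://ourworldindata.org" + link["href"]
             for link in soup.find_all("link", {"rel": "preload"})]
print(json_urls)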
CodePudding user response:
The table is loaded dynamically from an API that returns a JSON object, which is then mapped to country names on the page. While there are ways to extract the mapping logic from the script, fetch the JSON object, apply the mapping and then build the dataframe (see the other answers), that would be a quite complex solution.
The easiest way to get that table as seen on page is to use selenium:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
# chrome_options.add_argument("--headless")
chrome_options.add_argument("start-maximized")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
chrome_options.add_argument("--disable-blink-features")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
webdriver_service = Service("chromedriver/chromedriver")  # path to where you saved the chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
browser.get("https://ourworldindata.org/grapher/pollution-deaths-from-fossil-fuels?tab=table")

# Wait for the dynamically rendered table before reading the page source
WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, "table")))
dfs = pd.read_html(browser.page_source)
dfs[0]
This returns a dataframe with 190 rows × 2 columns:
Country    Annual deaths attributed to air pollution from fossil fuels (deaths • 2015)
0 Afghanistan 2494
1 Africa 67132
2 Albania 1308
3 Algeria 2008
4 Angola 622
... ... ...
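Once the dataframe has been captured, the driver should be released; a one-line follow-up:
# close the browser session now that the table has been read
browser.quit()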
CodePudding user response:
The table data is loaded via an API, as others have already said. So let's make a solution that works for any table on 'ourworldindata'.
Here's the code to solve your problem:
import requests
from bs4 import BeautifulSoup as bs4
import pandas as pd
# Loading the page:
page = requests.get("https://ourworldindata.org/grapher/pollution-deaths-from-fossil-fuels?tab=table")
soup = bs4(page.content, "lxml")

# To build the table we first have to find the JSON data files; the page preloads them via <link rel="preload"> tags:
urls = ["https://ourworldindata.org" + x["href"] for x in soup.find_all("link", {"rel": "preload"})]

# The first JSON file holds the values; pull all of the data out of it:
page = requests.get(urls[0]).json()
df = pd.DataFrame(page)
df.rename(columns={'entities': 'id'}, inplace=True)

# The second JSON file holds the metadata (entity id -> name mapping):
page1 = requests.get(urls[1]).json()
df2 = pd.DataFrame(page1["dimensions"]["entities"]["values"])

# We now have all the data we need; merge it into a single DataFrame:
merged = pd.merge(df2, df)
merged.drop(columns=["code", "id"], inplace=True)

# Finally, reshape the DataFrame so the index is the country name and the columns are the years:
merged = (merged.assign(idx=merged.groupby(['name']).cumcount())
                .pivot(index='idx', columns=['name'], values='values')
          ).reset_index(drop=True).T
merged.columns = df['years'].unique()
print(merged)
Output:
2015
name
Afghanistan 2494
Africa 67132
Albania 1308
Algeria 2008
Angola 622
... ...
West Asia 183533
World 3608196
Yemen 1838
Zambia 348
Zimbabwe 407
[190 rows × 1 columns]
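The cumcount + pivot step can look cryptic; here is a toy illustration (with made-up numbers) of how it turns long name/year/value rows into one column per year:
import pandas as pd

# Hypothetical long-format data: one row per (name, year) pair
long_df = pd.DataFrame({
    'name':   ['Albania', 'Albania', 'Algeria', 'Algeria'],
    'years':  [2014, 2015, 2014, 2015],
    'values': [1290, 1308, 1950, 2008],
})

# cumcount numbers each country's rows 0, 1, ...; pivoting on that
# number and transposing yields one row per country, one column per year
wide = (long_df.assign(idx=long_df.groupby(['name']).cumcount())
               .pivot(index='idx', columns=['name'], values='values')
        ).reset_index(drop=True).T
wide.columns = long_df['years'].unique()
print(wide)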
CodePudding user response:
The program below will print the page's script tags, which is where the configuration and data references behind the table live:
from bs4 import BeautifulSoup
from urllib.request import urlopen,Request
req = Request("https://ourworldindata.org/grapher/pollution-deaths-from-fossil-fuels?tab=table", headers={'User-Agent': 'Mozilla/5.0'})
html_code = urlopen(req).read().decode("utf-8")
Soup = BeautifulSoup(html_code, "lxml")
PL = Soup.find_all('script')
print(PL)
For it to run you may need to install lxml, with this command in your CMD --> pip install lxml,
if you don't already have it. urllib is part of the Python standard library, so no extra install is needed for it.
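Printing every script tag is noisy; as a follow-up, a minimal sketch that narrows the list down, assuming the relevant data is embedded in a script whose text mentions the grapher (the exact embedding format isn't guaranteed):
# filter for scripts whose inline text mentions the grapher;
# whether anything matches depends on how the page embeds its config
candidates = [s for s in PL if s.string and "grapher" in s.string]
for s in candidates:
    print(s.string[:200])  # peek at the first 200 characters of each match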