I want to scrape table data from this link: https://ourworldindata.org/grapher/pollution-deaths-from-fossil-fuels?tab=table
I use this code:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

url = "https://ourworldindata.org/grapher/pollution-deaths-from-fossil-fuels?tab=table"
soup = bs(requests.get(url).content, 'html.parser')
soup.select("body[class='StandaloneGrapherOrExplorerPage'] table[class='data-table']")
But BeautifulSoup can't see the table: the select() call returns an empty list, because the table isn't in the static HTML.
CodePudding user response:
The fastest and most practical solution is to take all the data directly from the underlying API requests:
import pandas as pd
import requests

# The grapher page loads its values and its entity names from two JSON endpoints:
data_url = 'https://ourworldindata.org/grapher/data/variables/data/145500.json'
meta_url = 'https://ourworldindata.org/grapher/data/variables/metadata/145500.json'
data = requests.get(data_url).json()['values']
meta = requests.get(meta_url).json()['dimensions']['entities']['values']

# The i-th entity in the metadata corresponds to the i-th value in the data
df = pd.DataFrame([[x['name'], data[i]] for i, x in enumerate(meta)])
print(df)
OUTPUT:
0 United Kingdom 27905
1 Ireland 970
2 France 30802
3 Belgium 7305
4 Netherlands 9412
.. ... ...
185 East Asia 1698062
186 West Asia 183533
187 Micronesia 2
188 South and Southeast Asia 942806
189 Sudan and South Sudan 2937
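If the hardcoded variable ID (145500) ever changes, the current data and metadata URLs can be discovered from the page's <link rel="preload"> tags instead, as a later answer does. A minimal sketch, assuming the grapher page still preloads the variable JSON files that way:
import requests
from bs4 import BeautifulSoup

page_url = "https://ourworldindata.org/grapher/pollution-deaths-from-fossil-fuels?tab=table"
soup = BeautifulSoup(requests.get(page_url).content, "html.parser")

# Collect the preloaded JSON endpoints (data first, then metadata, at the time of writing)
json_urls = ["https://ourworldindata.org" + link["href"]
             for link in soup.find_all("link", {"rel": "preload"})]
print(json_urls)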
CodePudding user response:
The table is loaded dynamically from an API that returns a JSON object, which is then mapped to country names on the page. While there are ways to extract the mapping logic from the script, fetch the JSON object, apply the mapping and then build the dataframe (see the other answers), that would be a quite complex solution.
The easiest way to get that table as seen on page is to use selenium:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
# chrome_options.add_argument("--headless")
chrome_options.add_argument("start-maximized")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
chrome_options.add_argument("--disable-blink-features")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
webdriver_service = Service("chromedriver/chromedriver")  # path to where you saved the chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
browser.get("https://ourworldindata.org/grapher/pollution-deaths-from-fossil-fuels?tab=table")

# Wait for the dynamically rendered table before reading the page source
WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, "table")))
dfs = pd.read_html(browser.page_source)
dfs[0]
This returns a dataframe with 190 rows × 2 columns:
Country    Annual deaths attributed to air pollution from fossil fuels (deaths • 2015)
0 Afghanistan 2494
1 Africa 67132
2 Albania 1308
3 Algeria 2008
4 Angola 622
... ... ...
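Once the dataframe has been captured, the driver should be released; a one-line follow-up:
# close the browser session now that the table has been read
browser.quit()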
CodePudding user response:
The table data is loaded via an API, as others have already said. So let's make a solution that works for any table on 'ourworldindata'.
Here's the code to solve your problem:
import requests
from bs4 import BeautifulSoup as bs4
import pandas as pd
# Loading the page:
page = requests.get("https://ourworldindata.org/grapher/pollution-deaths-from-fossil-fuels?tab=table")
soup = bs4(page.content, "lxml")

# To build the table we first have to find the JSON data files; the page preloads them via <link rel="preload"> tags:
urls = ["https://ourworldindata.org" + x["href"] for x in soup.find_all("link", {"rel": "preload"})]

# The first JSON file holds the values; pull all of the data out of it:
page = requests.get(urls[0]).json()
df = pd.DataFrame(page)
df.rename(columns={'entities': 'id'}, inplace=True)

# The second JSON file holds the metadata (entity id -> name mapping):
page1 = requests.get(urls[1]).json()
df2 = pd.DataFrame(page1["dimensions"]["entities"]["values"])

# We now have all the data we need; merge it into a single DataFrame:
merged = pd.merge(df2, df)
merged.drop(columns=["code", "id"], inplace=True)

# Finally, reshape the DataFrame so the index is the country name and the columns are the years:
merged = (merged.assign(idx=merged.groupby(['name']).cumcount())
                .pivot(index='idx', columns=['name'], values='values')
          ).reset_index(drop=True).T
merged.columns = df['years'].unique()
print(merged)
Output:
2015
name
Afghanistan 2494
Africa 67132
Albania 1308
Algeria 2008
Angola 622
... ...
West Asia 183533
World 3608196
Yemen 1838
Zambia 348
Zimbabwe 407
[190 rows × 1 columns]
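The cumcount + pivot step can look cryptic; here is a toy illustration (with made-up numbers) of how it turns long name/year/value rows into one column per year:
import pandas as pd

# Hypothetical long-format data: one row per (name, year) pair
long_df = pd.DataFrame({
    'name':   ['Albania', 'Albania', 'Algeria', 'Algeria'],
    'years':  [2014, 2015, 2014, 2015],
    'values': [1290, 1308, 1950, 2008],
})

# cumcount numbers each country's rows 0, 1, ...; pivoting on that
# number and transposing yields one row per country, one column per year
wide = (long_df.assign(idx=long_df.groupby(['name']).cumcount())
               .pivot(index='idx', columns=['name'], values='values')
        ).reset_index(drop=True).T
wide.columns = long_df['years'].unique()
print(wide)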
CodePudding user response:
The program below will print the page's script tags, which is where the configuration and data references behind the table live:
from bs4 import BeautifulSoup
from urllib.request import urlopen,Request
req = Request("https://ourworldindata.org/grapher/pollution-deaths-from-fossil-fuels?tab=table", headers={'User-Agent': 'Mozilla/5.0'})
html_code = urlopen(req).read().decode("utf-8")
Soup = BeautifulSoup(html_code, "lxml")
PL = Soup.find_all('script')
print(PL)
For it to run you may need to install lxml, with this command in your CMD --> pip install lxml,
if you don't already have it. urllib is part of the Python standard library, so no extra install is needed for it.
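Printing every script tag is noisy; as a follow-up, a minimal sketch that narrows the list down, assuming the relevant data is embedded in a script whose text mentions the grapher (the exact embedding format isn't guaranteed):
# filter for scripts whose inline text mentions the grapher;
# whether anything matches depends on how the page embeds its config
candidates = [s for s in PL if s.string and "grapher" in s.string]
for s in candidates:
    print(s.string[:200])  # peek at the first 200 characters of each match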