Scraping Vaccination Data using BeautifulSoup (python)

Time:11-30

I am new to scraping :). I would like to scrape a website to get information about vaccination. Here is the website: https://ourworldindata.org/covid-vaccinations

My goal is to obtain the table with three columns:

  • "Country"
  • "Share of people fully vaccinated against COVID-19"
  • "Share of people only partly vaccinated against COVID-19"

Here is my code:

# importing basic libraries
import requests
from bs4 import BeautifulSoup


# request for getting the target html.
def get_html(URL):
    scrape_result = requests.get(URL)
    return scrape_result.text


vac_html = get_html("https://ourworldindata.org/covid-vaccinations")

# the BeautifulSoup library for scraping the data, with "html.parser" for parsing.
beatiful_soup = BeautifulSoup(vac_html, "html.parser")

# view the html script.
print(beatiful_soup.prettify())

# finding the content of interest 
get_table = beatiful_soup.find_all("tr")

for x in get_table:
    print("*********")
    print(x)

Current output: the entire webpage as HTML. Here is a fraction of it:


'\n<!DOCTYPE html>\n<!--[if IE 8]> <html lang="en" > <![endif]-->\n<!--[if IE 9]> <html lang="en" > <![endif]-->\n<!--[if !IE]><!-->\n<html lang="en">\n<!--<![endif]-->\n<head>\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<title>COVID Live Update: 261,656,911 Cases and 5,216,375 Deaths from the Coronavirus - Worldometer</title>\n<meta name="description" content="Live statistics and coronavirus news tracking the number of confirmed cases, recovered patients, tests, and death toll due to the COVID-19 coronavirus from Wuhan, China. Coronavirus counter with new cases, deaths, and number of tests per 1 Million population. Historical data and info. Daily charts, graphs, news and updates">\n\n<link rel="shortcut icon" href="/favicon/favicon.ico" type="image/x-icon">\n<link rel="apple-touch-icon" sizes="57x57" href="/favicon/apple-icon-57x57.png">\n<link rel="apple-touch-icon" sizes="60x60" href="/favicon/apple-icon-60x60.png">\n<link rel="apple-touch-icon" sizes="72x72" href="/favicon/apple-icon-72x72.png">\n<link rel="apple-touch-icon" sizes="76x76" href="/favicon/apple-icon-76x76.png">\n<link rel="apple-touch-icon" sizes="114x114"

Unfortunately, it is not producing the information I would like to see. Does anyone have experience with web scraping and could quickly review my code?

Thanks in advance for your help!

CodePudding user response:

I just took a quick look at that website. Instead of using Beautiful Soup, I suggest using the same request the page itself makes to get its data. In the network requests (viewed in the browser's dev tools) you will find a GET request to https://covid.ourworldindata.org/data/internal/megafile--vaccinations.json. You can go back to the site and check this yourself. If you open that link, you will see that it returns a JSON object that you can parse.

CodePudding user response:

It's all there if you get the data directly from the source:

import requests
import pandas as pd

url = "https://covid.ourworldindata.org/data/internal/megafile--vaccinations-bydose.json"
jsonData = requests.get(url).json()

df = pd.DataFrame(jsonData)

Output:

print(df)
          location  ... people_partly_vaccinated_per_hundred
0      Afghanistan  ...                             0.987197
1      Afghanistan  ...                             0.986009
2      Afghanistan  ...                             0.952562
3      Afghanistan  ...                             0.924529
4      Afghanistan  ...                             0.918366
           ...  ...                                  ...
30218     Zimbabwe  ...                             6.310471
30219     Zimbabwe  ...                             6.384688
30220     Zimbabwe  ...                             6.429645
30221     Zimbabwe  ...                             6.429439
30222     Zimbabwe  ...                             6.447568

[30223 rows x 6 columns]
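
The frame above has one row per country per date, so getting to the three-column table from the question still takes one more step. Here is a rough sketch of how that might look. Only "location" and "people_partly_vaccinated_per_hundred" are visible in the printed output; the names "date" and "people_fully_vaccinated_per_hundred" are my assumptions, so check df.columns and adjust. The helper name latest_shares is also just made up for this example.

```python
import pandas as pd


def latest_shares(
    df,
    date_col="date",  # assumed column name, not confirmed by the output above
    fully_col="people_fully_vaccinated_per_hundred",  # assumed column name
    partly_col="people_partly_vaccinated_per_hundred",
):
    """Keep the most recent row per country and return three columns."""
    # Sort so that the last row within each country is its latest date.
    # ISO-formatted date strings (YYYY-MM-DD) sort correctly as text.
    latest = (
        df.sort_values(["location", date_col])
          .groupby("location")
          .tail(1)
    )
    table = latest[["location", fully_col, partly_col]].reset_index(drop=True)
    return table.rename(columns={"location": "Country"})
```

With the df built above, table = latest_shares(df) should give one row per country. I used tail(1) rather than last() because last() picks the last non-null value per column, which could mix values from different dates.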