I'm trying to convert the scraped data I get from the URL into a JSON
object.
import bs4 as bs
from urllib.request import Request, urlopen
import json
req = Request('https://www.worldometers.info/gdp/albania-gdp/',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = bs.BeautifulSoup(webpage, 'html.parser')
gdp = soup.select_one('span[style="margin-right:7px"]')
# print('gdp:', type(gdp.text))
gdp_growth_rate = soup.find('br').next_sibling
# print('gdp growth_rate', type(gdp_growth_rate.text))
gdp_historic = soup.find(
    'table', class_='table table-striped table-bordered table-hover table-condensed table-list')
# print('gdp historic: ', type(gdp_historic.text), sep='\n')
The idea is to convert the data I get from the table to JSON. The purpose of this is to create an API.
CodePudding user response:
In general
How can I convert the beautiful soup text to json object?
You can convert any JSON-serializable Python object (dict, list, tuple, string, ...) into a JSON string by using the json.dumps() method:
json.dumps(
    dict(
        gdp=soup.select_one('span[style="margin-right:7px"]').text
    )
)
Output:
{"gdp": "$13,038,538,300"}
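Note that json.dumps() needs plain Python values, so the text has to be pulled out of the BeautifulSoup element first. A minimal sketch, assuming a hand-written HTML snippet in place of the downloaded page (the snippet and the variable names are illustrative only):

```python
import json

from bs4 import BeautifulSoup

# Hand-written stand-in for the downloaded page, so no network access is needed.
html = '<div><span style="margin-right:7px">$13,038,538,300</span></div>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.select_one('span[style="margin-right:7px"]')

# A Tag object itself is not JSON serializable; its text is.
payload = json.dumps({"gdp": tag.get_text(strip=True)})
print(payload)  # {"gdp": "$13,038,538,300"}
```

Several scraped values can be collected in the same dict before the single json.dumps() call.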
Table to JSON
In my opinion, the best practice for scraping a basic table is pandas.read_html(). It parses the HTML with lxml or BeautifulSoup under the hood and returns a list of DataFrames, which can be exported in multiple formats, e.g. with .to_json().
It is not clear from your question which JSON string format you expect. DataFrame.to_json() takes a parameter orient that might be useful to get the format you need. The default for a DataFrame is 'columns', which leads to a dict-like structure {column -> {index -> value}}.
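To make the effect of orient concrete, here is a small sketch comparing the default with orient="records". It uses a hand-built two-row DataFrame with values taken from this page's table rather than the live site, so the frame itself is an assumption for illustration:

```python
import pandas as pd

# Two rows typed in by hand, so no network access is needed.
df = pd.DataFrame(
    {
        "Year": [2017, 2016],
        "GDP change": ["3.84%", "3.35%"],
    }
)

by_column = df.to_json()               # default orient="columns"
by_row = df.to_json(orient="records")  # one JSON object per table row

print(by_column)
print(by_row)
```

For an API, the records layout is often easier to consume, since each element of the list is a complete row.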
Example
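from io import StringIO

import pandas as pd
import requests

html = requests.get('https://www.worldometers.info/gdp/albania-gdp/',
                    headers={'User-agent': 'Mozilla/5.0'}).text
# Recent pandas versions expect a file-like object rather than a raw HTML string.
# [1] selects the second table found on the page.
pd.read_html(StringIO(html))[1].to_json()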
Output
First 5 rows as sample.
{"Year":{"0":2017,"1":2016,"2":2015,"3":2014,"4":2013},"GDP Nominal (Current USD)":{"0":"$13,038,538,300","1":"$11,883,682,171","2":"$11,386,931,490","3":"$13,228,247,844","4":"$12,776,280,961"},"GDP Real (Inflation adj.)":{"0":"$13,986,932,579","1":"$13,470,274,302","2":"$13,033,647,123","3":"$12,750,584,155","4":"$12,528,823,971"},"GDP change":{"0":"3.84%","1":"3.35%","2":"2.22%","3":"1.77%","4":"1.00%"},"GDP per capita":{"0":"$4,850","1":"$4,667","2":"$4,509","3":"$4,402","4":"$4,315"},"Pop. change":{"0":"-0.08 %","1":"-0.14 %","2":"-0.20 %","3":"-0.26 %","4":"-0.35 %"},"Population":{"0":2884169,"1":2886438,"2":2890513,"3":2896305,"4":2903790}}
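The currency and percentage columns come back as strings. If the API should return numbers instead, they can be cleaned before export. A sketch assuming the column names and formats shown in the sample output; the two rows are copied from it by hand:

```python
import json

import pandas as pd

df = pd.DataFrame(
    {
        "Year": [2017, 2016],
        "GDP Nominal (Current USD)": ["$13,038,538,300", "$11,883,682,171"],
        "GDP change": ["3.84%", "3.35%"],
    }
)

# Strip "$" and "," and store the amounts as integers.
df["GDP Nominal (Current USD)"] = (
    df["GDP Nominal (Current USD)"]
    .str.replace(r"[$,]", "", regex=True)
    .astype("int64")
)
# Strip "%" and store the growth rate as a float (in percent).
df["GDP change"] = df["GDP change"].str.rstrip("%").astype(float)

# orient="records" yields one JSON object per table row.
records = json.loads(df.to_json(orient="records"))
print(records)
```

Whether to convert at all depends on the API's consumers; keeping the original strings preserves the site's formatting.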
CodePudding user response:
The table extraction is mostly answered here, although not the column names.
I have used the same approach, but replaced urllib with requests, which is the more common choice today. I have also used pandas to hold the table and then convert it to JSON easily.
# These libraries are easiest
from bs4 import BeautifulSoup
import requests
import json
import pandas as pd
# Download page
req = requests.get('https://www.worldometers.info/gdp/albania-gdp/',
                   headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(req.content, 'html.parser')
# Extract table
gdp_historic = soup.find(
    'table', class_='table table-striped table-bordered table-hover table-condensed table-list')
table_body = gdp_historic.find('tbody')
data = []
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # Get rid of empty values
# Extract column names
colnames = [heading.text for heading in gdp_historic.find_all('th')]
# Convert to json
pd.DataFrame(data, columns=colnames).to_json()
Output:
'{"Year":{"0":"2017","1":"2016","2":"2015","3":"2014","4":"2013","5":"2012","6":"2011","7":"2010","8":"2009","9":"2008","10":"2007","11":"2006","12":"2005","13":"2004","14":"2003","15":"2002","16":"2001","17":"2000","18":"1999","19":"1998","20":"1997","21":"1996","22":"1995","23":"1994"},"GDP Nominal (Current USD) ":{"0":"$13,038,538,300","1":"$11,883,682,171","2":"$11,386,931,490","3":"$13,228,247,844","4":"$12,776,280,961","5":"$12,319,784,886","6":"$12,890,866,743","7":"$11,926,957,255","8":"$12,044,208,086","9":"$12,881,353,508","10":"$10,677,324,144","11":"$8,896,072,919","12":"$8,052,073,539","13":"$7,184,685,782","14":"$5,611,496,257","15":"$4,348,068,242","16":"$3,922,100,794","17":"$3,480,355,258","18":"$3,212,121,651","19":"$2,545,964,541","20":"$2,258,513,974","21":"$3,199,641,336","22":"$2,392,764,853","23":"$1,880,951,520"},"GDP Real (Inflation adj.) ":
<truncated>