How can I convert the beautiful soup text to JSON object?

Time:04-07

I'm trying to convert the data I scrape from the URL below into a JSON object.

import bs4 as bs
from urllib.request import Request, urlopen
import json

req = Request('https://www.worldometers.info/gdp/albania-gdp/',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = bs.BeautifulSoup(webpage, 'html.parser')

gdp = soup.select_one('span[style="margin-right:7px"]')
# print('gdp:', type(gdp.text))

gdp_growth_rate = soup.find('br').next_sibling
# print('gdp growth_rate', type(gdp_growth_rate.text))

gdp_historic = soup.find(
    'table', class_='table table-striped table-bordered table-hover table-condensed table-list')
# print('gdp historic:', type(gdp_historic.text), sep='\n')

The idea is to convert the data I get from the table to JSON. The goal is to expose it through an API.

CodePudding user response:

In general

How can I convert the beautiful soup text to json object?

You can convert any JSON-serializable Python object (dict, list, tuple, str, int, float, bool, None) into a JSON string with the json.dumps() function:

json.dumps(
    dict(
        gdp = soup.select_one('span[style="margin-right:7px"]').text
    )
)

Output:

{"gdp": "$13,038,538,300"}

Table to JSON

In my opinion, the best way to scrape a basic table is pandas.read_html(). It parses the HTML with lxml or BeautifulSoup under the hood and returns DataFrames, which can be exported to several formats, e.g. with .to_json().

Your question does not make clear which JSON layout you expect.

DataFrame.to_json() takes an orient parameter that controls the layout, so you can pick the one that fits your needs. The default for a DataFrame is 'columns', which produces a dict-like structure {column -> {index -> value}}.
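
To make the difference between layouts concrete, here is a small sketch using only the standard library. It reshapes the default 'columns' layout into the row-wise list that orient='records' would produce. The function name columns_to_records is hypothetical, and the two-row excerpt uses values from the Albania GDP table.

```python
import json

# Two-row excerpt in the default 'columns' layout: {column -> {index -> value}}
columns_json = '{"Year":{"0":2017,"1":2016},"GDP change":{"0":"3.84%","1":"3.35%"}}'


def columns_to_records(columns):
    """Reshape {column -> {index -> value}} into [{column -> value}, ...]."""
    if not columns:
        return []
    # Every column shares the same index keys, so take them from the first one
    first_column = next(iter(columns.values()))
    return [
        {name: values[index] for name, values in columns.items()}
        for index in first_column
    ]


records = columns_to_records(json.loads(columns_json))
print(json.dumps(records))
```

With pandas itself, calling .to_json(orient='records') on the DataFrame gives this row-wise layout directly, which is often the more convenient shape for an API response.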

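Example

import pandas as pd
import requests
from io import StringIO

# Download the page with a browser-like User-Agent
html = requests.get('https://www.worldometers.info/gdp/albania-gdp/',
                    headers={'User-Agent': 'Mozilla/5.0'}).text

# Recent pandas versions deprecate passing a raw HTML string, so wrap it
# in StringIO; [1] selects the second table found on the page
pd.read_html(StringIO(html))[1].to_json()

Output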

First 5 rows as sample.

{"Year":{"0":2017,"1":2016,"2":2015,"3":2014,"4":2013},"GDP Nominal (Current USD)":{"0":"$13,038,538,300","1":"$11,883,682,171","2":"$11,386,931,490","3":"$13,228,247,844","4":"$12,776,280,961"},"GDP Real (Inflation adj.)":{"0":"$13,986,932,579","1":"$13,470,274,302","2":"$13,033,647,123","3":"$12,750,584,155","4":"$12,528,823,971"},"GDP change":{"0":"3.84%","1":"3.35%","2":"2.22%","3":"1.77%","4":"1.00%"},"GDP per capita":{"0":"$4,850","1":"$4,667","2":"$4,509","3":"$4,402","4":"$4,315"},"Pop. change":{"0":"-0.08 %","1":"-0.14 %","2":"-0.20 %","3":"-0.26 %","4":"-0.35 %"},"Population":{"0":2884169,"1":2886438,"2":2890513,"3":2896305,"4":2903790}}

CodePudding user response:

The table extraction is mostly answered here, although not the column names.

I have used the same approach, but since you are using the lower-level urllib, this version uses requests, which is more convenient. I have also used pandas to hold the table so that it can be exported to JSON easily.

# These libraries are easiest
from bs4 import BeautifulSoup
import requests
import json
import pandas as pd

# Download page
req = requests.get('https://www.worldometers.info/gdp/albania-gdp/',
              headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(req.content, 'html.parser')

# Extract table
gdp_historic = soup.find(
    'table', class_='table table-striped table-bordered table-hover table-condensed table-list')

table_body = gdp_historic.find('tbody')
data = []
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values


# Extract column names
colnames = [heading.text for heading in gdp_historic.find_all('th')]


# Convert to json
pd.DataFrame(data, columns=colnames).to_json()

Output:

'{"Year":{"0":"2017","1":"2016","2":"2015","3":"2014","4":"2013","5":"2012","6":"2011","7":"2010","8":"2009","9":"2008","10":"2007","11":"2006","12":"2005","13":"2004","14":"2003","15":"2002","16":"2001","17":"2000","18":"1999","19":"1998","20":"1997","21":"1996","22":"1995","23":"1994"},"GDP Nominal (Current USD) ":{"0":"$13,038,538,300","1":"$11,883,682,171","2":"$11,386,931,490","3":"$13,228,247,844","4":"$12,776,280,961","5":"$12,319,784,886","6":"$12,890,866,743","7":"$11,926,957,255","8":"$12,044,208,086","9":"$12,881,353,508","10":"$10,677,324,144","11":"$8,896,072,919","12":"$8,052,073,539","13":"$7,184,685,782","14":"$5,611,496,257","15":"$4,348,068,242","16":"$3,922,100,794","17":"$3,480,355,258","18":"$3,212,121,651","19":"$2,545,964,541","20":"$2,258,513,974","21":"$3,199,641,336","22":"$2,392,764,853","23":"$1,880,951,520"},"GDP Real  (Inflation adj.) ":
<truncated>
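
If pandas is not needed for anything else, the rows and column names collected above can also be turned into JSON with the standard library alone. This is a minimal sketch, assuming data and colnames have the shapes built in the loop above. The sample values are copied from the output, not scraped live, and the name headers is introduced here for illustration.

```python
import json

# Sample values copied from the output above; in the real script these
# come from the <th> and <td> elements of the table
colnames = ['Year', 'GDP Nominal (Current USD) ']
data = [['2017', '$13,038,538,300'], ['2016', '$11,883,682,171']]

# Strip stray whitespace from the headers, then pair each row with them
headers = [name.strip() for name in colnames]
records = [dict(zip(headers, row)) for row in data]

print(json.dumps(records, indent=2))
```

This yields one JSON object per table row and removes the trailing spaces that appear in the column names of the output above.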