After creating a JSON file, I want to load it into an SQL database file. I tried using the create_engine() function from SQLAlchemy to help with that. After executing the code it gave me a JSONDecodeError, even though I checked the file and it is in correct JSON form.
import requests
from bs4 import BeautifulSoup as bs
from random import randint
from time import sleep
import json
import pandas as pd
from sqlalchemy import create_engine
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}
output = '['
print("[", end='')

for page in range(1, 3):
    req = requests.get('https://www.zomato.com/beirut/deek-duke-ashrafieh/reviews?page=' + str(page) + '&sort=dd&filter=reviews-dd', headers=headers)
    soup = bs(req.text, 'html.parser')
    for tag in soup.find_all('script')[1]:
        if page == 2:
            print(tag[1448:-2])
            output += tag[1448:-2]
        else: print(tag[1448:-2], end=",")
        output += tag[1448:-2]
    sleep(randint(2, 10))

print("]")
output += ']'

with open('json_data.json', 'w') as outfile:
    outfile.write(output)

with open('json_data.json') as f:
    data = json.load(f)

df = pd.DataFrame(data)
engine = create_engine("sqlite:///my_data.db")
df.to_sql("table_name", conn=engine)
CodePudding user response:
You have a few mistakes.

The main problem is that you crop tag in the wrong places, so you get ews": [{"author": ...} but it should be "reviews": [{"author": ...}] or {"author": ...} - and then it makes sense to put it in [ ].

Second: you have wrong indentation, and the second output += ... is outside the else, so it may add the same elements two times.

Third: you print with end=',' but you forgot to add , in output - so the elements are not separated.
You write to a file and read it back from the file, but you could use output directly with json.loads().

It would be simpler to append elements to a normal list and later convert it all to output using join():

output = "[" + ",".join(all_items) + "]"
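A minimal, self-contained sketch of that idea - the two fragments below are hypothetical stand-ins for the cropped <script> contents, not real scraped data:

```python
import json

# hypothetical JSON fragments, like the cropped <script> contents
all_items = [
    '{"author": "A", "ratingValue": 5}',
    '{"author": "B", "ratingValue": 1}',
]

# join with "," and wrap in [ ] to get one valid JSON array string
output = "[" + ",".join(all_items) + "]"

# parse directly - no temporary file needed
data = json.loads(output)
print(data[0]["author"])  # A
```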
This code works for me - but it may need editing of the data to extract details into separate columns.
import json
from time import sleep
from random import randint

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
}

all_items = []

for page in range(1, 3):
    print(f'--- page: {page} ---')
    url = f'https://www.zomato.com/beirut/deek-duke-ashrafieh/reviews?page={page}&sort=dd&filter=reviews-dd'
    response = requests.get(url, headers=headers)
    soup = bs(response.text, 'html.parser')
    for tag in soup.find_all('script')[1]:
        item = tag[1454:-2]
        all_items.append(item)
        print(item)
        print('--------')
    sleep(randint(2, 10))

# --- after loop ---

output = '[' + ',\n'.join(all_items) + ']'
data = json.loads(output)

df = pd.DataFrame(data)
print(df)
print(df.columns)
Result:
author ... reviewRating
0 Youssef Semaan ... {'@type': 'Rating', 'ratingValue': 1, 'bestRat...
1 Doublethatfood ... {'@type': 'Rating', 'ratingValue': 5, 'bestRat...
2 Zeina El Zein ... {'@type': 'Rating', 'ratingValue': 4, 'bestRat...
3 Leilajoheir ... {'@type': 'Rating', 'ratingValue': 4, 'bestRat...
4 Eliane Lteif ... {'@type': 'Rating', 'ratingValue': 4, 'bestRat...
5 Rabih Hjeily ... {'@type': 'Rating', 'ratingValue': 5, 'bestRat...
6 Rami Bazzal ... {'@type': 'Rating', 'ratingValue': 1, 'bestRat...
7 Hisham Buteen ... {'@type': 'Rating', 'ratingValue': 1, 'bestRat...
8 Engred Bou Aoun ... {'@type': 'Rating', 'ratingValue': 1, 'bestRat...
9 Ahmad Harb ... {'@type': 'Rating', 'ratingValue': 4, 'bestRat...
[10 rows x 4 columns]
Index(['author', 'url', 'description', 'reviewRating'], dtype='object')
EDIT:

After adding

df = df.join(df['reviewRating'].apply(pd.Series))

I get the details from reviewRating in separate columns:
author url ... bestRating worstRating
0 Youssef Semaan https://www.zoma.to/qvQeaRp ... 5 0
1 Doublethatfood https://www.zoma.to/kaMAPOn ... 5 0
2 Zeina El Zein https://www.zoma.to/BNMYBar ... 5 0
3 Leilajoheir https://www.zoma.to/ddGNypx ... 5 0
4 Eliane Lteif https://www.zoma.to/oQdmkyE ... 5 0
5 Rabih Hjeily https://www.zoma.to/eaOrzBn ... 5 0
6 Rami Bazzal https://www.zoma.to/YKGYppO ... 5 0
7 Hisham Buteen https://www.zoma.to/ZPodbDv ... 5 0
8 Engred Bou Aoun https://www.zoma.to/ddqNLqb ... 5 0
9 Ahmad Harb https://www.zoma.to/QbAvDZQ ... 5 0
[10 rows x 8 columns]
Index(['author', 'url', 'description', 'reviewRating', '@type', 'ratingValue', 'bestRating', 'worstRating'], dtype='object')
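The .apply(pd.Series) trick converts each dictionary in the column into a row of a new DataFrame, which join() then attaches column by column. A toy example with made-up data (the dicts below only mimic the shape of reviewRating):

```python
import pandas as pd

# toy frame with a dictionary column, like reviewRating
df = pd.DataFrame({
    "author": ["A", "B"],
    "reviewRating": [
        {"ratingValue": 5, "bestRating": 5},
        {"ratingValue": 1, "bestRating": 5},
    ],
})

# every dict becomes a Series, so its keys become new columns
df = df.join(df["reviewRating"].apply(pd.Series))
print(df.columns.tolist())
```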
EDIT:

To add the table to the database you have to use con= instead of conn=.

For SQLite you have to drop the column reviewRating, because it holds a dictionary and SQLite has a problem with this type of data.

df = df.drop(columns='reviewRating')

If you want to replace an existing table then you also need if_exists='replace'. If you want to add new data to an existing table then you also need if_exists='append'.
from sqlalchemy import create_engine
df = pd.DataFrame(data)
df = df.join(df['reviewRating'].apply(pd.Series))
df = df.drop(columns='reviewRating')
engine = create_engine("sqlite:///my_data.db")
df.to_sql("table_name", con=engine, if_exists='replace')
#df.to_sql("table_name", con=engine, if_exists='append')
For Microsoft SQL Server you need a different value in create_engine() - see SQLAlchemy: Microsoft SQL Server.
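As a rough sketch of what such a URL looks like - the server, instance, database, credentials and driver version below are all placeholders, not values from this question:

```python
# sketch of a SQL Server URL for create_engine() - everything here
# is a placeholder; the pyodbc package must be installed to use it
driver = "ODBC Driver 17 for SQL Server"
url = (
    "mssql+pyodbc://user:password@SERVER\\INSTANCE/my_database"
    "?driver=" + driver.replace(" ", "+")
)
print(url)
# engine = create_engine(url)
```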