After creating a JSON file, I want to load it into an SQL database file. I tried using the create_engine() function from SQLAlchemy to help with that. After executing the code it gave me a JSONDecodeError, even though I checked the file and it is in correct JSON form.
import requests
from bs4 import BeautifulSoup as bs
from random import randint
from time import sleep
import json
import pandas as pd
from sqlalchemy import create_engine
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}
output = '['
print("[", end='')

for page in range(1, 3):
    req = requests.get('https://www.zomato.com/beirut/deek-duke-ashrafieh/reviews?page=' + str(page) + '&sort=dd&filter=reviews-dd', headers=headers)
    soup = bs(req.text, 'html.parser')
    for tag in soup.find_all('script')[1]:
        if page == 2:
            print(tag[1448:-2])
            output += tag[1448:-2]
        else: print(tag[1448:-2], end=",")
        output += tag[1448:-2]
    sleep(randint(2, 10))

print("]")
output += ']'

with open('json_data.json', 'w') as outfile:
    outfile.write(output)

with open('json_data.json') as f:
    data = json.load(f)

df = pd.DataFrame(data)
engine = create_engine("sqlite:///my_data.db")
df.to_sql("table_name", conn=engine)
CodePudding user response:
You have a few mistakes.

The main problem is that you crop tag in the wrong places, so you get ews": [{"author": ...} but it should be "reviews": [{"author": ...}] or {"author": ...} - and then it makes sense to put it in [ ].

Second: you have wrong indentation, and the second output += ... is outside the else, so it may add the same elements two times.

Third: you print with end=',' but you forgot to add , in output - so the elements are not separated.
You write to a file and read it back from the file, but you could use output directly with json.loads().

It would be simpler to append elements to a normal list and later convert it all to output using join():

output = "[" + ",".join(all_items) + "]"
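A minimal, self-contained sketch of that idea - the two fragments below are hypothetical stand-ins for the cropped <script> contents, not real scraped data:

```python
import json

# hypothetical JSON fragments, like the cropped <script> contents
all_items = [
    '{"author": "A", "ratingValue": 5}',
    '{"author": "B", "ratingValue": 1}',
]

# join with "," and wrap in [ ] to get one valid JSON array string
output = "[" + ",".join(all_items) + "]"

# parse directly - no temporary file needed
data = json.loads(output)
print(data[0]["author"])  # A
```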
This code works for me - but it may need editing of the data to extract details into separate columns.
import json
from time import sleep
from random import randint

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
}

all_items = []

for page in range(1, 3):
    print(f'--- page: {page} ---')
    url = f'https://www.zomato.com/beirut/deek-duke-ashrafieh/reviews?page={page}&sort=dd&filter=reviews-dd'
    response = requests.get(url, headers=headers)
    soup = bs(response.text, 'html.parser')
    for tag in soup.find_all('script')[1]:
        item = tag[1454:-2]
        all_items.append(item)
        print(item)
        print('--------')
    sleep(randint(2, 10))

# --- after loop ---

output = '[' + ',\n'.join(all_items) + ']'
data = json.loads(output)

df = pd.DataFrame(data)
print(df)
print(df.columns)
Result:
author ... reviewRating
0 Youssef Semaan ... {'@type': 'Rating', 'ratingValue': 1, 'bestRat...
1 Doublethatfood ... {'@type': 'Rating', 'ratingValue': 5, 'bestRat...
2 Zeina El Zein ... {'@type': 'Rating', 'ratingValue': 4, 'bestRat...
3 Leilajoheir ... {'@type': 'Rating', 'ratingValue': 4, 'bestRat...
4 Eliane Lteif ... {'@type': 'Rating', 'ratingValue': 4, 'bestRat...
5 Rabih Hjeily ... {'@type': 'Rating', 'ratingValue': 5, 'bestRat...
6 Rami Bazzal ... {'@type': 'Rating', 'ratingValue': 1, 'bestRat...
7 Hisham Buteen ... {'@type': 'Rating', 'ratingValue': 1, 'bestRat...
8 Engred Bou Aoun ... {'@type': 'Rating', 'ratingValue': 1, 'bestRat...
9 Ahmad Harb ... {'@type': 'Rating', 'ratingValue': 4, 'bestRat...
[10 rows x 4 columns]
Index(['author', 'url', 'description', 'reviewRating'], dtype='object')
EDIT:

After adding

df = df.join(df['reviewRating'].apply(pd.Series))

I get the details from reviewRating in separate columns:
author url ... bestRating worstRating
0 Youssef Semaan https://www.zoma.to/qvQeaRp ... 5 0
1 Doublethatfood https://www.zoma.to/kaMAPOn ... 5 0
2 Zeina El Zein https://www.zoma.to/BNMYBar ... 5 0
3 Leilajoheir https://www.zoma.to/ddGNypx ... 5 0
4 Eliane Lteif https://www.zoma.to/oQdmkyE ... 5 0
5 Rabih Hjeily https://www.zoma.to/eaOrzBn ... 5 0
6 Rami Bazzal https://www.zoma.to/YKGYppO ... 5 0
7 Hisham Buteen https://www.zoma.to/ZPodbDv ... 5 0
8 Engred Bou Aoun https://www.zoma.to/ddqNLqb ... 5 0
9 Ahmad Harb https://www.zoma.to/QbAvDZQ ... 5 0
[10 rows x 8 columns]
Index(['author', 'url', 'description', 'reviewRating', '@type', 'ratingValue', 'bestRating', 'worstRating'], dtype='object')
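The .apply(pd.Series) trick converts each dictionary in the column into a row of a new DataFrame, which join() then attaches column by column. A toy example with made-up data (the dicts below only mimic the shape of reviewRating):

```python
import pandas as pd

# toy frame with a dictionary column, like reviewRating
df = pd.DataFrame({
    "author": ["A", "B"],
    "reviewRating": [
        {"ratingValue": 5, "bestRating": 5},
        {"ratingValue": 1, "bestRating": 5},
    ],
})

# every dict becomes a Series, so its keys become new columns
df = df.join(df["reviewRating"].apply(pd.Series))
print(df.columns.tolist())
```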
EDIT:

To add the table to the database you have to use con= instead of conn=.

For SQLite you have to drop the column reviewRating, because it holds a dictionary and SQLite has a problem with this type of data.

df = df.drop(columns='reviewRating')

If you want to replace an existing table then you also need if_exists='replace'. If you want to add new data to an existing table then you also need if_exists='append'.
from sqlalchemy import create_engine
df = pd.DataFrame(data)
df = df.join(df['reviewRating'].apply(pd.Series))
df = df.drop(columns='reviewRating')
engine = create_engine("sqlite:///my_data.db")
df.to_sql("table_name", con=engine, if_exists='replace')
#df.to_sql("table_name", con=engine, if_exists='append')
For Microsoft SQL Server you need a different value in create_engine() - see SQLAlchemy: Microsoft SQL Server.
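As a rough sketch of what such a URL looks like - the server, instance, database, credentials and driver version below are all placeholders, not values from this question:

```python
# sketch of a SQL Server URL for create_engine() - everything here
# is a placeholder; the pyodbc package must be installed to use it
driver = "ODBC Driver 17 for SQL Server"
url = (
    "mssql+pyodbc://user:password@SERVER\\INSTANCE/my_database"
    "?driver=" + driver.replace(" ", "+")
)
print(url)
# engine = create_engine(url)
```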