This is my first post on Stack Overflow.
I am trying to scrape a website, but I get a result that I don't know how to convert into a dataframe so that it is readable.
from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import requests
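options = webdriver.ChromeOptions()  # was undefined in the original snippet
driver_path = r'C:\Program Files\chromedriver.exe'  # raw string keeps the backslashes literal
driver = webdriver.Chrome(service=Service(driver_path), options=options)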
stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )
url = 'https://ikea.fr'
dic = {"test":[]}
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')  # parse the response body, not the URL string
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
And the result I get is this:
<html><head><meta content="light dark" name="color-scheme"/></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{"SiteName":"manomano.fr","Description":"manomano : tous vos produits de bricolage, rénovation et jardinage au meilleur prix","TopCountryShares":[{"Value":0.9582353790247653,"Country":250},{"Value":0.01428431578726121,"Country":56},{"Value":0.00360626497244031,"Country":756},{"Value":0.001907518367836589,"Country":124},{"Value":0.0016671906973079764,"Country":638}],"Title":"manomano : achat en ligne bricolage, rénovation et jardinage","Engagments":{"BounceRate":"0.39632307436566677","Month":"07","Year":"2022","PagePerVisit":"5.013184701586454","Visits":"1.710036669747373E7","TimeOnSite":"282.97140337977"},"EstimatedMonthlyVisits":{"2022-02-01":18289643,"2022-03-01":20571776,"2022-04-01":19341861,"2022-05-01":21415927,"2022-06-01":18153351,"2022-07-01":17100366},"GlobalRank":{"Rank":2656},"CountryRank":{"Country":250,"Rank":91},"IsSmall":false,"TrafficSources":{"Social":0.00617010722152418,"Paid Referrals":0.03439823545252397,"Mail":0.014748024044393673,"Referrals":0.026006210925393708,"Search":0.6444821136549667,"Direct":0.27419530870119785},"Category":"Home_and_Garden/Home_and_Garden","CategoryRank":{"Rank":"15","Category":"Home_and_Garden/Home_and_Garden"},"LargeScreenshot":"https://site-images.similarcdn.com/image?url=manomano.fr&t=1&h=2a480fb8d6d2298ffef39594ad8d71d65f5dbf8cba53179589d0c69e6aa3fd67"}</pre></body></html>
Any idea how I can transform this data into something readable (e.g. a dataframe)?
Thanks for your help, Sonkar
CodePudding user response:
You don't necessarily need Selenium for this; it can be done with requests and pandas:
import requests
import pandas as pd
header = {
    'Content-Type': 'application/json',
    'Accept': 'application/json',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}
r = requests.get('https://data.similarweb.com/api/v1/data?domain=manomano.fr', headers=header)
# print(r.json())
df = pd.json_normalize(r.json())
print(df)
This returns the following one-row dataframe (you can also extract other parts of the JSON response into dataframes of their own, if you wish):
SiteName Description TopCountryShares Title IsSmall Category LargeScreenshot Engagments.BounceRate Engagments.Month Engagments.Year ... CountryRank.Country CountryRank.Rank TrafficSources.Social TrafficSources.Paid Referrals TrafficSources.Mail TrafficSources.Referrals TrafficSources.Search TrafficSources.Direct CategoryRank.Rank CategoryRank.Category
0 manomano.fr manomano : tous vos produits de bricolage, rén... [{'Value': 0.9582353790247653, 'Country': 250}... manomano : achat en ligne bricolage, rénovatio... False Home_and_Garden/Home_and_Garden https://site-images.similarcdn.com/image?url=m... 0.39632307436566677 07 2022 ... 250 91 0.00617 0.034398 0.014748 0.026006 0.644482 0.274195 15 Home_and_Garden/Home_and_Garden
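As a rough sketch of how the nested parts could be pulled out into their own dataframes: the `data` dictionary below is an abbreviated, hand-copied subset of the JSON from the question, standing in for `r.json()`. The variable names `countries` and `visits` and the column names `Month` and `Visits` are my own choices, not part of the API response.

```python
import pandas as pd

# Abbreviated copy of the JSON shown in the question (assumption: in
# practice this dictionary would come from r.json() in the code above).
data = {
    "SiteName": "manomano.fr",
    "TopCountryShares": [
        {"Value": 0.9582353790247653, "Country": 250},
        {"Value": 0.01428431578726121, "Country": 56},
        {"Value": 0.00360626497244031, "Country": 756},
    ],
    "EstimatedMonthlyVisits": {
        "2022-05-01": 21415927,
        "2022-06-01": 18153351,
        "2022-07-01": 17100366,
    },
}

# A list of flat dicts: one row per dict, with SiteName repeated on each row.
countries = pd.json_normalize(data, record_path="TopCountryShares", meta=["SiteName"])

# A date -> value mapping: build a Series, then turn its index into a column.
visits = (
    pd.Series(data["EstimatedMonthlyVisits"], name="Visits")
    .rename_axis("Month")
    .reset_index()
)
visits["Month"] = pd.to_datetime(visits["Month"])

print(countries)
print(visits)
```

If you would rather stay with Selenium, the page source in the question wraps the JSON in a single `<pre>` element, so something like `json.loads(soup.find("pre").text)` should give you the same dictionary to work from.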