Transform data that I scraped with BeautifulSoup

Time:08-08

This is my first post on Stack Overflow.

I am trying to scrape a website, but I get a result that I don't know how to convert into a DataFrame so that it is readable.

from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import requests

driver_path = r'C:\Program Files\chromedriver.exe'  # raw string keeps the backslashes literal
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=Service(driver_path), options=options)


stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )

url = 'https://ikea.fr'

dic = {"test":[]}

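# Fetch the page with requests and parse the response body (not the URL string)
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# Load the same page in the browser and parse the rendered source
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')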

And the result I get is this:

<html><head><meta content="light dark" name="color-scheme"/></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{"SiteName":"manomano.fr","Description":"manomano : tous vos produits de bricolage, rénovation et jardinage au meilleur prix","TopCountryShares":[{"Value":0.9582353790247653,"Country":250},{"Value":0.01428431578726121,"Country":56},{"Value":0.00360626497244031,"Country":756},{"Value":0.001907518367836589,"Country":124},{"Value":0.0016671906973079764,"Country":638}],"Title":"manomano : achat en ligne bricolage, rénovation et jardinage","Engagments":{"BounceRate":"0.39632307436566677","Month":"07","Year":"2022","PagePerVisit":"5.013184701586454","Visits":"1.710036669747373E7","TimeOnSite":"282.97140337977"},"EstimatedMonthlyVisits":{"2022-02-01":18289643,"2022-03-01":20571776,"2022-04-01":19341861,"2022-05-01":21415927,"2022-06-01":18153351,"2022-07-01":17100366},"GlobalRank":{"Rank":2656},"CountryRank":{"Country":250,"Rank":91},"IsSmall":false,"TrafficSources":{"Social":0.00617010722152418,"Paid Referrals":0.03439823545252397,"Mail":0.014748024044393673,"Referrals":0.026006210925393708,"Search":0.6444821136549667,"Direct":0.27419530870119785},"Category":"Home_and_Garden/Home_and_Garden","CategoryRank":{"Rank":"15","Category":"Home_and_Garden/Home_and_Garden"},"LargeScreenshot":"https://site-images.similarcdn.com/image?url=manomano.fr&amp;t=1&amp;h=2a480fb8d6d2298ffef39594ad8d71d65f5dbf8cba53179589d0c69e6aa3fd67"}</pre></body></html>

Any idea how I can transform this data into something readable (e.g. a DataFrame)?

Thanks for your help, Sonkar

CodePudding user response:

You don't necessarily need Selenium for this; it can be done with requests and pandas:

import requests
import pandas as pd

header = {
    'Content-Type': 'application/json',
    'Accept': 'application/json',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}
r = requests.get('https://data.similarweb.com/api/v1/data?domain=manomano.fr', headers=header)
# print(r.json())
df = pd.json_normalize(r.json())
print(df)

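This returns a one-row DataFrame in which nested objects are flattened into dotted column names, while list-valued fields such as TopCountryShares stay in a single cell. You can extract other parts of the JSON response into DataFrames of their own if you wish. The output looks like this: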

SiteName    Description TopCountryShares    Title   IsSmall Category    LargeScreenshot Engagments.BounceRate   Engagments.Month    Engagments.Year ... CountryRank.Country CountryRank.Rank    TrafficSources.Social   TrafficSources.Paid Referrals   TrafficSources.Mail TrafficSources.Referrals    TrafficSources.Search   TrafficSources.Direct   CategoryRank.Rank   CategoryRank.Category
0   manomano.fr manomano : tous vos produits de bricolage, rén...   [{'Value': 0.9582353790247653, 'Country': 250}...   manomano : achat en ligne bricolage, rénovatio...   False   Home_and_Garden/Home_and_Garden https://site-images.similarcdn.com/image?url=m...   0.39632307436566677 07  2022    ... 250 91  0.00617 0.034398    0.014748    0.026006    0.644482    0.274195    15  Home_and_Garden/Home_and_Garden
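
If you would rather keep Selenium, the JSON can probably be recovered from the <pre> element of the page source and split into separate tables. The following is only a minimal sketch under assumptions: page_source is a shortened, hard-coded stand-in for driver.page_source built from the output in the question, and the names summary, country_shares and monthly_visits are illustrative, not part of any API.

```python
import json

import pandas as pd
from bs4 import BeautifulSoup

# Assumption: shortened stand-in for driver.page_source.
# The browser wraps the raw JSON response in a <pre> element.
page_source = (
    '<html><head></head><body><pre>'
    '{"SiteName":"manomano.fr",'
    '"TopCountryShares":[{"Value":0.9582353790247653,"Country":250},'
    '{"Value":0.01428431578726121,"Country":56}],'
    '"EstimatedMonthlyVisits":{"2022-06-01":18153351,"2022-07-01":17100366},'
    '"GlobalRank":{"Rank":2656}}'
    '</pre></body></html>'
)

# 1. Recover the JSON text from the <pre> element and decode it.
soup = BeautifulSoup(page_source, "html.parser")
data = json.loads(soup.find("pre").get_text())

# 2. Flatten the top-level record; nested dicts become dotted column names.
summary = pd.json_normalize(data)

# 3. A list of records becomes one row per record.
country_shares = pd.DataFrame(data["TopCountryShares"])

# 4. A date -> value mapping becomes a two-column table.
monthly_visits = pd.DataFrame(
    list(data["EstimatedMonthlyVisits"].items()), columns=["Month", "Visits"]
)
monthly_visits["Month"] = pd.to_datetime(monthly_visits["Month"])

print(summary["GlobalRank.Rank"].iloc[0])
print(country_shares)
print(monthly_visits)
```

With the live response, the same steps should apply to r.json() from the requests example, in which case steps 3 and 4 can start directly from that dictionary and BeautifulSoup is not needed.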