How to scrape specific information on a website

Time:04-08

Here's my script:

import re
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

URLs = ['https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw329303-big-pilots-watch-43.html']

Marques = []
Brands = []
Refs = []
Prices = []
#Carts = []
#Links = []
References = []
Links = []

for url in URLs:

    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")


    Marques.append('IWC')

    Brand = soup.find('span', class_ = 'iwc-buying-options-title').text
    Brand = str(Brand)
    Brand = re.sub("Ajouter à la liste de souhaits", '', Brand)

    Brand = re.sub("\n", '', Brand)
    Brands.append(Brand)

    Prices.append(soup.find('div', class_ = 'iwc-buying-options-price').get_text(strip=True))

    Links.append(url)

    References.append(soup.find('h1', class_ = 'iwc-buying-options-reference').text)

print(Brands)
print(Prices)
print(Links)
print(References)

Unfortunately, Brands gives me this: [" Grande Montre d'Aviateur\xa043 "]

References gives me this: ['\n IW329303\n ']

And Prices gives me nothing. I think it's because the element contains no text, as you can see:

print(soup.find('div', class_ = 'iwc-buying-options-price'))
<div ></div>

Any ideas on how to do that?

I would like this output:

[image: desired output]

CodePudding user response:

You'll want to use .strip() to get rid of that whitespace.

For example: Brand = soup.find('span', class_ = 'iwc-buying-options-title').text.strip()
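If the non-breaking space (\xa0) inside the title is also unwanted, a small helper can normalize all whitespace in one step. This is only a sketch: clean_text is a hypothetical name that is not part of the original script, and the sample strings are the ones shown in the question.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    # For str patterns, \s matches Unicode whitespace, which includes
    # newlines and the non-breaking space \xa0.
    return re.sub(r"\s+", " ", raw).strip()


print(clean_text(" Grande Montre d'Aviateur\xa043 "))
print(clean_text("\n IW329303\n "))
```

The helper could be applied to the .text of any tag before appending it to Brands or References.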

Price is unfortunately not as easy. The page is dynamic, meaning that the HTML tag does not contain the price in the static response returned by requests. The price is, however, available as JSON in an attribute of another tag:

import re
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import json

URLs = ['https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw329303-big-pilots-watch-43.html']

Marques = []
Brands = []
Refs = []
Prices = []
#Carts = []
#Links = []
References = []
Links = []

for url in URLs:

    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")


    Marques.append('IWC')

    Brand = soup.find('span', class_ = 'iwc-buying-options-title').text.strip()
    Brand = str(Brand)
    Brand = re.sub("Ajouter à la liste de souhaits", '', Brand)

    Brand = re.sub("\n", '', Brand)
    Brands.append(Brand)

    # The price is stored as JSON in the data-tracking-products attribute of the last submit button
    price = json.loads(soup.find_all('button', {'type':'submit'})[-1]['data-tracking-products'])[0]['price']
    Prices.append(price)
    Links.append(url)

    References.append(soup.find('h1', class_ = 'iwc-buying-options-reference').text.strip())

print(Brand)
print(Prices)
print(Links)
print(References)

Output:

Grande Montre d'Aviateur 43                                        
['9100.00']
['https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw329303-big-pilots-watch-43.html']
['IW329303']
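The one-line price lookup in the script raises an exception if the button, the attribute, or the JSON structure ever changes. A more defensive variant might look like the sketch below. The name price_from_tracking and the sample attribute value are assumptions made for illustration; only the 'price' key and the list-of-products shape come from the script above.

```python
import json
from typing import Optional


def price_from_tracking(raw_attr: Optional[str]) -> Optional[str]:
    """Return the first product's price from the JSON attribute, or None."""
    if not raw_attr:
        return None  # attribute missing or empty
    try:
        products = json.loads(raw_attr)
    except json.JSONDecodeError:
        return None  # attribute did not contain valid JSON
    if not isinstance(products, list) or not products:
        return None  # expected a non-empty list of products
    first = products[0]
    if not isinstance(first, dict):
        return None
    price = first.get("price")
    return None if price is None else str(price)


# Hypothetical attribute value, shaped like the one the script reads.
sample = '[{"name": "IW329303", "price": "9100.00"}]'
print(price_from_tracking(sample))
print(price_from_tracking(None))
```

In the loop, the attribute could be read with button.get('data-tracking-products'), which returns None instead of raising when the attribute is absent, and the result passed to this helper.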