Webscrape e-commerce

Time:10-03

I'm a beginner at web scraping with Python, but I need to use it frequently.

I'm trying to scrape a mobile phone e-shop to get each item's name and price.

website: https://shop.orange.eg/en/mobiles-and-devices?IsMobile=false

My code, which sets a User-Agent header, is below:

import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://shop.orange.eg/en/mobiles-and-devices?IsMobile=false'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"}

web_page = requests.get(url,headers=headers)
soup = BeautifulSoup(web_page.content, "html.parser")

product_list = soup.find_all('div', class_='col-md-6 col-lg-4 mb-4')

product_list

Output: [] (an empty list)

I'm not sure I'm doing this right. Also, when I look at the page source, I can't find the product information.

CodePudding user response:

That page loads an initial shell and is then hydrated from an API that returns HTML fragments. Here is one way to get the products sold by Orange Egypt:

from bs4 import BeautifulSoup as bs
import requests
from tqdm import tqdm ## if using jupyter notebook, import as: from tqdm.notebook import tqdm
import pandas as pd

headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
    }

s = requests.Session()
s.headers.update(headers)
big_list = []
for x in tqdm(range(1, 16)):
    url = f'https://shop.orange.eg/en/catalog/ListCategoryProducts?IsMobile=false&pagenumber={x}&categoryId=24'
    r = s.get(url)
    soup = bs(r.text, 'html.parser')
    devices = soup.select('[class^="card device-card"]')
    for d in devices:
        # Title, price and link are all attributes of the same anchor, so select it once
        a = d.select_one('h4[class^="card-title"] a[name="ancProduct"]')
        big_list.append((a.get('title'), a.get('data-gtm-click-price'), a.get('href')))
df = pd.DataFrame(big_list, columns = ['Product', 'Price', 'Url'])
print(df)

Result:

     Product                               Price       Url
0    Samsung Galaxy Z Fold4 5G             46690.0000  //shop.orange.eg/en/mobiles/samsung-mobiles/samsung-galaxy-z-fold4-5g
1    ASUS Vivobook Flip 14                 9999.0000   //shop.orange.eg/en/devices/tablets-and-laptops/asus-vivobook-flip-14
2    Acer Aspire 3 A315-56                 7299.0000   //shop.orange.eg/en/devices/tablets-and-laptops/acer-aspire-3-a315-56
3    Lenovo IdeaPad 3 15IGL05              5777.0000   //shop.orange.eg/en/devices/tablets-and-laptops/lenovo-tablets/lenovo-ideapad-3-15igl05
4    Lenovo IdeaPad Flex 5                 16199.0000  //shop.orange.eg/en/devices/tablets-and-laptops/lenovo-tablets/lenovo-ideapad-flex-5
...  ...                                   ...         ...
171  Eufy P1 Scale Wireless Smart Digital  699.0000    //shop.orange.eg/en/devices/accessories/scale-wireless/eufy-p1-scale-wireless-smart-digital
172  Samsung Smart TV 50AU7000             9225.0000   //shop.orange.eg/en/devices/smart-tv/samsung-tv-50tu7000
173  Samsung Smart TV 43T5300              6999.0000   //shop.orange.eg/en/devices/smart-tv/samsung-tv-43t5300
174  Samsung Galaxy A22                    4460.0000   //shop.orange.eg/en/mobiles/samsung-mobiles/samsung-galaxy-a22
175  Eufy eufycam 2 2 plus 1 kit           4999.0000   //shop.orange.eg/en/devices/accessories/camera-wireless/eufy-eufycam-2-2-plus-1-kit
176 rows × 3 columns
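
Note that the Price values are strings such as 46690.0000 and the Url values are protocol-relative (they start with //). If numeric prices and absolute links are needed, something like the following may help. It is only a sketch using the standard library: normalize_row is a hypothetical helper name, and the https base URL is an assumption.

```python
from decimal import Decimal
from urllib.parse import urljoin

BASE_URL = "https://shop.orange.eg/"  # assumption: the shop is served over https


def normalize_row(title, price, link):
    """Hypothetical helper: turn the scraped strings into cleaner values."""
    # Decimal keeps the exact value of strings such as '46690.0000'
    price_value = Decimal(price) if price else None
    # urljoin resolves a protocol-relative '//host/path' against the base scheme
    full_link = urljoin(BASE_URL, link) if link else None
    return title, price_value, full_link


row = normalize_row(
    "Samsung Galaxy Z Fold4 5G",
    "46690.0000",
    "//shop.orange.eg/en/mobiles/samsung-mobiles/samsung-galaxy-z-fold4-5g",
)
print(row)
```

Each tuple could be passed through such a helper before it is appended to big_list, so the DataFrame is built from cleaned values.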

For TQDM visit https://pypi.org/project/tqdm/

For Requests documentation, see https://requests.readthedocs.io/en/latest/

Also for pandas: https://pandas.pydata.org/pandas-docs/stable/index.html

And for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
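
The loop above hard-codes pages 1 to 15. If the number of pages changes, one option is to keep requesting pages until one comes back with no products. This is a rough sketch, assuming the endpoint returns a page without device cards once the catalog is exhausted (not verified here). fetch_page is a hypothetical callable that returns the list of items scraped from a given page number, and collect_all_pages is likewise an illustrative name.

```python
def collect_all_pages(fetch_page, max_pages=100):
    """Call fetch_page(1), fetch_page(2), ... until a page yields no items.

    fetch_page is a hypothetical callable returning a list of scraped items.
    max_pages is a safety cap so a misbehaving endpoint cannot loop forever.
    """
    items = []
    for page in range(1, max_pages + 1):
        page_items = fetch_page(page)
        if not page_items:  # an empty page is taken to mean "no more products"
            break
        items.extend(page_items)
    return items


# Stand-in for the real request, so the sketch runs without network access
fake_catalog = {1: ["phone A", "phone B"], 2: ["laptop C"]}
result = collect_all_pages(lambda page: fake_catalog.get(page, []))
print(result)
```

In the real scraper, fetch_page would wrap the s.get(...) call and the soup.select(...) parsing for one page.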

CodePudding user response:

The page content is loaded dynamically via AJAX from a separate endpoint, so you have to request that API URL instead.

Example:

import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://shop.orange.eg/en/mobiles-and-devices?IsMobile=false'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"}
ajax_url = 'https://shop.orange.eg/en/catalog/ListCategoryProducts'
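params = {
    'IsMobile':'false',
    'pagenumber': '1',  # overwritten inside the loop below
    'categoryId': '24'
}

# range(1, 2) fetches only page 1; raise the upper bound to fetch more pages
for page in range(1, 2):
    params['pagenumber'] = page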
    web_page = requests.get(ajax_url,headers=headers,params=params)
    time.sleep(5)
    soup = BeautifulSoup(web_page.content, "html.parser")

    product_list = soup.find_all('div', class_='col-md-6 col-lg-4 mb-4')

    for product in product_list:
        title=product.h4.get_text(strip=True)
        print(title)

Output:

Samsung MobilesSamsung Galaxy Z Fold4 5G
Tablets and LaptopsASUS Vivobook Flip 14
Tablets and LaptopsAcer Aspire 3 A315-56
Lenovo TabletsLenovo IdeaPad 3 15IGL05
Lenovo TabletsLenovo IdeaPad Flex 5
Samsung MobilesSamsung Galaxy S22 Ultra 5G
WearablesApple Watch Series 7
Samsung MobilesSamsung Galaxy Note 20 Ultra
GamingLenovo IdeaPad Gaming 3
Tablets and LaptopsSamsung Galaxy Tab S8 5G
Wireless ChargerLanex Charger Wireless Magnetic 3-in-1 15W
AccessoriesAnker Sound core R100
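
In the output above the category and the product name run together (for example "Samsung MobilesSamsung Galaxy Z Fold4 5G") because get_text(strip=True) joins the text nodes inside the h4 with no separator. Passing a separator, or reading only the link text, may give cleaner titles. The HTML fragment below is a made-up, simplified stand-in, since the real card markup is not shown here.

```python
from bs4 import BeautifulSoup

# Assumed, simplified markup: a category label followed by the product link
html = (
    '<h4 class="card-title"><span>Samsung Mobiles</span>'
    '<a href="#">Samsung Galaxy Z Fold4 5G</a></h4>'
)
h4 = BeautifulSoup(html, "html.parser").h4

fused = h4.get_text(strip=True)          # text nodes joined with no separator
spaced = h4.get_text(" | ", strip=True)  # text nodes joined with ' | '
name_only = h4.a.get_text(strip=True)    # only the text of the link

print(fused)
print(spaced)
print(name_only)
```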