I'm a beginner at web scraping with Python, but I need to use it frequently.
I'm trying to scrape an e-shop for mobiles to get each item's name and price.
Website: https://shop.orange.eg/en/mobiles-and-devices?IsMobile=false
My code, which sets a User-Agent header, is below:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://shop.orange.eg/en/mobiles-and-devices?IsMobile=false'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"}
web_page = requests.get(url,headers=headers)
soup = BeautifulSoup(web_page.content, "html.parser")
product_list = soup.find_all('div', class_='col-md-6 col-lg-4 mb-4')
product_list
Output: [] (an empty list)
I'm not sure I'm doing this right. Also, when I look at the page source, I can't find the product information.
CodePudding user response:
That page loads an initial shell and is then hydrated from an API that returns HTML. Here is one way to get the products sold by Orange Egypt:
from bs4 import BeautifulSoup as bs
import requests
from tqdm import tqdm ## if using jupyter notebook, import as: from tqdm.notebook import tqdm
import pandas as pd
headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
# Reuse one session so the headers are sent with every request
s = requests.Session()
s.headers.update(headers)
big_list = []
# range(1, 16) requests pages 1 through 15 of the category
for x in tqdm(range(1, 16)):
    url = f'https://shop.orange.eg/en/catalog/ListCategoryProducts?IsMobile=false&pagenumber={x}&categoryId=24'
    r = s.get(url)
    soup = bs(r.text, 'html.parser')
    devices = soup.select('[class^="card device-card"]')
    for d in devices:
        # The product anchor carries the title, price and link as attributes
        anchor = d.select_one('h4[class^="card-title"] a[name="ancProduct"]')
        product_title = anchor.get('title')
        product_price = anchor.get('data-gtm-click-price')
        product_link = anchor.get('href')
        big_list.append((product_title, product_price, product_link))
df = pd.DataFrame(big_list, columns=['Product', 'Price', 'Url'])
print(df)
Result:
Product Price Url
0 Samsung Galaxy Z Fold4 5G 46690.0000 //shop.orange.eg/en/mobiles/samsung-mobiles/samsung-galaxy-z-fold4-5g
1 ASUS Vivobook Flip 14 9999.0000 //shop.orange.eg/en/devices/tablets-and-laptops/asus-vivobook-flip-14
2 Acer Aspire 3 A315-56 7299.0000 //shop.orange.eg/en/devices/tablets-and-laptops/acer-aspire-3-a315-56
3 Lenovo IdeaPad 3 15IGL05 5777.0000 //shop.orange.eg/en/devices/tablets-and-laptops/lenovo-tablets/lenovo-ideapad-3-15igl05
4 Lenovo IdeaPad Flex 5 16199.0000 //shop.orange.eg/en/devices/tablets-and-laptops/lenovo-tablets/lenovo-ideapad-flex-5
... ... ... ...
171 Eufy P1 Scale Wireless Smart Digital 699.0000 //shop.orange.eg/en/devices/accessories/scale-wireless/eufy-p1-scale-wireless-smart-digital
172 Samsung Smart TV 50AU7000 9225.0000 //shop.orange.eg/en/devices/smart-tv/samsung-tv-50tu7000
173 Samsung Smart TV 43T5300 6999.0000 //shop.orange.eg/en/devices/smart-tv/samsung-tv-43t5300
174 Samsung Galaxy A22 4460.0000 //shop.orange.eg/en/mobiles/samsung-mobiles/samsung-galaxy-a22
175 Eufy eufycam 2 2 plus 1 kit 4999.0000 //shop.orange.eg/en/devices/accessories/camera-wireless/eufy-eufycam-2-2-plus-1-kit
176 rows × 3 columns
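The Price column above holds strings such as 46690.0000, and the Url column holds protocol-relative links that start with //. Here is a minimal sketch of one way to tidy both, assuming the prices are in a form float() accepts and that https is the right scheme for the site. normalize_row and BASE_URL are hypothetical names, not part of the code above:

```python
from urllib.parse import urljoin

# Assumption: the shop is served over https
BASE_URL = "https://shop.orange.eg/"


def normalize_row(title, price, link):
    """Return (title, price as float or None, absolute URL)."""
    try:
        price_value = float(price)
    except (TypeError, ValueError):
        # Missing or non-numeric price attribute
        price_value = None
    # urljoin fills in the scheme for links that start with //
    return title, price_value, urljoin(BASE_URL, link)


row = normalize_row(
    "Samsung Galaxy Z Fold4 5G",
    "46690.0000",
    "//shop.orange.eg/en/mobiles/samsung-mobiles/samsung-galaxy-z-fold4-5g",
)
print(row)
```

If the data is already in the DataFrame, pd.to_numeric(df['Price'], errors='coerce') should do the price conversion in one step.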
For TQDM visit https://pypi.org/project/tqdm/
For Requests documentation, see https://requests.readthedocs.io/en/latest/
Also for pandas: https://pandas.pydata.org/pandas-docs/stable/index.html
And for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
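The loop in the code above hard-codes 15 pages. If the catalogue grows or shrinks, one option is to keep requesting pages until one comes back with no products. This is only a sketch of that idea: it assumes the endpoint returns a page without device cards once the page number runs past the end, which has not been verified here. collect_pages, fetch_products and fake_fetch are hypothetical names used for illustration:

```python
from typing import Callable, List, Tuple

Row = Tuple[str, str, str]  # (title, price, link)


def collect_pages(fetch_products: Callable[[int], List[Row]],
                  max_pages: int = 50) -> List[Row]:
    """Call fetch_products(1), fetch_products(2), ... until a page is empty."""
    rows: List[Row] = []
    for page in range(1, max_pages + 1):  # max_pages is a safety cap
        products = fetch_products(page)
        if not products:
            break  # assumed signal that the last page has been passed
        rows.extend(products)
    return rows


# Stand-in fetcher so the sketch runs without network access
def fake_fetch(page: int) -> List[Row]:
    data = {
        1: [("Phone A", "100.0000", "//example.com/a")],
        2: [("Phone B", "200.0000", "//example.com/b")],
    }
    return data.get(page, [])


print(collect_pages(fake_fetch))
```

In real use, fetch_products would wrap the s.get and soup.select calls from the loop above and return the tuples for one page.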
CodePudding user response:
The webpage is loaded dynamically from an external source via AJAX, so you have to request the API URL instead.
Example:
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://shop.orange.eg/en/mobiles-and-devices?IsMobile=false'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"}
ajax_url = 'https://shop.orange.eg/en/catalog/ListCategoryProducts'
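params = {
    'IsMobile': 'false',
    'pagenumber': '2',
    'categoryId': '24'
}
# range(1, 2) fetches only page 1; raise the upper bound to fetch more pages
for page in range(1, 2):
    params['pagenumber'] = page
    web_page = requests.get(ajax_url, headers=headers, params=params)
    time.sleep(5)  # pause between requests to go easy on the server
    soup = BeautifulSoup(web_page.content, "html.parser")
    product_list = soup.find_all('div', class_='col-md-6 col-lg-4 mb-4')
    for product in product_list:
        # get_text(strip=True) joins all text inside the h4 with no separator
        title = product.h4.get_text(strip=True)
        print(title)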
Output:
Samsung MobilesSamsung Galaxy Z Fold4 5G
Tablets and LaptopsASUS Vivobook Flip 14
Tablets and LaptopsAcer Aspire 3 A315-56
Lenovo TabletsLenovo IdeaPad 3 15IGL05
Lenovo TabletsLenovo IdeaPad Flex 5
Samsung MobilesSamsung Galaxy S22 Ultra 5G
WearablesApple Watch Series 7
Samsung MobilesSamsung Galaxy Note 20 Ultra
GamingLenovo IdeaPad Gaming 3
Tablets and LaptopsSamsung Galaxy Tab S8 5G
Wireless ChargerLanex Charger Wireless Magnetic 3-in-1 15W
AccessoriesAnker Sound core R100
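The category and the product name run together in the output above (for example Samsung MobilesSamsung Galaxy Z Fold4 5G) because get_text(strip=True) joins every piece of text inside the h4 with no separator. Here is a short sketch of two ways around that, run on a made-up HTML fragment. The span around the category is an assumption about the real markup; the a[name="ancProduct"] anchor is taken from the selectors in the other answer:

```python
from bs4 import BeautifulSoup

# Made-up fragment; the real card markup may differ
html = '''
<div class="col-md-6 col-lg-4 mb-4">
  <h4 class="card-title">
    <span>Samsung Mobiles</span>
    <a name="ancProduct" title="Samsung Galaxy Z Fold4 5G">Samsung Galaxy Z Fold4 5G</a>
  </h4>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
product = soup.find("div", class_="col-md-6 col-lg-4 mb-4")

# Same call as in the answer: text pieces are glued together
joined = product.h4.get_text(strip=True)

# Option 1: pass a separator so the pieces stay apart
spaced = product.h4.get_text(" ", strip=True)

# Option 2: read only the anchor, which holds the product name
name_only = product.h4.a.get_text(strip=True)

print(joined)
print(spaced)
print(name_only)
```

Option 2 is probably the better fit when only the item name is wanted, since it does not depend on how the category text is wrapped.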