I want to scrape a website, but whenever I try to print the elements I'm scraping, I just get an empty list. I know the problem is that the parser can't find the class in the HTML code. I tried the other parsers supported by BeautifulSoup4 ('lxml' and 'html5lib'), but it still doesn't work. I even tried downgrading from version 4.11.0 to 4.9.3, and it still doesn't work. Any ideas?
import requests
import random
from bs4 import BeautifulSoup
products = {
1: "Hoodies",
2: "Sunglasses",
3: "Couple-T-shirts",
4: "Wall-Stickers",
5: "Rugs",
6: "Dog-Bed",
7: "Claw-Cutter",
8: "Fur-Remover",
9: "Led-Keyboard",
10: "Wireless-Chargers",
11: "Powerbank",
12: "Game-Controller",
13: "Portable-Speakers",
14: "Scalp-Massager",
15: "Blackhead-Remover",
16: "Lash-Products",
17: "Makeup-Kit",
18: "Air-Tag-Tracker",
19: "Air-Purifiers",
20: "Pixelart",
21: "Yoga-Mats",
22: "Face-Masks",
23: "Fitness-Watches",
24: "Resistance-Bands",
25: "Air-Purifiers",
26: "Cell-Phone-Mounts",
27: "Wireless-Security-Cameras",
28: "Massage-Tools",
29: "Air-Purifiers",
30: "Eyeliner-Pencil",
31: "Water-Filters",
32: "Slow-Feeder-Dog-Bowls",
33: "Video-Doorbells",
34: "Solar-Outdoor-Lights",
35: "Phone-Grip",
36: "Slow-Feeder-Dog-Bowls",
37: "Pajamas",
38: "Skin-Care-Oil",
39: "Flasks",
40: "Monitor-Holders",
41: "Watches",
42: "Rings",
43: "Monitor-Holders",
44: "Nail-Polish",
45: "Rice-Cooker",
}
i = random.randint(1, 45)
url = f'https://www.aliexpress.com/af/{products[i]}.html?spm=a2g0o.productlist.10000020initiative_id=SB_20230106091400&dida=y&origin=n'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html5lib')
soup.prettify()
product = soup.find_all('a', {'class': 'manhattan--container--1lP57Ag cards--gallery--2o6yJVt'})
productimage = soup.find_all('img', {'class': 'manhattan--img--36QXbtQ product-img'})
productprice = soup.find_all('div', {'class': 'manhattan--price-sale--1CCSZfK'})
productrating = soup.find_all('span', {'class': 'manhattan--evaluation--3cSMntr'})
print(soup)
productlinkstring = product[0]['href']
productlinkstring = 'https://www' + productlinkstring[4:]
productimagelinkstring = productimage[0]['src']
productimagelinkstring = 'https://' + productimagelinkstring[2:]
productpricestring = productprice[0].text
productratingsting = productrating[0].text
producttitle = products[i]
producttitlestring = producttitle.replace('-', ' ')
productsstring = products[i]
productsstringright = productsstring.replace('-', ' ')
endpoint = f'https://api.datamuse.com/words?ml={productsstringright}&max=18'
response = requests.get(endpoint)
data = response.json()
print(productlinkstring)
print(productimagelinkstring)
CodePudding user response:
I don't think there's anything wrong with your parser.
Some notes about my tests: i was 19 when I ran them, so url was formed as https://www.aliexpress.com/af/Air-Purifiers.html?spm=a2g0o.productlist.10000020initiative_id=SB_20230106091400&dida=y&origin=n, but when I ran
print(f'{page.status_code} {page.reason} from {page.url}')
to check the response, the output indicated a redirect somewhere. I also ran
with open('x.html', 'wb') as f: f.write(page.content)
but when I opened "x.html" in my browser, it looked like this, even though the products show up (with the classes used in your code, too) when I rendered it with IPython.display.HTML.
This indicates that the elements you're targeting are rendered with JavaScript, and that's why they're not found.
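A quick way to confirm this without opening a browser is to check whether the class names occur anywhere in the raw response text. The sketch below is only an illustration: missing_class_names is a hypothetical helper, and static_html is a stand-in for page.text rather than a real AliExpress response.

```python
def missing_class_names(html_text, class_names):
    """Return the class names that never appear in the raw HTML text."""
    return [name for name in class_names if name not in html_text]


# Stand-in for page.text (assumption: the server sends a mostly empty shell
# and the product cards are built later by JavaScript).
static_html = (
    '<html><body><div id="root"></div>'
    '<script>var data = {};</script></body></html>'
)

wanted = [
    "manhattan--price-sale--1CCSZfK",
    "manhattan--evaluation--3cSMntr",
]

# Any name listed here cannot be found by BeautifulSoup either,
# regardless of which parser is used.
print(missing_class_names(static_html, wanted))
```

If every class name comes back as missing, switching parsers will not help, because the markup is simply not in the response.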
The data you want seems to be in one of the script tags, though. I could target that tag with
scriptTag = soup.select_one('meta[name="aplus-auto-exp"]~script')
and, with any other parser, the JavaScript code could probably be retrieved simply with jScript = scriptTag.get_text(), but with the html5lib parser I had to stringify the tag and strip the tag name:
jScript = scriptTag.prettify().strip()
jScript = jScript.strip('<script').split('>', 1)[-1].strip('</script>').strip()
And then I extracted the list of products with findObj_inJS, which uses slimit to parse JavaScript:
itemList = findObj_inJS(jScript, '"itemList"')['content']
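If slimit is not available, a standard-library alternative might work, assuming the object embedded in the script is valid JSON (which is not guaranteed for arbitrary JavaScript). This is not findObj_inJS; extract_json_after_key is a hypothetical helper that relies on json.JSONDecoder.raw_decode, and sample_js is made-up input.

```python
import json


def extract_json_after_key(js_text, key):
    """Return the JSON value that follows `"key":` in js_text, or None."""
    decoder = json.JSONDecoder()
    marker = json.dumps(key)  # e.g. itemList -> "itemList" (with quotes)
    start = js_text.find(marker)
    while start != -1:
        pos = start + len(marker)
        # Skip whitespace between the key and the colon.
        while pos < len(js_text) and js_text[pos].isspace():
            pos += 1
        if pos < len(js_text) and js_text[pos] == ":":
            pos += 1
            # raw_decode does not skip leading whitespace, so do it here.
            while pos < len(js_text) and js_text[pos].isspace():
                pos += 1
            try:
                value, _end = decoder.raw_decode(js_text, pos)
                return value
            except json.JSONDecodeError:
                pass  # not valid JSON here; try the next occurrence
        start = js_text.find(marker, start + 1)
    return None


# Made-up JavaScript for illustration only.
sample_js = (
    'var data = {"mods": {"itemList": {"content": '
    '[{"productId": "1"}, {"productId": "2"}]}}};'
)

found = extract_json_after_key(sample_js, "itemList")
print(len(found["content"]))
```

The trade-off is that this only handles strict JSON; a real JavaScript parser is more robust when the script uses unquoted keys, single quotes, or trailing commas.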
itemList contained 60 items. The first 2 items are prettified as JSON below:
[
{
"itemType": "productV3",
"productType": "natural",
"nativeCardType": "nt_srp_cell_g",
"itemCardType": "manhattan",
"productId": "3256804896450950",
"lunchTime": "2022-12-20 00:00:00",
"image": {
"imgUrl": "//ae01.alicdn.com/kf/S39f71cffe79a4a6981083f57128103c9d/Anti-Gravity-Humidifier-Diffuser-Water-Drop-Falling-Remote-Control-Mini-Mist-Maker-Humidifier-Air-Purifiers-Droplets.jpg_220x220xz.jpg",
"imgWidth": 220,
"imgHeight": 220,
"imgType": "0"
},
"title": {
"seoTitle": "Anti Gravity Humidifier Diffuser Water Drop Falling Remote Control Mini Mist Maker Humidifier Air Purifiers Droplets Upflow USB",
"displayTitle": "Anti Gravity Humidifier Diffuser Water Drop Falling Remote Control Mini Mist Maker Humidifier Air Purifiers Droplets Upflow USB",
"shortTitle": false
},
"prices": {
"skuId": "12000031580241123",
"pricesStyle": "default",
"builderType": "skuCoupon",
"currencySymbol": "US $",
"prefix": "Sale price:",
"salePrice": {
"discount": -1,
"minPriceDiscount": 50,
"priceType": "sale_price",
"currencyCode": "USD",
"minPrice": 26,
"minPriceType": 2,
"formattedPrice": "US $26.00"
},
"taxRate": "0"
},
"sellingPoints": [
{
"sellingPointTagId": "m0000063",
"tagStyleType": "default",
"tagContent": {
"displayTagType": "text",
"tagText": "Extra 1% off with coins",
"tagStyle": {
"color": "#FD384F",
"position": "2"
}
},
"source": "flexiCoin_new_atm"
},
{
"sellingPointTagId": "m0000064",
"tagStyleType": "default",
"tagContent": {
"displayTagType": "text",
"tagText": "Free Shipping",
"tagStyle": {
"color": "#009966",
"position": "4"
}
},
"source": "Free_Shipping_atm"
}
],
"store": {
"storeId": 1102377281,
"aliMemberId": 2668032633,
"storeName": "Yousmile Life Store",
"storeUrl": "//www.aliexpress.com/store/1102377281"
},
"trace": {
"pdpParams": {
"pdp_cdi": "{\"traceId\":\"212243c016730635978473504d06bf\",\"itemId\":\"3256804896450950\",\"fromPage\":\"search\",\"skuId\":\"12000031580241123\",\"shipFrom\":\"US\",\"order\":\"0\",\"star\":\"\",\"freeShip\":\"true\"}",
"pdp_npi": "2@dis!USD!52.0!26.0!!!!!@212243c016730635978473504d06bf!12000031580241123!sea",
"pdp_perf": "main_img=//ae01.alicdn.com/kf/S39f71cffe79a4a6981083f57128103c9d.jpg",
"pdp_ext_f": "{\"sku_id\":\"12000031580241123\"}"
},
"exposure": {
"displayCategoryId": "",
"postCategoryId": "625",
"selling_point": "m0000063,m0000064",
"algo_exp_id": "77f9a225-26b3-4384-990c-3f411893324d-0"
},
"click": {
"algo_pvid": "77f9a225-26b3-4384-990c-3f411893324d",
"haveSellingPoint": "true"
},
"detailPage": {
"algo_pvid": "77f9a225-26b3-4384-990c-3f411893324d",
"algo_exp_id": "77f9a225-26b3-4384-990c-3f411893324d-0"
},
"custom": {},
"utLogMap": {
"original_price_type": "offer",
"formatted_price": "US $26.00",
"csp": "26.0,1",
"x_object_type": "productV3",
"algo_pvid": "77f9a225-26b3-4384-990c-3f411893324d",
"hit_19_forbidden": false,
"is_detail_next": "1",
"model_ctr": 0.06788578629493713,
"sku_id": "12000031580241123",
"mixrank_success": "false",
"custom_group": 3,
"sku_ic_tags": "[]",
"is_adult_certified": false,
"mixrank_enable": "false",
"ump_atmospheres": "none",
"oip": "52.0,0",
"selling_point": "m0000063,m0000064",
"original_price_strategy": "sku_opt",
"x_object_id": "1005005082765702"
}
}
},
{
"itemType": "productV3",
"productType": "natural",
"nativeCardType": "nt_srp_cell_g",
"itemCardType": "manhattan",
"productId": "3256803823899352",
"lunchTime": "2022-03-09 00:00:00",
"image": {
"imgUrl": "//ae01.alicdn.com/kf/S5d14059b4d364bf5a41d2c08a5d6e735y/Portable-Air-Purifier-Anion-Air-Purification-Xiomi-Air-Freshener-Ionizer-Cleaner-Dust-Cigarette-Smoke-Remover-Toilet.jpg_220x220xz.jpg",
"imgWidth": 220,
"imgHeight": 220,
"imgType": "0"
},
"title": {
"seoTitle": "Portable Air Purifier Anion Air Purification Xiomi Air Freshener Ionizer Cleaner Dust Cigarette Smoke Remover Toilet Deodorant",
"displayTitle": "Portable Air Purifier Anion Air Purification Xiomi Air Freshener Ionizer Cleaner Dust Cigarette Smoke Remover Toilet Deodorant",
"shortTitle": false
},
"prices": {
"skuId": "12000027730379754",
"pricesStyle": "default",
"builderType": "skuCoupon",
"currencySymbol": "US $",
"prefix": "Sale price:",
"originalPrice": {
"priceType": "original_price",
"currencyCode": "USD",
"minPrice": 20.8,
"minPriceType": 1,
"formattedPrice": "US $20.80"
},
"salePrice": {
"discount": -1,
"minPriceDiscount": 95,
"priceType": "sale_price",
"currencyCode": "USD",
"minPrice": 0.99,
"minPriceType": 2,
"formattedPrice": "US $0.99"
},
"taxRate": "0"
},
"sellingPoints": [
{
"sellingPointTagId": "m0000040",
"tagStyleType": "default",
"tagContent": {
"displayTagType": "image",
"tagImgUrl": "https://ae01.alicdn.com/kf/S8cbad032762b405b8ebd8f30bca4bc83u/338x64.png",
"tagImgWidth": 338,
"tagImgHeight": 64,
"tagStyle": {
"color": "#FD384F",
"position": "1"
}
},
"source": "new_user_platform_allowance_atm"
},
{
"sellingPointTagId": "m0000064",
"tagStyleType": "default",
"tagContent": {
"displayTagType": "text",
"tagText": "Free Shipping",
"tagStyle": {
"color": "#009966",
"position": "4"
}
},
"source": "Free_Shipping_atm"
},
{
"sellingPointTagId": "1000013764",
"tagStyleType": "default",
"tagContent": {
"displayTagType": "text",
"tagText": "Free Return",
"tagStyle": {
"color": "#009966",
"position": "4"
}
},
"source": "Free_return_atm"
}
],
"evaluation": {
"starRating": 4.4,
"starUrl": "https://ae01.alicdn.com/kf/S567d6bf538214abf95c1e5825c7e6a05t/48x48.png",
"starWidth": 48,
"starHeight": 48
},
"trade": {
"tradeDesc": "1340 sold"
},
"store": {
"storeId": 1101982761,
"aliMemberId": 2657161543,
"storeName": "BOYALIGE Official Store",
"storeUrl": "//www.aliexpress.com/store/1101982761"
},
"trace": {
"pdpParams": {
"pdp_cdi": "{\"traceId\":\"212243c016730635978473504d06bf\",\"itemId\":\"3256803823899352\",\"fromPage\":\"search\",\"skuId\":\"12000027730379754\",\"shipFrom\":\"CN\",\"order\":\"1340\",\"star\":\"4.4\",\"freeShip\":\"true\",\"shipSellingPoint\":\"freeReturn\"}",
"pdp_npi": "2@dis!USD!20.8!0.99!!!!!@212243c016730635978473504d06bf!12000027730379754!sea",
"pdp_perf": "main_img=//ae01.alicdn.com/kf/S5d14059b4d364bf5a41d2c08a5d6e735y.jpg",
"pdp_ext_f": "{\"sku_id\":\"12000027730379754\"}"
},
"exposure": {
"displayCategoryId": "",
"postCategoryId": "613",
"selling_point": "m0000040,m0000064,1000013764",
"algo_exp_id": "77f9a225-26b3-4384-990c-3f411893324d-1"
},
"click": {
"algo_pvid": "77f9a225-26b3-4384-990c-3f411893324d",
"haveSellingPoint": "true"
},
"detailPage": {
"algo_pvid": "77f9a225-26b3-4384-990c-3f411893324d",
"algo_exp_id": "77f9a225-26b3-4384-990c-3f411893324d-1"
},
"custom": {},
"utLogMap": {
"original_price_type": "offer",
"formatted_price": "US $0.99",
"csp": "0.99,1",
"x_object_type": "productV3",
"algo_pvid": "77f9a225-26b3-4384-990c-3f411893324d",
"hit_19_forbidden": false,
"is_detail_next": "1",
"model_ctr": 0.21320748329162598,
"sku_id": "12000027730379754",
"mixrank_success": "false",
"custom_group": 3,
"sku_ic_tags": "[]",
"is_adult_certified": false,
"mixrank_enable": "false",
"ump_atmospheres": "new_user_platform_allowance,none",
"oip": "20.8,1",
"selling_point": "m0000040,m0000064,1000013764",
"original_price_strategy": "sku_opt",
"x_object_id": "1005004010214104"
}
},
"config": {
"prices": {
"color": "#FD384F"
}
}
}
]
You could write a function to reduce it to only the information you want, something like:
def reduceProductInfo(origInf):
    # productId is required: the product link is built from it
    try: prodId = origInf['productId']
    except Exception: return {'errorMsg': 'productId expected', 'orig': origInf}

    pDets = {
        'id': prodId, 'link': f'https://www.aliexpress.com/item/{prodId}.html'
    }

    # For optional fields, store the error text instead of raising
    try: pDets['title'] = origInf["title"]["displayTitle"]
    except Exception as e: pDets['title'] = f'!{type(e)} - {e}!'
    try: pDets['salePrice'] = origInf["prices"]["salePrice"]["formattedPrice"]
    except Exception as e: pDets['salePrice'] = f'!{type(e)} - {e}!'
    try: pDets['imageLink'] = origInf["image"]["imgUrl"]
    except Exception as e: pDets['imageLink'] = f'!{type(e)} - {e}!'

    # Turn protocol-relative image links ("//host/...") into https links
    if pDets['imageLink'][:2] == '//':
        pDets['imageLink'] = f"https:{pDets['imageLink']}"

    return pDets
and then just reduce the whole list with
productList = [reduceProductInfo(item) for item in itemList]
The first 5 items of productList would look like this.
If you want to go with this approach, you might want to check out my getNestedVal function as well; it's rather useful for quickly figuring out the sequence of keys and indices needed to get a nested value.
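To illustrate the general idea of working out such a key sequence, here is a minimal sketch. It is not getNestedVal: find_key_paths and get_by_path are hypothetical helpers, and sample_item is a trimmed-down version of the first item shown above.

```python
def find_key_paths(data, target_key, path=()):
    """Yield every path (tuple of keys and indices) that ends at target_key."""
    if isinstance(data, dict):
        for key, value in data.items():
            if key == target_key:
                yield path + (key,)
            # Keep searching below this key as well.
            yield from find_key_paths(value, target_key, path + (key,))
    elif isinstance(data, list):
        for index, value in enumerate(data):
            yield from find_key_paths(value, target_key, path + (index,))


def get_by_path(data, path):
    """Follow a path of keys and indices down into nested dicts and lists."""
    for step in path:
        data = data[step]
    return data


# Trimmed-down copy of one product entry, for illustration only.
sample_item = {
    "productId": "3256804896450950",
    "prices": {"salePrice": {"formattedPrice": "US $26.00"}},
    "sellingPoints": [{"tagContent": {"tagText": "Free Shipping"}}],
}

for found_path in find_key_paths(sample_item, "tagText"):
    print(found_path, get_by_path(sample_item, found_path))
```

Once a path is known, it can be hard-coded in a function like reduceProductInfo as a chain of keys and indices.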
CodePudding user response:
I don't know exactly what your goal in scraping is, but have you tried
soup = BeautifulSoup(page.content, 'html.parser')
After that, your code works perfectly for me and prints valid URLs at the end of the script:
https://www.aliexpress.com/item/1005004643645808.html?algo_pvid=36f2290b-a7f6-4a43-8d4e-cb427a255b22&algo_exp_id=36f2290b-a7f6-4a43-8d4e-cb427a255b22-0&pdp_ext_f={"sku_id":"12000030390398106"}&pdp_npi=2@dis!EUR!3.8!1.79!!!!!@2100b20d16730874440116525d06ef!12000030390398106!sea&curPageLogUid=TO7uEnR4RcFE
https://ae01.alicdn.com/kf/Sba697dac347c425bbbc16a0f71d8a534W/Mini-Tracking-Device-Tracking-Air-Tag-Key-Child-Finder-Pet-Tracker-Location-Smart-Bluetooth-Tracker-Car.jpg_220x220xz.jpg
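As a side note on the string slicing used in the question to build these URLs: a minimal sketch of an alternative, assuming the scraped links are either protocol-relative (starting with //) or site-relative (starting with /), is to let urllib.parse.urljoin resolve them against the site's base URL. BASE_URL, absolute_url and the example paths are assumptions made for illustration.

```python
from urllib.parse import urljoin

# Assumed base URL of the site being scraped.
BASE_URL = "https://www.aliexpress.com/"


def absolute_url(link, base=BASE_URL):
    """Resolve a protocol-relative or site-relative link to a full URL."""
    return urljoin(base, link)


# Protocol-relative link: the scheme is taken from the base URL.
print(absolute_url("//ae01.alicdn.com/kf/example.jpg"))
# Site-relative link: scheme and host are taken from the base URL.
print(absolute_url("/item/3256804896450950.html"))
```

Links that are already absolute pass through unchanged, so the helper does not need to know which form a given href or src uses.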