I am new to Scrapy and Python. I am scraping data from Aliexpress.com with scrapy-playwright, and all I get back is (referer: None). Here is my code:
import scrapy
from scrapy_playwright.page import PageMethod


class AliSpider(scrapy.Spider):
    name = "aliex"

    def start_requests(self):
        # GET request
        search_value = 'phones'
        yield scrapy.Request(
            f"https://www.aliexpress.com/premium/{search_value}.html?spm=a2g0o.productlist.1000002.0&initiative_id=SB_20230118063054&dida=y",
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_coroutines=[
                    PageMethod('wait_for_selector', '.list--gallery--34TropR')
                ],
            ),
        )

    async def parse(self, response):
        for data in response.xpath("//h1"):
            related_link = data.xpath(".//text()").get()
            yield {
                'related_link': related_link
            }
I am getting
2023-01-18 19:56:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.aliexpress.com/wholesale?SearchText=phones&spm=a2g0o.productlist.1000002.0&initiative_id=SB_20230118063054&dida=y> (referer: None)
2023-01-18 19:56:55 [scrapy.core.engine] INFO: Closing spider (finished)
I tried both XPath and CSS selectors, but the result is the same. Can anyone help me, please?
CodePudding user response:
Here is a complete solution using standalone Playwright with Python, which works fine on Windows. The website loads its data dynamically via JavaScript, which is why I use the page.evaluate() method to execute JavaScript and scroll through the entire page; otherwise, it will not scrape the complete result set.
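The scroll-until-nothing-new-loads step can also be written as a small helper that does not depend on the browser. This is only a sketch: `scroll_to_bottom`, `pause`, and `max_steps` are names made up for illustration, and the `scroll_step` callable is assumed to scroll one viewport down and return the new vertical offset (with Playwright that would be a `page.evaluate()` call returning `window.scrollY`).

```python
import time
from typing import Callable


def scroll_to_bottom(scroll_step: Callable[[], float],
                     pause: float = 2.0,
                     max_steps: int = 100) -> int:
    """Call scroll_step until the scroll position stops increasing.

    Returns the number of steps that moved the page further down.
    max_steps is a safety cap (an assumption) against endless feeds.
    """
    last_position = 0.0
    moved = 0
    for _ in range(max_steps):
        position = scroll_step()
        if position <= last_position:
            break  # no progress: bottom reached and nothing new was loaded
        last_position = position
        moved += 1
        time.sleep(pause)  # give lazy-loaded content time to render
    return moved


# Possible use with a Playwright page object (not executed here):
# scroll_to_bottom(lambda: page.evaluate(
#     "() => { window.scrollBy(0, window.innerHeight); return window.scrollY; }"
# ))
```

Stopping when the offset no longer grows avoids comparing against a page height measured before the lazy-loaded items were added.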
Script:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import pandas as pd
import time

data = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    search_value = 'phones'
    for page_num in range(1, 4):
        page.goto(f"https://www.aliexpress.com/wholesale?SearchText={search_value}&catId=0&dida=y&g=y&initiative_id=SB_20230118063054&page={page_num}&spm=a2g0o.productlist.1000002.0&trafficChannel=main")
        # NOTE: '[]' is a placeholder; put the attribute selector of the product list here
        page.wait_for_selector('[]', timeout=30000)
        # Height of the document when the page first loads
        scroll_height = page.evaluate("""() => {
            return Math.max(
                document.body.scrollHeight, document.documentElement.scrollHeight,
                document.body.offsetHeight, document.documentElement.offsetHeight,
                document.body.clientHeight, document.documentElement.clientHeight
            );
        }""")
        current_height = 0
        previous_height = -1
        # Scroll one viewport at a time so the lazy-loaded cards get rendered;
        # also stop if the scroll position no longer advances
        while current_height < scroll_height and current_height != previous_height:
            previous_height = current_height
            current_height = page.evaluate("""() => {
                window.scrollBy(0, window.innerHeight);
                return window.scrollY;
            }""")
            time.sleep(2)
        # Parse the fully rendered page
        soup = BeautifulSoup(page.content(), 'lxml')
        # NOTE: '[]' is a placeholder; put the attribute selector of a product card here
        for card in soup.select('[]'):
            title = card.h1.text
            data.append({'title': title})
    browser.close()

df = pd.DataFrame(data)
print(df)
Output:
     title
0    Unlock Samsung Galaxy S10 S10 s10e G970U G973...
1    SERVO K07 Plus mini Mobile Phone Pen Dual SIM ...
2    BLACKVIEW OSCAL C80 Smartphone 6.5" Waterdrop ...
3    Original Apple iPhone 7 Unlocked 99% New Mobil...
4    [World Premiere] Blackview BV9200 Rugged Smart...
..   ...
175  Motorola StarTAC Rainbow 500mAh Fashion 90% Ne...
176  Original International Version HuaWei P30 Pro ...
177  Unlocked Original Apple iPhone SE Dual Core 2G...
178  2022 Unihertz TANK Large Battery Rugged Smartp...
179  75W Car Wireless Charger Car Mount Phone Holde...

[180 rows x 1 columns]
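To reuse the script for other search terms or page numbers, the results URL can be assembled from its query parameters instead of being edited by hand. A minimal sketch, assuming the parameters copied from the URL in the script are what the site expects; `build_search_url` and `BASE_URL` are hypothetical names, not part of any library.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL used in the script
BASE_URL = "https://www.aliexpress.com/wholesale"


def build_search_url(search_value: str, page_num: int) -> str:
    """Return the search-results URL for one page of a query."""
    params = {
        "SearchText": search_value,
        "catId": 0,
        "dida": "y",
        "g": "y",
        "initiative_id": "SB_20230118063054",  # copied as-is from the script
        "page": page_num,
        "spm": "a2g0o.productlist.1000002.0",  # copied as-is from the script
        "trafficChannel": "main",
    }
    # urlencode escapes spaces and special characters in the search term
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    for page_num in range(1, 4):
        print(build_search_url("phones", page_num))
```

Each generated URL can then be passed to page.goto() inside the page loop.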