I want to scrape all the product data for the 'Cushion cover' category at the URL 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'.
I analysed the page and found that the data is in a script tag, but how do I get the data from all the pages? I need the URLs of all the products from all the pages. The data is also exposed through an API, one call per page: API = 'https://www.noon.com/_next/data/B60DhzfamQWEpEl9Q8ajE/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover.json?limit=50&page=2&sort[by]=popularity&sort[dir]=desc&catalog=home-and-kitchen&catalog=home-decor&catalog=slipcovers&catalog=cushion-cover'
If we keep changing the page number in the link above, we get the data for the respective page, but how do I collect that data across all the pages?
Please suggest an approach for this.
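For context, one page of that API can be requested like this. Note this is only a sketch: the build ID 'B60DhzfamQWEpEl9Q8ajE' in the path is specific to the current deployment of the site, so it has to be re-read from the live page's __NEXT_DATA__ script whenever it changes.

```python
import requests

# Assumption: the Next.js build ID below is taken from the URL in the
# question and changes every time noon.com redeploys.
BUILD_ID = "B60DhzfamQWEpEl9Q8ajE"
BASE = ("https://www.noon.com/_next/data/{build_id}"
        "/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover.json")

def build_page_url(page, limit=50):
    """Return the JSON API URL for one results page."""
    return (f"{BASE.format(build_id=BUILD_ID)}"
            f"?limit={limit}&page={page}&sort[by]=popularity&sort[dir]=desc")

def fetch_page(page, session=None):
    """Fetch one page of results as parsed JSON (network call)."""
    sess = session or requests.Session()
    resp = sess.get(build_page_url(page), timeout=30)
    resp.raise_for_status()
    return resp.json()
```

For example, `fetch_page(2)` returns the same JSON document as the API URL above with `page=2`.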
import requests
import pandas as pd
import json
import csv
from lxml import html
headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}
produrl = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'
prodresp = requests.get(produrl, headers=headers, timeout=30)
prodResphtml = html.fromstring(prodresp.text)
print(prodresp)
partjson = prodResphtml.xpath('//script[@id="__NEXT_DATA__"]/text()')
partjson = partjson[0]
CodePudding user response:
You are close to your goal. You can paginate through the pages using a for loop with the range function. We know the total number of pages is 192, which is why the pagination is hardcoded to that range. To get all the product URLs
(or any other data item) from all of the pages, you can follow the next example.
Script:
import requests
import pandas as pd
import json
from lxml import html
headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}
produrl = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/?limit=50&page={page}&sort[by]=popularity&sort[dir]=desc'
data = []
for page in range(1, 193):  # pages are 1-indexed; 192 pages in total
    prodresp = requests.get(produrl.format(page=page), headers=headers, timeout=30)
    prodResphtml = html.fromstring(prodresp.text)
    partjson = prodResphtml.xpath('//script[@id="__NEXT_DATA__"]/text()')
    partjson = json.loads(partjson[0])
    for item in partjson['props']['pageProps']['props']['catalog']['hits']:
        link = 'https://www.noon.com/' + item['url']
        data.append(link)
df = pd.DataFrame(data, columns=['URL'])
# df.to_csv('product.csv', index=False)  # to save the data to your system
print(df)
Output:
URL
0 https://www.noon.com/graphic-geometric-pattern...
1 https://www.noon.com/classic-nordic-decorative...
2 https://www.noon.com/embroidered-iconic-medusa...
3 https://www.noon.com/geometric-marble-texture-...
4 https://www.noon.com/traditional-damask-motif-...
... ...
9594 https://www.noon.com/geometric-printed-cushion...
9595 https://www.noon.com/chinese-style-art-printed...
9596 https://www.noon.com/chinese-style-art-printed...
9597 https://www.noon.com/chinese-style-art-printed...
9598 https://www.noon.com/chinese-style-art-printed...
[9599 rows x 1 columns]
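One caveat: the hardcoded 192 goes stale as the catalogue grows or shrinks. A sketch of an alternative, under the assumption that an out-of-range page returns an empty hits list: keep requesting pages until one comes back empty. The URL template and the __NEXT_DATA__ parsing mirror the script above; `crawl` takes the fetcher as a parameter so the stopping logic works with any page source.

```python
import json

PAGE_URL = ("https://www.noon.com/uae-en/home-and-kitchen/home-decor/"
            "slipcovers/cushion-cover/?limit=50&page={page}"
            "&sort[by]=popularity&sort[dir]=desc")

def extract_links(next_data, base="https://www.noon.com/"):
    """Pull the product URLs out of one page's __NEXT_DATA__ payload."""
    hits = next_data["props"]["pageProps"]["props"]["catalog"]["hits"]
    return [base + item["url"] for item in hits]

def fetch_next_data(page, session):
    """Download one listing page and parse its __NEXT_DATA__ JSON (network call)."""
    from lxml import html  # imported here so the pure helpers stay dependency-free
    resp = session.get(PAGE_URL.format(page=page), timeout=30)
    resp.raise_for_status()
    raw = html.fromstring(resp.text).xpath('//script[@id="__NEXT_DATA__"]/text()')[0]
    return json.loads(raw)

def crawl(fetch, start=1, max_pages=500):
    """Collect links page by page, stopping at the first empty page."""
    links = []
    for page in range(start, start + max_pages):
        batch = extract_links(fetch(page))
        if not batch:
            break
        links.extend(batch)
    return links
```

Usage against the live site would be `crawl(lambda p: fetch_next_data(p, requests.Session()))`, with `max_pages` only as a safety cap.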
CodePudding user response:
I used the re library. In other words, I used a regex, which works well for scraping pages that render their data via JavaScript:
import requests
import json
import re

headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}
url = "https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/"
prodresp = requests.get(url, headers=headers, timeout=30)
jsonpage = re.findall(r'type="application/json">(.*?)</script>', prodresp.text)
jsonpage = json.loads(jsonpage[0])
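One caution: findall grabs the first application/json script on the page, which is not guaranteed to be __NEXT_DATA__. A sketch that anchors the pattern to the script's id instead, using an inline HTML sample as a stand-in for prodresp.text so the parsing step is visible without a network call:

```python
import json
import re

# Stand-in for prodresp.text; the real HTML comes from requests.get as above.
sample_html = (
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"props": {"catalog": {"hits": '
    '[{"url": "graphic-geometric-pattern-cushion"}]}}}}}'
    '</script>'
)

# Anchoring on the id means an unrelated application/json block elsewhere
# in the page can never be matched first.
match = re.search(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    sample_html, re.DOTALL)
next_data = json.loads(match.group(1))
hits = next_data["props"]["pageProps"]["props"]["catalog"]["hits"]
links = ["https://www.noon.com/" + item["url"] for item in hits]
```

The JSON path into hits is the same one the first answer uses; the hit's url value here is a made-up example.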