I want to scrape all the product data for the 'Cushion cover' category at the URL 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'.
I analysed the page and found that the data is in a script tag, but how do I get the data from all the pages? I need the URLs of all the products from all the pages. The data is also exposed through an API, one call per page: API = 'https://www.noon.com/_next/data/B60DhzfamQWEpEl9Q8ajE/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover.json?limit=50&page=2&sort[by]=popularity&sort[dir]=desc&catalog=home-and-kitchen&catalog=home-decor&catalog=slipcovers&catalog=cushion-cover'
If we keep changing the page number in the link above, we get the data for the respective page, but how do I collect that data across all the pages?
Please suggest an approach for this.
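For context, one page of that API can be requested like this. Note this is only a sketch: the build ID 'B60DhzfamQWEpEl9Q8ajE' in the path is specific to the current deployment of the site, so it has to be re-read from the live page's __NEXT_DATA__ script whenever it changes.

```python
import requests

# Assumption: the Next.js build ID below is taken from the URL in the
# question and changes every time noon.com redeploys.
BUILD_ID = "B60DhzfamQWEpEl9Q8ajE"
BASE = ("https://www.noon.com/_next/data/{build_id}"
        "/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover.json")

def build_page_url(page, limit=50):
    """Return the JSON API URL for one results page."""
    return (f"{BASE.format(build_id=BUILD_ID)}"
            f"?limit={limit}&page={page}&sort[by]=popularity&sort[dir]=desc")

def fetch_page(page, session=None):
    """Fetch one page of results as parsed JSON (network call)."""
    sess = session or requests.Session()
    resp = sess.get(build_page_url(page), timeout=30)
    resp.raise_for_status()
    return resp.json()
```

For example, `fetch_page(2)` returns the same JSON document as the API URL above with `page=2`.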
import requests
import pandas as pd
import json
import csv
from lxml import html
headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}
produrl = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'
prodresp = requests.get(produrl, headers=headers, timeout=30)
prodResphtml = html.fromstring(prodresp.text)
print(prodresp)
partjson = prodResphtml.xpath('//script[@id="__NEXT_DATA__"]/text()')
partjson = partjson[0]
CodePudding user response:
You are close to your goal. You can paginate through the pages using a for loop with the range function. We know the total number of pages is 192, which is why the pagination is hardcoded to that range. To get all the product URLs
(or any other data item) from all of the pages, you can follow the next example.
Script:
import requests
import pandas as pd
import json
from lxml import html
headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}
produrl = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/?limit=50&page={page}&sort[by]=popularity&sort[dir]=desc'
data = []
for page in range(1, 193):  # pages are 1-indexed; 192 pages in total
    prodresp = requests.get(produrl.format(page=page), headers=headers, timeout=30)
    prodResphtml = html.fromstring(prodresp.text)
    partjson = prodResphtml.xpath('//script[@id="__NEXT_DATA__"]/text()')
    partjson = json.loads(partjson[0])
    for item in partjson['props']['pageProps']['props']['catalog']['hits']:
        link = 'https://www.noon.com/' + item['url']
        data.append(link)
df = pd.DataFrame(data, columns=['URL'])
# df.to_csv('product.csv', index=False)  # to save the data to your system
print(df)
Output:
URL
0 https://www.noon.com/graphic-geometric-pattern...
1 https://www.noon.com/classic-nordic-decorative...
2 https://www.noon.com/embroidered-iconic-medusa...
3 https://www.noon.com/geometric-marble-texture-...
4 https://www.noon.com/traditional-damask-motif-...
... ...
9594 https://www.noon.com/geometric-printed-cushion...
9595 https://www.noon.com/chinese-style-art-printed...
9596 https://www.noon.com/chinese-style-art-printed...
9597 https://www.noon.com/chinese-style-art-printed...
9598 https://www.noon.com/chinese-style-art-printed...
[9599 rows x 1 columns]
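One caveat: the hardcoded 192 goes stale as the catalogue grows or shrinks. A sketch of an alternative, under the assumption that an out-of-range page returns an empty hits list: keep requesting pages until one comes back empty. The URL template and the __NEXT_DATA__ parsing mirror the script above; `crawl` takes the fetcher as a parameter so the stopping logic works with any page source.

```python
import json

PAGE_URL = ("https://www.noon.com/uae-en/home-and-kitchen/home-decor/"
            "slipcovers/cushion-cover/?limit=50&page={page}"
            "&sort[by]=popularity&sort[dir]=desc")

def extract_links(next_data, base="https://www.noon.com/"):
    """Pull the product URLs out of one page's __NEXT_DATA__ payload."""
    hits = next_data["props"]["pageProps"]["props"]["catalog"]["hits"]
    return [base + item["url"] for item in hits]

def fetch_next_data(page, session):
    """Download one listing page and parse its __NEXT_DATA__ JSON (network call)."""
    from lxml import html  # imported here so the pure helpers stay dependency-free
    resp = session.get(PAGE_URL.format(page=page), timeout=30)
    resp.raise_for_status()
    raw = html.fromstring(resp.text).xpath('//script[@id="__NEXT_DATA__"]/text()')[0]
    return json.loads(raw)

def crawl(fetch, start=1, max_pages=500):
    """Collect links page by page, stopping at the first empty page."""
    links = []
    for page in range(start, start + max_pages):
        batch = extract_links(fetch(page))
        if not batch:
            break
        links.extend(batch)
    return links
```

Usage against the live site would be `crawl(lambda p: fetch_next_data(p, requests.Session()))`, with `max_pages` only as a safety cap.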
CodePudding user response:
I used the re library. In other words, I used a regex, which works well for scraping pages that render their data via JavaScript:
import requests
import json
import re

headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}
url = "https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/"
prodresp = requests.get(url, headers=headers, timeout=30)
jsonpage = re.findall(r'type="application/json">(.*?)</script>', prodresp.text)
jsonpage = json.loads(jsonpage[0])
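One caution: findall grabs the first application/json script on the page, which is not guaranteed to be __NEXT_DATA__. A sketch that anchors the pattern to the script's id instead, using an inline HTML sample as a stand-in for prodresp.text so the parsing step is visible without a network call:

```python
import json
import re

# Stand-in for prodresp.text; the real HTML comes from requests.get as above.
sample_html = (
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"props": {"catalog": {"hits": '
    '[{"url": "graphic-geometric-pattern-cushion"}]}}}}}'
    '</script>'
)

# Anchoring on the id means an unrelated application/json block elsewhere
# in the page can never be matched first.
match = re.search(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    sample_html, re.DOTALL)
next_data = json.loads(match.group(1))
hits = next_data["props"]["pageProps"]["props"]["catalog"]["hits"]
links = ["https://www.noon.com/" + item["url"] for item in hits]
```

The JSON path into hits is the same one the first answer uses; the hit's url value here is a made-up example.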