How to get to the next page while scraping if the link stays the same?


I'm currently studying web scraping, and I've gotten stuck. I need to scrape data from the next page, but there is only a clickable button and the link stays the same. So my problem is: how can I get to the next page if the URL stays the same? The site I'm scraping is http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp

My code so far:

import scrapy
import json

class EsgKrx1Spider(scrapy.Spider):
    name = 'esg_krx1'
    allowed_domains = ['esg.krx.co.kr']
    start_urls = ['http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp']

    def start_requests(self):
        # send a POST request to the site's JSON endpoint
        return [scrapy.FormRequest("http://esg.krx.co.kr/contents/99/ESG99000001.jspx",
                                   formdata={'sch_com_nm': '',
                                             'sch_yy': '2021',
                                             'pagePath': '/contents/02/02020000/ESG02020000.jsp',
                                             'code': '02/02020000/esg02020000',
                                             'pageFirstCall': 'Y'},
                                   callback=self.parse)]

    def parse(self, response):
        dict_data = json.loads(response.text)

        # loop over the result and print each company's name and share id
        for i in dict_data['result']:
            company_name = i['com_abbrv']
            company_share_id = i['isu_cd']
            print(company_name, company_share_id)

So far I only get the information from the first page. Now I need to move to the next page. Could someone please explain how to do this?

CodePudding user response:

The website you are scraping exposes an API that you can call directly instead of using Splash. If you examine the network tab you will see the POST request being sent to the server.
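
To confirm what that request looks like before wiring it into Scrapy, you can reproduce it with a one-off script. Below is a minimal sketch using the requests library, with the form fields taken from the question's code; the exact shape of the JSON beyond the result list is something you should verify in the network tab, and if the server rejects the bare request, copy the headers shown there as well.

import requests

# Reproduce the POST request seen in the browser's network tab.
# 'curPage' is the parameter that selects the page of results.
url = "http://esg.krx.co.kr/contents/99/ESG99000001.jspx"
payload = {
    'sch_com_nm': '',
    'sch_yy': '2021',
    'pagePath': '/contents/02/02020000/ESG02020000.jsp',
    'code': '02/02020000/esg02020000',
    'curPage': '1',
}
resp = requests.post(url, data=payload)
for row in resp.json()['result']:
    print(row['com_abbrv'], row['isu_cd'])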

See the sample code below. I have hardcoded the total number of pages, but you can derive the total from the first response instead of hardcoding the value; a sketch of that follows the code.

Note the use of response.follow. It takes care of cookies and other headers automatically.

import scrapy

class EsgKrx1Spider(scrapy.Spider):
    name = 'esg_krx1'
    allowed_domains = ['esg.krx.co.kr']
    start_urls = ['http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp']
    custom_settings = {
        "USER_AGENT": 'Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0'
    }

    def parse(self, response):
        #send a post request to the api
        url = "http://esg.krx.co.kr/contents/99/ESG99000001.jspx"
        
        headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        }

        total_pages = 77
        for page in range(total_pages):
            payload = f"sch_com_nm=&sch_yy=2021&pagePath=/contents/02/02020000/ESG02020000.jsp&code=02/02020000/esg02020000&curPage={page + 1}"
            yield response.follow(url=url, method='POST', callback=self.parse_result, headers=headers, body=payload)

    def parse_result(self, response):
        # loop over the result and yield each company's name and share id
        for item in response.json().get('result'):
            yield {
                'company_name': item.get('com_abbrv'),
                'company_share_id': item.get('isu_cd')
            }
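
If you want to avoid hardcoding the 77 pages, you can request page 1 first and compute the page count from its JSON before scheduling the rest, replacing the parse method above with something like the sketch below. It assumes the response carries a total record count in a field named 'totCnt' and that a page holds 10 rows; both are assumptions, so check the actual JSON in the network tab and substitute the real field name and page size.

    def parse(self, response):
        url = "http://esg.krx.co.kr/contents/99/ESG99000001.jspx"
        headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        }
        # fetch page 1 first; its JSON tells us how many pages exist
        payload = "sch_com_nm=&sch_yy=2021&pagePath=/contents/02/02020000/ESG02020000.jsp&code=02/02020000/esg02020000&curPage=1"
        yield response.follow(url=url, method='POST', callback=self.parse_first_page,
                              headers=headers, body=payload,
                              cb_kwargs={'url': url, 'headers': headers})

    def parse_first_page(self, response, url, headers):
        data = response.json()
        per_page = 10  # assumption: rows per page shown in the UI
        total = int(data.get('totCnt', 0))  # 'totCnt' is a hypothetical field name
        total_pages = -(-total // per_page)  # ceiling division
        yield from self.parse_result(response)  # emit the items from page 1
        for page in range(2, total_pages + 1):
            payload = f"sch_com_nm=&sch_yy=2021&pagePath=/contents/02/02020000/ESG02020000.jsp&code=02/02020000/esg02020000&curPage={page}"
            yield response.follow(url=url, method='POST', callback=self.parse_result,
                                  headers=headers, body=payload)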
            

CodePudding user response:

I find it easier to integrate scrapy_splash with JavaScript-heavy websites like this one, as they usually take a while to load when you send a request. Therefore, I have created a simple Lua script to load the site and then parse the required information.
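
For this to work you need a Splash instance running (for example the official Docker image on port 8050) and the scrapy_splash middlewares enabled in your project settings. This is the standard setup from the scrapy-splash README; adjust SPLASH_URL to wherever your instance runs.

# settings.py
SPLASH_URL = 'http://localhost:8050'  # address of your running Splash instance
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'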

You'll find that the payload includes the current page you're on; by incrementing this number up to the last page on the site, you can grab all of the following pages.

Because websites like this will block you quickly, it's important to add timers and download delays so that you don't get blocked.
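
For example, in the spider's custom_settings you can combine a fixed delay with Scrapy's built-in AutoThrottle extension. These are all standard Scrapy settings; the numbers are just reasonable starting values.

    custom_settings = {
        'DOWNLOAD_DELAY': 3,               # base wait between requests
        'RANDOMIZE_DOWNLOAD_DELAY': True,  # vary the delay between 0.5x and 1.5x
        'AUTOTHROTTLE_ENABLED': True,      # adapt the delay to server response times
        'AUTOTHROTTLE_START_DELAY': 3,
        'AUTOTHROTTLE_MAX_DELAY': 10,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    }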

Here's a working scraper:

import scrapy
from scrapy_splash import SplashRequest
import json

script = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(7))
  return splash:html()
end
"""
class KorenSiteSpider(scrapy.Spider):
    name = 'k-site'
    start_urls = ['https://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp']
    custom_settings = {
        'USER_AGENT':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
        'DOWNLOAD_DELAY':3
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url = url,
                callback = self.parse, 
                endpoint='execute',
                args = {'lua_source':script}
            )

    def parse(self, response):
        for i in range(1, 78, 1):
            yield scrapy.FormRequest(
                url = 'https://esg.krx.co.kr/contents/99/ESG99000001.jspx',
                method = 'POST',
                formdata = {
                            'sch_com_nm': '',
                            'sch_yy': '2021',
                            'pagePath': '/contents/02/02020000/ESG02020000.jsp',
                            'code': '02/02020000/esg02020000',
                            'curPage': str(i)
                            },
                callback = self.parse_json
            )

    def parse_json(self, response):
        dict_data = json.loads(response.text)

        # loop over the result and yield each company's name and share id
        for i in dict_data['result']:
            company_name = i['com_abbrv']
            company_share_id = i['isu_cd']
            yield {
                'company_name': company_name,
                'company_share_id': company_share_id
            }

The output:

2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company_name': '페이퍼코리아', 'company_share_id': '001020'}
2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company_name': '평화산업', 'company_share_id': '090080'}
2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company_name': '평화홀딩스', 'company_share_id': '010770'}
2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company_name': '포스코', 'company_share_id': '005490'}
2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company_name': '포스코강판', 'company_share_id': '058430'}
