How to scrape multiple pages of a site by following pagination with BeautifulSoup and requests?

I created a scraper using BeautifulSoup and requests that scrapes the search results of Ask.com based on the keywords entered by the user. For now, this scraper only handles the first page of search results. Here is the basic code of my scraper:


import requests
from bs4 import BeautifulSoup as bs

def search(request):
    if request.method == 'POST':
        search = request.POST['search']
        url = 'https://www.ask.com/web?q=' + search
        res = requests.get(url)
        soup = bs(res.text, 'lxml')

        result_listings = soup.find_all('div', {'class': 'PartialSearchResults-item'})

        final_result = []

        for result in result_listings:
            result_title = result.find(class_='PartialSearchResults-item-title').text
            result_url = result.find('a').get('href')
            result_desc = result.find(class_='PartialSearchResults-item-abstract').text

           
            final_result.append((result_title, result_url, result_desc))

        context = {
            'final_result': final_result
        }

I would like BeautifulSoup to also scrape the next 5 pages of search results by following the pagination, so I modified my code like this:



def search(request):
    if request.method == 'POST':
        search = request.POST['search']
        url = 'https://www.ask.com/web?q=' + search
        res = requests.get(url)
        soup = bs(res.text, 'lxml')
       

        result_listings = soup.find_all('div', {'class': 'PartialSearchResults-item'})

        final_result = []

        for result in result_listings:
            while True:
                result_title = result.find(class_='PartialSearchResults-item-title').text
                result_url = result.find('a').get('href')
                result_desc = result.find(class_='PartialSearchResults-item-abstract').text

                result_nextpage = result.find('a').get('PartialWebPagination-next')
                if result_nextpage.find_all('div', {'class': 'PartialSearchResults-item'}):
                    url = 'https://www.ask.com/web?q=' + search + result.find('a').get('PartialWebPagination-next')
                    return url
                else :
                    final_result.append((result_title, result_url, result_desc))


           
                

        context = {
            'final_result': final_result
        }

When I start the server with python manage.py runserver and enter keywords in the search bar, the page keeps loading indefinitely instead of returning the scraped results. I do not know where my error lies, so I am asking more experienced members of the community for help. Inspired by this question, I also modified the url variable:

url = "https://www.ask.com/search?q=" + search + "&start=" + str((page - 1) * 5)

When I ran it, I got the following error: name 'page' is not defined. Any help from the community would be appreciated. Thank you.

Answer:

If your code works for a single page, a small change will make it work for the following pages too. Ask.com accepts a page number in the URL, so loop over the page numbers and request each page in turn.

def search(request):
    if request.method == 'POST':
        search = request.POST['search']
        max_pages_to_scrap = 5
        final_result = []
        for page_num in range(1, max_pages_to_scrap + 1):
            url = "https://www.ask.com/web?q=" + search + "&qo=pagination&page=" + str(page_num)
            res = requests.get(url)
            soup = bs(res.text, 'lxml')
            result_listings = soup.find_all('div', {'class': 'PartialSearchResults-item'})

            for result in result_listings:
                result_title = result.find(class_='PartialSearchResults-item-title').text
                result_url = result.find('a').get('href')
                result_desc = result.find(class_='PartialSearchResults-item-abstract').text
           
                final_result.append((result_title, result_url, result_desc))

        context = {'final_result': final_result}
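
The same idea can be sketched outside the Django view, which makes the loop easier to test. This is only a sketch under a few assumptions: the helper names (build_page_url, scrape_pages, fetch_with_requests, parse_with_bs4) and the 10-second timeout are made up for illustration, while the qo=pagination and page parameters and the CSS class names are taken from the code above. It encodes the search term with urlencode, so keywords containing spaces or & do not break the URL, and it stops as soon as a page returns no results instead of always requesting all 5 pages.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ask.com/web"  # same endpoint as in the answer above


def build_page_url(query, page_num):
    """Return the results URL for one page, with the query safely encoded."""
    params = {"q": query, "qo": "pagination", "page": page_num}
    return BASE_URL + "?" + urlencode(params)


def scrape_pages(query, max_pages, fetch_html, parse_results):
    """Collect results page by page, stopping early when a page is empty.

    fetch_html(url) -> str and parse_results(html) -> list are passed in
    (hypothetical hooks), so the loop itself needs no network access.
    """
    all_results = []
    for page_num in range(1, max_pages + 1):
        html = fetch_html(build_page_url(query, page_num))
        page_results = parse_results(html)
        if not page_results:
            break  # no results on this page: stop requesting further pages
        all_results.extend(page_results)
    return all_results


def fetch_with_requests(url):
    """Fetch one page with requests; raises on HTTP errors or timeouts."""
    import requests  # third-party, imported here so the helpers above stay stdlib-only

    res = requests.get(url, timeout=10)  # timeout value is an assumption
    res.raise_for_status()
    return res.text


def parse_with_bs4(html):
    """Extract (title, url, description) tuples using the class names from the answer."""
    from bs4 import BeautifulSoup  # third-party

    soup = BeautifulSoup(html, "lxml")
    results = []
    for item in soup.find_all("div", {"class": "PartialSearchResults-item"}):
        title = item.find(class_="PartialSearchResults-item-title")
        link = item.find("a")
        desc = item.find(class_="PartialSearchResults-item-abstract")
        if title is None or link is None:
            continue  # skip items that do not match the expected markup
        results.append((
            title.get_text(strip=True),
            link.get("href"),
            desc.get_text(strip=True) if desc else "",
        ))
    return results
```

Inside the view, this would be called as final_result = scrape_pages(search, 5, fetch_with_requests, parse_with_bs4). Note that the loop is bounded by max_pages, unlike the while True loop in the question, which has no break and therefore never finishes.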