I built a scraper with BeautifulSoup and requests that scrapes the Ask.com search results for the keywords entered by the user. For now it only handles a single page of search results. Here is the basic code of my scraper:
import requests
from bs4 import BeautifulSoup as bs

def search(request):
    if request.method == 'POST':
        search = request.POST['search']
        url = 'https://www.ask.com/web?q=' + search
        res = requests.get(url)
        soup = bs(res.text, 'lxml')
        # Each search hit sits in its own div
        result_listings = soup.find_all('div', {'class': 'PartialSearchResults-item'})
        final_result = []
        for result in result_listings:
            result_title = result.find(class_='PartialSearchResults-item-title').text
            result_url = result.find('a').get('href')
            result_desc = result.find(class_='PartialSearchResults-item-abstract').text
            final_result.append((result_title, result_url, result_desc))
        context = {
            'final_result': final_result
        }
I would also like BeautifulSoup to scrape the next 5 pages of search results by following the pagination, so I modified my code like this:
def search(request):
    if request.method == 'POST':
        search = request.POST['search']
        url = 'https://www.ask.com/web?q=' + search
        res = requests.get(url)
        soup = bs(res.text, 'lxml')
        result_listings = soup.find_all('div', {'class': 'PartialSearchResults-item'})
        final_result = []
        for result in result_listings:
            while True:
                result_title = result.find(class_='PartialSearchResults-item-title').text
                result_url = result.find('a').get('href')
                result_desc = result.find(class_='PartialSearchResults-item-abstract').text
                result_nextpage = result.find('a').get('PartialWebPagination-next')
                if result_nextpage.find_all('div', {'class': 'PartialSearchResults-item'}):
                    url = 'https://www.ask.com/web?q=' + search + result.find('a').get('PartialWebPagination-next')
                    return url
                else:
                    final_result.append((result_title, result_url, result_desc))
        context = {
            'final_result': final_result
        }
When I then start my server with python manage.py runserver and enter keywords in the search bar, the page keeps loading forever instead of showing the scraping results. I am asking more experienced members of the community for help because I do not know where my error lies. Inspired by this question, I also modified the url variable:
url = "https://www.ask.com/search?q=" + search + "&start=" + str((page - 1) * 5)
When I ran this, I got the following error: name 'page' is not defined. Any help would be appreciated. Thank you.
CodePudding user response:
If your code works for a single page, then a small change will make it work for the following pages too. Just change the page number in the URL, since ask.com supports it.
import requests
from bs4 import BeautifulSoup as bs

def search(request):
    if request.method == 'POST':
        search = request.POST['search']
        max_pages_to_scrap = 5
        final_result = []
        # Request pages 1..5 and collect the results of all of them in one list
        for page_num in range(1, max_pages_to_scrap + 1):
            url = "https://www.ask.com/web?q=" + search + "&qo=pagination&page=" + str(page_num)
            res = requests.get(url)
            soup = bs(res.text, 'lxml')
            result_listings = soup.find_all('div', {'class': 'PartialSearchResults-item'})
            for result in result_listings:
                result_title = result.find(class_='PartialSearchResults-item-title').text
                result_url = result.find('a').get('href')
                result_desc = result.find(class_='PartialSearchResults-item-abstract').text
                final_result.append((result_title, result_url, result_desc))
        # Build the context once, after every page has been scraped
        context = {'final_result': final_result}
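As an optional variation on the loop above, here is a minimal sketch that separates URL building from fetching. It assumes the same `q`, `qo=pagination` and `page` parameters used in this answer. The names `build_page_url`, `collect_results` and `fetch_page` are hypothetical helpers introduced only for illustration, not part of requests or BeautifulSoup. Using `urlencode` also escapes spaces and characters such as `&` in the user's keywords, which plain string concatenation does not.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ask.com/web"  # same endpoint as in the answer above


def build_page_url(query, page_num):
    """Return the search URL for one results page, with the query URL-encoded."""
    params = {"q": query, "qo": "pagination", "page": page_num}
    return BASE_URL + "?" + urlencode(params)


def collect_results(query, fetch_page, max_pages=5):
    """Call fetch_page(url) for pages 1..max_pages and merge the results.

    fetch_page is a hypothetical callable (an assumption of this sketch)
    that downloads and parses one page and returns a list of
    (title, url, description) tuples. The loop stops early as soon as
    a page comes back empty.
    """
    final_result = []
    for page_num in range(1, max_pages + 1):
        page_results = fetch_page(build_page_url(query, page_num))
        if not page_results:
            break
        final_result.extend(page_results)
    return final_result
```

In the view, `fetch_page` could simply wrap the `requests.get` call and the inner `for result in result_listings` loop from the code above, so the pagination logic stays in one small function that can be tried out without hitting the site.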