Getting fewer records than expected

Time:10-06

I wrote a program to scrape book names from Amazon. There are seven pages of results with about 380-400 book names in total, but I got only 110 records. I don't know where my logic is wrong or whether I used the wrong classes. My complete code is below.

from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

website = "https://www.amazon.com"
driver = webdriver.Chrome(executable_path='chromedriver.exe')
driver.get(website)

keyword = "hacking books"
search_book = driver.find_element(By.ID,'twotabsearchtextbox')
search_book.send_keys(keyword)
search_button = driver.find_element(By.ID,'nav-search-submit-button')
search_button.click()
driver.implicitly_wait(8)
items = WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.XPATH, '//div[contains(@class,"s-result-item s-asin")]')))
def findPageCount(tag,classname):#This function is used to get how many pages of results are and give right output
    result = "no pages found"
    num = driver.find_elements(By.XPATH,'//' + tag + '[contains(@class,' + "\"" + classname + "\"" + ')]')
    for pgn in num:
        if pgn.text != "Previous":
            result = pgn.text
    return result
pages = findPageCount("span","s-pagination-item s-pagination-disabled")
print("Total Pages = ",pages)
Books = []
given_file = open(file="BooksName.txt",mode='w')
for i in range(int(pages)-1):# It will iterate loop for number of pages
    items = WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.XPATH, '//div[contains(@class,"s-result-item s-asin")]')))
    for book in items:
        name = book.find_element(By.XPATH,'//span[contains(@class,"a-size-base a-color-base a-text-normal")]')
        Books.append(name.text)
        given_file.write(name.text + '\n')
    sleep(10)
    NextButton = driver.find_element(By.XPATH,'//a[contains(@class,"s-pagination-item s-pagination-next s-pagination-button s-pagination-separator")]')
    NextButton.click()
print("len of book = ",len(Books))
given_file.flush()
given_file.close()

driver.quit()

CodePudding user response:

First, regarding the mixed capitalization of your function and variable names, you should read https://peps.python.org/pep-0008/#function-and-variable-names.
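As a small illustration of that naming convention, here is the question's `findPageCount` rewritten as `find_page_count`, with `Books` becoming `books`. This is only a sketch: it takes a plain list of strings (a hypothetical stand-in for the `.text` of the pagination elements) so that it runs without a browser.

```python
def find_page_count(labels):
    """Return the last pagination label that is not "Previous".

    `labels` is a plain list of strings standing in for the `.text` of the
    pagination elements (an assumption made so the sketch needs no browser).
    """
    result = "no pages found"
    for label in labels:
        if label != "Previous":
            result = label
    return result


books = []  # was: Books
total_pages = find_page_count(["Previous", "7"])  # was: pages = findPageCount(...)
print("Total pages =", total_pages)
```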

Second, you're overcomplicating things. You have some number of pages: move through them until there are no more, then stop. Also, saving the data in a structured format is preferable, so why not save it to a CSV file?

The following example will do just that:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import time as t
import pandas as pd

## define your driver and options [...] ###

wait = WebDriverWait(driver, 5)
url = "https://www.amazon.com"
driver.get(url)

keyword = "hacking books"
search_book = driver.find_element(By.ID,'twotabsearchtextbox')
search_book.send_keys(keyword)
search_button = driver.find_element(By.ID,'nav-search-submit-button')
search_button.click()

big_list = []

while True:
    try:
        items = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//a[@class = "a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal"]')))
        
        for i in items:
            big_list.append((i.text, i.get_attribute('href')))      
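        # The "Next" link is located by the s-pagination-next class used in the question's code.
        # Assumption: on the last page it is no longer a clickable link, so the wait times out.
        next_page_button = wait.until(EC.element_to_be_clickable((By.XPATH, '//a[contains(@class, "s-pagination-next") and contains(text(), "Next")]')))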
        next_page_button.location_once_scrolled_into_view
        t.sleep(1)
        next_page_button.click()
        print('clicked, going to next page')
        t.sleep(1)
    except TimeoutException:
        print('all pages done')
        break
df = pd.DataFrame(big_list, columns = ['Book', 'Url'])
print(df)
df.to_csv('hacking_books.csv')
driver.quit()

Result in terminal:

clicked, going to next page
clicked, going to next page
clicked, going to next page
clicked, going to next page
clicked, going to next page
clicked, going to next page
all pages done
Book    Url
0   The Basics of Hacking and Penetration Testing: Ethical Hacking and Penetration Testing Made Easy    https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A02310523K29XFJ1GRVP2&url=/Basics-Hacking-Penetration-Testing-Ethical/dp/0124116442/ref=sr_1_1_sspa?keywords=hacking+books&qid=1664981107&qu=eyJxc2MiOiI0Ljk4IiwicXNhIjoiNC41NyIsInFzcCI6IjQuMjAifQ%3D%3D&sr=8-1-spons&psc=1&qualifier=1664981107&id=8979325003452433&widgetName=sp_atf
1   Hunting Cyber Criminals: A Hacker's Guide to Online Intelligence Gathering Tools and Techniques https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A04824061OF7T6YHGF8OR&url=/OSINT-Toolkit-Intelligence-Gathering-Investigations/dp/1119540925/ref=sr_1_2_sspa?keywords=hacking+books&qid=1664981107&qu=eyJxc2MiOiI0Ljk4IiwicXNhIjoiNC41NyIsInFzcCI6IjQuMjAifQ%3D%3D&sr=8-2-spons&psc=1&qualifier=1664981107&id=8979325003452433&widgetName=sp_atf
2   Hacking the Hacker: Learn From the Experts Who Take Down Hackers    https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A0457143Z9IJ02GCQ4AB&url=/Hacking-Hacker-Learn-Experts-Hackers/dp/1119396212/ref=sr_1_3_sspa?keywords=hacking+books&qid=1664981107&qu=eyJxc2MiOiI0Ljk4IiwicXNhIjoiNC41NyIsInFzcCI6IjQuMjAifQ%3D%3D&sr=8-3-spons&psc=1&qualifier=1664981107&id=8979325003452433&widgetName=sp_atf
3   The Web Application Hacker's Handbook: Finding and Exploiting Security Flaws    https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A04084041RNQCOYIDN5WY&url=/Web-Application-Hackers-Handbook-Exploiting/dp/1118026470/ref=sr_1_4_sspa?keywords=hacking+books&qid=1664981107&qu=eyJxc2MiOiI0Ljk4IiwicXNhIjoiNC41NyIsInFzcCI6IjQuMjAifQ%3D%3D&sr=8-4-spons&psc=1&qualifier=1664981107&id=8979325003452433&widgetName=sp_atf
4   Hacking: The Art of Exploitation, 2nd Edition   https://www.amazon.com/Hacking-Art-Exploitation-Jon-Erickson/dp/1593271441/ref=sr_1_5?keywords=hacking books&qid=1664981107&qu=eyJxc2MiOiI0Ljk4IiwicXNhIjoiNC41NyIsInFzcCI6IjQuMjAifQ==&sr=8-5
... ... ...
385 Developing IoT Projects with ESP32: Automate your home or business with inexpensive Wi-Fi devices   https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_mtf_aps_sr_pg7_1?ie=UTF8&adId=A008910836T4WAOCVEAEL&url=/Developing-IoT-Projects-ESP32-inexpensive/dp/1838641165/ref=sr_1_304_sspa?keywords=hacking+books&qid=1664981139&qu=eyJxc2MiOiI0Ljk4IiwicXNhIjoiNC41NyIsInFzcCI6IjQuMjAifQ%3D%3D&sr=8-304-spons&psc=1&qualifier=1664981139&id=7572201058395198&widgetName=sp_mtf
386 Hacks for Minecrafters: Command Blocks: The Unofficial Guide to Tips and Tricks That Other Guides Won't Teach You (Unofficial Minecrafters Guides)  https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_mtf_aps_sr_pg7_1?ie=UTF8&adId=A0812825CZ8EWA9608VI&url=/Hacks-Minecrafters-Command-Blocks-Unofficial/dp/1510741070/ref=sr_1_305_sspa?keywords=hacking+books&qid=1664981139&qu=eyJxc2MiOiI0Ljk4IiwicXNhIjoiNC41NyIsInFzcCI6IjQuMjAifQ%3D%3D&sr=8-305-spons&psc=1&qualifier=1664981139&id=7572201058395198&widgetName=sp_mtf
387 Hands-On Penetration Testing on Windows: Unleash Kali Linux, PowerShell, and Windows debugging tools for security testing and analysis  https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_mtf_aps_sr_pg7_1?ie=UTF8&adId=A04884183SDLRYUFRC55G&url=/Hands-Penetration-Testing-Windows-PowerShell/dp/1788295668/ref=sr_1_306_sspa?keywords=hacking+books&qid=1664981139&qu=eyJxc2MiOiI0Ljk4IiwicXNhIjoiNC41NyIsInFzcCI6IjQuMjAifQ%3D%3D&sr=8-306-spons&psc=1&qualifier=1664981139&id=7572201058395198&widgetName=sp_mtf
388 Black Hat Python: Python Programming for Hackers and Pentesters https://www.amazon.com/Black-Hat-Python-Programming-Pentesters/dp/1593275900/ref=sr_1_307?keywords=hacking books&qid=1664981139&qu=eyJxc2MiOiI0Ljk4IiwicXNhIjoiNC41NyIsInFzcCI6IjQuMjAifQ==&sr=8-307
389 Dark Territory: The Secret History of Cyber War https://www.amazon.com/Dark-Territory-Secret-History-Cyber/dp/1476763267/ref=sr_1_308?keywords=hacking books&qid=1664981139&qu=eyJxc2MiOiI0Ljk4IiwicXNhIjoiNC41NyIsInFzcCI6IjQuMjAifQ==&sr=8-308
390 rows × 2 columns
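Two details of the question's loop are worth checking against this result. `range(int(pages)-1)` runs one iteration fewer than the number of pages, so the last page is never read. Also, in Selenium an XPath that starts with `//` searches the whole document even when `find_element` is called on an element, so `book.find_element(By.XPATH, '//span[...]')` returns the first match on the page for every `book`; starting the expression with `.//` scopes it to the element. The sketch below reproduces both effects without a browser. It uses `xml.etree.ElementTree` and made-up markup as stand-ins for Selenium and the Amazon pages, so the page sources, the `item` class, and the function names are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for three result pages (not real Amazon HTML).
PAGES = [
    "<body><div class='item'><span>Book A</span></div>"
    "<div class='item'><span>Book B</span></div></body>",
    "<body><div class='item'><span>Book C</span></div>"
    "<div class='item'><span>Book D</span></div></body>",
    "<body><div class='item'><span>Book E</span></div></body>",
]


def titles(page_source, scoped):
    """Collect one title per item on a page.

    scoped=False emulates an XPath starting with '//': every lookup starts
    at the document root, so it keeps returning the first span on the page.
    scoped=True emulates './/': the lookup starts at the item itself.
    """
    root = ET.fromstring(page_source)
    found = []
    for item in root.findall(".//div[@class='item']"):
        start = item if scoped else root
        found.append(start.find(".//span").text)
    return found


def scrape(pages, scoped, skip_last):
    """Walk the pages; skip_last=True mirrors range(int(pages) - 1)."""
    stop = len(pages) - 1 if skip_last else len(pages)
    collected = []
    for source in pages[:stop]:
        collected.extend(titles(source, scoped))
    return collected


question_style = scrape(PAGES, scoped=False, skip_last=True)
fixed_style = scrape(PAGES, scoped=True, skip_last=False)
print(question_style)  # first title of each page repeated, last page missing
print(fixed_style)     # every title once
```

In the Selenium version, the equivalent changes would be `range(int(pages))` (or the `while True` loop from the example above) and an XPath of the form `'.//span[...]'` inside the inner loop.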

For pandas documentation, go to https://pandas.pydata.org/docs/

For Selenium documentation, go to https://www.selenium.dev/documentation/
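If pandas is not available, the CSV step can also be done with the standard library's `csv` module. A minimal sketch, assuming `big_list` holds `(title, url)` tuples as in the example above; the helper name `write_books` is hypothetical and the two rows are placeholders, not scraped data. Unlike `df.to_csv`, this writes no index column.

```python
import csv
import io


def write_books(rows, fileobj):
    """Write a header and one (title, url) row per book to an open text file."""
    writer = csv.writer(fileobj)
    writer.writerow(["Book", "Url"])
    writer.writerows(rows)


# Placeholder rows for illustration only, not scraped data.
big_list = [
    ("Example Title One", "https://example.com/1"),
    ("Example Title, With Comma", "https://example.com/2"),
]

# An in-memory buffer keeps the sketch free of side effects.
buffer = io.StringIO()
write_books(big_list, buffer)
print(buffer.getvalue())
```

To write a real file instead, open it with `open("hacking_books.csv", "w", newline="", encoding="utf-8")` and pass the file object to `write_books`.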
