Looking for a way to scrape these URLs using Selenium from a paginated website


I'm trying to scrape URLs from a website and then output them to a CSV. The code runs, but it never advances past the first page of the paginated site: the counter increases and the URL changes, yet the page that loads is always page 1.

How do I resolve this?

import csv
from selenium import webdriver

MAX_PAGE_NUM = 3
MAX_PAGE_DIG = 1

driver = webdriver.Firefox()
for i in range(1, MAX_PAGE_NUM + 1):
    page_num = (MAX_PAGE_DIG - len(str(i))) * '0' + str(i)
    driver.get("https://www.example.com/user/learn/freehelp/dynTest/1/Landing/1/page" + page_num)
    find_href = driver.find_elements_by_xpath('//div[@]/a')
    num_page_items = len(find_href)
    with open('links1.csv', 'a') as f:
        for i in range(num_page_items):
            for my_href in find_href:
                f.write(my_href.get_attribute("href") + '\n')
    
driver.close()

CodePudding user response:

You don't need Selenium for this task: that info is accessible with plain requests. Here is one way of getting that data:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from tqdm import tqdm

headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

big_list = []

for x in tqdm(range(1, 5)): ## there are 100 pages, so you may want to increase the range to 101
    url = f'https://www.studypool.com/user/learn/freehelp/dynTest/1/Landing/1/page/{x}'

    r = requests.get(url, headers=headers)
    soup = bs(r.text, 'html.parser')
    table = soup.select_one('table.feedTable')
    titles = table.select('p.qn-title')
    for t in titles:
        title = t.get_text(strip=True)
        link = 'https://www.studypool.com' + t.parent.get('href')
        big_list.append((title, link))
df = pd.DataFrame(big_list, columns = ['Thread', 'Url'])
print(df)
df.to_csv('another_issue_solved.csv')
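
If you don't want pandas' row index written as the first column of the csv, pass index=False to to_csv.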

This will save the dataframe as a csv file and print the following in the terminal:

    Thread  Url
0   TAX 4011 FIU Preparing a Tax Memoir for Your New Client Memorandum  https://www.studypool.com/discuss/18364986/preparing-a-tax-memor-for-your-new-client
1   Accounting Twilio Companys Billing System Case Study Project    https://www.studypool.com/discuss/18330081/briefly-outline-the-problem-statement-objectives-and-goals-and-your-approach-to-the-needs-assessment-and-research-methodology-in-2-4-pages
2   University of Houston Accounting Oil and Gas Accounting Issues Paper    https://www.studypool.com/discuss/18330479/issues-of-exporting-lng-from-the-u-s-course-oil-and-gas-accounting
3   RC Accounting Business Management Income and Balance Sheet Analysis     https://www.studypool.com/discuss/18330477/evaluating-performance-and-benchmarking-2
4   SNHU Accounting Paper   https://www.studypool.com/discuss/18293340/draft-for-introduction-for-final-project
...     ...     ...
155     BPA 331 University of Phoenix Time Value of Money Excel Analysis   https://www.studypool.com/discuss/18639381/bpa-331-time-value-of-money-assignment
156     Accounting Business Communication Agenda for The First Team Meeting Portfolio Tasks     https://www.studypool.com/discuss/18659279/portfolio-2-1
157     BPA 331 University of Phoenix Life Cycle Costing Analysis  https://www.studypool.com/discuss/18639378/bpa-331-life-cycle-costing-analysis-assignment
158     Accounting International Business and Corporate Strategies Group Essay  https://www.studypool.com/discuss/18659434/write-a-part-of-body-paragraph-of-an-essay
159     Aklan Catholic College Direct Labor Cost Assigned to Production Accounting Questions    https://www.studypool.com/discuss/18642386/accounting-413

BeautifulSoup docs: https://beautiful-soup-4.readthedocs.io/en/latest/index.html

Requests docs: https://requests.readthedocs.io/en/latest/

Pandas: https://pandas.pydata.org/pandas-docs/stable/index.html

And tqdm: https://tqdm.github.io/

CodePudding user response:

Try this, if this is your web address; the page number has to come after a /page/ segment:

driver.get("https://www.studypool.com/user/learn/freehelp/dynTest/1/Landing/1/page/" + page_num)
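
If you want to stay with Selenium, here is a minimal sketch of the corrected loop. It assumes the /page/{n} URL pattern and the table.feedTable / p.qn-title markup shown in the requests answer above, and it uses the find_elements(By..., ...) locator style, since the find_elements_by_* helpers from the question have been removed in recent Selenium 4 releases:

from selenium import webdriver
from selenium.webdriver.common.by import By

MAX_PAGE_NUM = 3

driver = webdriver.Firefox()
with open('links1.csv', 'a') as f:
    for i in range(1, MAX_PAGE_NUM + 1):
        # The page number must sit after a "/page/" segment; the original
        # "...Landing/1/page" + page_num URL always resolved to page 1.
        driver.get(f"https://www.studypool.com/user/learn/freehelp/dynTest/1/Landing/1/page/{i}")
        # Markup assumed from the requests answer: each p.qn-title sits
        # directly inside the <a> that carries the thread link.
        for title in driver.find_elements(By.CSS_SELECTOR, 'table.feedTable p.qn-title'):
            link = title.find_element(By.XPATH, './parent::a').get_attribute('href')
            f.write(link + '\n')
driver.close()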