I'm trying to scrape URLs from a website and write them to a CSV file. The code runs, but it never gets past the first page of the paginated site: the counter increases and the URL changes, yet the page that loads is always page 1.
How do I resolve this?
import csv
from selenium import webdriver

MAX_PAGE_NUM = 3
MAX_PAGE_DIG = 1
driver = webdriver.Firefox()
for i in range(1, MAX_PAGE_NUM + 1):
    page_num = (MAX_PAGE_DIG - len(str(i))) * '0' + str(i)
    driver.get("https://www.example.com/user/learn/freehelp/dynTest/1/Landing/1/page" + page_num)
    find_href = driver.find_elements_by_xpath('//div[@]/a')
    num_page_items = len(find_href)
    with open('links1.csv', 'a') as f:
        for i in range(num_page_items):
            for my_href in find_href:
                f.write(my_href.get_attribute("href") + '\n')
driver.close()
CodePudding user response:
You don't need Selenium for this task: that information is accessible with requests. Here is one way of getting the data:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from tqdm import tqdm

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
big_list = []
for x in tqdm(range(1, 5)):  ## there are 100 pages, so you may want to increase the range to 101
    url = f'https://www.studypool.com/user/learn/freehelp/dynTest/1/Landing/1/page/{x}'
    r = requests.get(url, headers=headers)
    soup = bs(r.text, 'html.parser')
    table = soup.select_one('table.feedTable')
    titles = table.select('p.qn-title')
    for t in titles:
        title = t.get_text(strip=True)
        link = 'https://www.studypool.com' + t.parent.get('href')
        big_list.append((title, link))
df = pd.DataFrame(big_list, columns=['Thread', 'Url'])
print(df)
df.to_csv('another_issue_solved.csv')
This saves the DataFrame as a CSV file and prints the following in the terminal:
Thread Url
0 TAX 4011 FIU Preparing a Tax Memoir for Your New Client Memorandum https://www.studypool.com/discuss/18364986/preparing-a-tax-memor-for-your-new-client
1 Accounting Twilio Companys Billing System Case Study Project https://www.studypool.com/discuss/18330081/briefly-outline-the-problem-statement-objectives-and-goals-and-your-approach-to-the-needs-assessment-and-research-methodology-in-2-4-pages
2 University of Houston Accounting Oil and Gas Accounting Issues Paper https://www.studypool.com/discuss/18330479/issues-of-exporting-lng-from-the-u-s-course-oil-and-gas-accounting
3 RC Accounting Business Management Income and Balance Sheet Analysis https://www.studypool.com/discuss/18330477/evaluating-performance-and-benchmarking-2
4 SNHU Accounting Paper https://www.studypool.com/discuss/18293340/draft-for-introduction-for-final-project
... ... ...
155 BPA 331 University of Phoenix ?time Value of Money Excel Analysis https://www.studypool.com/discuss/18639381/bpa-331-time-value-of-money-assignment
156 Accounting Business Communication Agenda for The First Team Meeting Portfolio Tasks https://www.studypool.com/discuss/18659279/portfolio-2-1
157 BPA 331 University of Phoenix ?life Cycle Costing Analysis https://www.studypool.com/discuss/18639378/bpa-331-life-cycle-costing-analysis-assignment
158 Accounting International Business and Corporate Strategies Group Essay https://www.studypool.com/discuss/18659434/write-a-part-of-body-paragraph-of-an-essay
159 Aklan Catholic College Direct Labor Cost Assigned to Production Accounting Questions https://www.studypool.com/discuss/18642386/accounting-413
BeautifulSoup docs: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
Requests docs: https://requests.readthedocs.io/en/latest/
Pandas: https://pandas.pydata.org/pandas-docs/stable/index.html
And tqdm: https://tqdm.github.io/
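If you would rather not depend on pandas just to write the file, the standard-library csv module can do the same job. The following is only a minimal sketch: the helper name write_links, the output file name and the sample row are placeholders invented for illustration, and the rows are assumed to be (title, link) tuples like the ones collected in big_list.

```python
import csv


def write_links(rows, path):
    """Write (title, url) pairs to a CSV file with a header row."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Thread", "Url"])
        writer.writerows(rows)


if __name__ == "__main__":
    # Placeholder row standing in for the scraped (title, link) tuples
    sample = [("Example thread", "https://www.example.com/discuss/1")]
    write_links(sample, "links.csv")
```

Unlike df.to_csv with its default settings, this version does not write an index column, and csv.writer quotes titles that contain commas.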
CodePudding user response:
Try this, if this is your web address:
driver.get("https://www.studypool.com/user/learn/freehelp/dynTest/1/Landing/1/page1//page/" + page_num)
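Before changing the scraper, it can help to print the URLs the loop actually requests. The sketch below compares the concatenation used in the question ("page" followed directly by the number) with the ".../page/<n>" path used in the first answer. The function names are made up for illustration, and the assumption that the site expects the "/page/<n>" form comes from that answer rather than from any site documentation. If it holds, it may explain why page 1 keeps loading.

```python
BASE_URL = "https://www.studypool.com/user/learn/freehelp/dynTest/1/Landing/1"


def url_from_question(page: int) -> str:
    # Mirrors the question's string building: "page" + page_num -> ".../page1"
    return BASE_URL + "/page" + str(page)


def url_with_page_segment(page: int) -> str:
    # Assumed pattern, taken from the first answer: ".../page/<n>"
    return f"{BASE_URL}/page/{page}"


if __name__ == "__main__":
    for n in range(1, 4):
        print(url_from_question(n), "->", url_with_page_segment(n))
```

Opening one URL of each form in a browser should show quickly which one moves past the first page.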