I need to write a script that scrapes URLs from a blog page, checks whether a URL contains certain keywords within the link, and then writes out to a CSV file which blog post URL contains the keyword links it found.
As the blog has pagination and over 35 pages/300 blog posts, I'm unsure how to go about this. The URLs that I'm looking for are within each individual blog post.
So far, I've managed to follow a few tutorials on how to get each blog post URL from the homepage by following the pagination.
CodePudding user response:
It is nearly the same: define an empty list to store your special URLs and iterate over your initial result list of URLs:
data = []
for url in result:
    r = requests.get(url).text
    soup = BeautifulSoup(r, "lxml")
    # append whatever "special" URL you extract from the post here
    data.append('specialUrl')
To avoid duplicates / unnecessary requests, iterate over a set():
data = []
for url in set(result):
    r = requests.get(url).text
    soup = BeautifulSoup(r, "lxml")
    # append the special URL you find here
    data.append('FINDSPECIALURL')
Just in case, you can also use break to leave the while loop.
Example
Note: This will only scrape the links from the first blog page into your results - remove the break at the end of the while loop to scrape all the blog pages.
import requests
from bs4 import BeautifulSoup
import pandas as pd

# collect the blog post URLs from the paginated overview pages
page = 1
result = []
while True:
    r = requests.get(f"https://www.snapfish.co.uk/blog/page/{page}/").text
    soup = BeautifulSoup(r, "lxml")
    product = soup.find_all("article", {'class': 'post_list'})
    for data in product:
        result.append(data.find('a').get('href'))
    if soup.find("a", class_='next page-numbers') is None:
        break
    page += 1
    break  # remove this break to scrape all the blog pages

# visit each blog post and collect links whose href contains the keyword
data = []
for url in result:
    r = requests.get(url).text
    soup = BeautifulSoup(r, "lxml")
    for a in soup.select('a[href*="design-detail"]'):
        data.append({
            'urlFrom': url,
            'urlTo': a['href']
        })

pd.DataFrame(data).drop_duplicates().to_csv('result.csv', index=False)
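
If you need to match several keywords rather than just "design-detail", one option is to collect every link from each post and check its href against your own keyword list. A minimal sketch, assuming result already holds the blog post URLs from the loop above and that the keywords list is a placeholder you would replace with your own terms:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# hypothetical keywords - replace with the terms you actually want to find
keywords = ['design-detail', 'photo-book']

data = []
# "result" is the list of blog post URLs collected by the pagination loop above;
# set() avoids requesting the same post twice
for url in set(result):
    r = requests.get(url).text
    soup = BeautifulSoup(r, "lxml")
    for a in soup.select('a[href]'):
        href = a['href']
        for kw in keywords:
            if kw in href:
                # record the post URL, the matching link and the keyword that hit
                data.append({'urlFrom': url, 'urlTo': href, 'keyword': kw})

pd.DataFrame(data).drop_duplicates().to_csv('keyword_links.csv', index=False)

The extra 'keyword' column makes it easy to filter the CSV later by which term matched each link.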