I'm trying to write a script that iterates through a list of web pages, extracts the links from each page, and checks whether each link is in a given set of domains. I have the script set up to write two files: pages with links in the given domains are written to one file, while the rest are written to the other. I'm essentially trying to sort the pages based on the links they contain. Below is my script, but it doesn't look right. I'd appreciate any pointers to achieve this (I'm new at this, can you tell?).
import requests
from bs4 import BeautifulSoup
import re
urls = ['https://www.rose.com', 'https://www.pink.com']
for i in range(len(urls)):
    grab = requests.get(urls[i])
    soup = BeautifulSoup(grab.text, 'html.parser')
    f = open('links_good.txt', 'w')
    g = open('links_need_update.txt', 'w')
    for link in soup.find_all('a'):
        data = link.get('href')
        check_url = re.compile(r'(www.x.com) | (www.y.com)')
        invalid = check_url.search(data)
        if invalid == None
            g.write(urls[i])
            g.write('\n')
        else:
            f.write(urls[i])
            f.write('\n')
CodePudding user response:
There are some very basic problems with your code:
- `if invalid == None` is missing a `:` at the end, but should also be `if invalid is None:`
- not all `<a>` elements will have an `href`, so you need to deal with those, or your script will fail
- the regex has some issues (you probably don't want to repeat that first URL and the parentheses are pointless)
- you write the URL to the file every time you find a problem, but you only need to write it to the file if it has a problem at all; or perhaps you wanted a full list of all the problematic links?
- you rewrite the files on every iteration of your `for` loop, so you only get the final result
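To make the regex point concrete, here is a small sketch that needs no network access. It assumes the placeholder hosts `www.x.com` and `www.y.com` from the question; the `hosts` list and the `cleaned` pattern are names made up for this illustration. The spaces around `|` in the original pattern are literal, so a plain URL never matches, and an unescaped `.` matches any character.

```python
import re

# Pattern from the question: the spaces around | are part of the pattern,
# so each alternative only matches when a space sits next to the host name.
original = re.compile(r'(www.x.com) | (www.y.com)')

# Hypothetical cleaned-up pattern for the same placeholder hosts:
# no stray spaces, and re.escape makes each dot match only a literal dot.
hosts = ['www.x.com', 'www.y.com']
cleaned = re.compile('|'.join(re.escape(h) for h in hosts))

href = 'https://www.x.com/page'
print(original.search(href))         # None
print(cleaned.search(href).group())  # www.x.com

# An unescaped dot matches any character, so this unrelated string matches too.
print(re.search('www.x.com', 'wwwzxzcom') is not None)  # True
print(cleaned.search('wwwzxzcom'))                      # None
```

Note that `search` raises a `TypeError` when it is handed `None`, which is why links without an `href` have to be skipped first.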
Fixing all that (and using a few arbitrary URLs that work):
import requests
from bs4 import BeautifulSoup
import re
urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
f = open('links_good.txt', 'w')
g = open('links_need_update.txt', 'w')
for i in range(len(urls)):
    grab = requests.get(urls[i])
    soup = BeautifulSoup(grab.text, 'html.parser')
    for link in soup.find_all('a'):
        data = link.get('href')
        if data is not None:
            check_url = re.compile('gamespot.com|pcgamer.com')
            result = check_url.search(data)
            if result is None:
                # if there's no result, the link doesn't match what we need, so write it and stop searching
                g.write(urls[i])
                g.write('\n')
                break
    else:
        # this else belongs to the for loop: it only runs if the loop never hit break
        f.write(urls[i])
        f.write('\n')
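The `else` near the end of that script is meant to pair with the inner `for` loop, not with the `if`: Python runs a loop's `else` clause only when the loop finishes without `break`. Here is a minimal sketch of that behaviour on plain lists, so it runs without fetching anything. The `classify` function and the sample links are made up for this illustration.

```python
import re

check_url = re.compile('gamespot.com|pcgamer.com')

def classify(hrefs):
    """Return 'needs_update' at the first non-matching link, else 'good'."""
    for href in hrefs:
        # None stands in for an <a> element without an href; skip it
        if href is not None and check_url.search(href) is None:
            result = 'needs_update'
            break
    else:
        # reached only when the loop ran to the end without break
        result = 'good'
    return result

print(classify(['https://www.gamespot.com/news', None]))                   # good
print(classify(['https://www.gamespot.com/news', 'https://example.org']))  # needs_update
```

An empty list of links also ends up in the `else` branch, so a page with no links at all counts as good under this logic.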
However, there are still a lot of issues:
- you open file handles, but never close them; use `with` instead
- you loop over a list using an index; that's not needed, loop over `urls` directly
- you compile a regex for efficiency, but do so on every iteration, countering the effect
The same code with those problems fixed:
import requests
from bs4 import BeautifulSoup
import re
urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
with open('links_good.txt', 'w') as f, open('links_need_update.txt', 'w') as g:
    check_url = re.compile('gamespot.com|pcgamer.com')
    for url in urls:
        grab = requests.get(url)
        soup = BeautifulSoup(grab.text, 'html.parser')
        for link in soup.find_all('a'):
            data = link.get('href')
            if data is not None:
                result = check_url.search(data)
                if result is None:
                    # if there's no result, the link doesn't match what we need, so write it and stop searching
                    g.write(url)
                    g.write('\n')
                    break
        else:
            # this else belongs to the for loop: it only runs if the loop never hit break
            f.write(url)
            f.write('\n')
Or, if you want to list all the problematic URLs on the sites:
import requests
from bs4 import BeautifulSoup
import re
urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
with open('links_good.txt', 'w') as f, open('links_need_update.txt', 'w') as g:
    check_url = re.compile('gamespot.com|pcgamer.com')
    for url in urls:
        grab = requests.get(url)
        soup = BeautifulSoup(grab.text, 'html.parser')
        good = True
        for link in soup.find_all('a'):
            data = link.get('href')
            if data is not None:
                result = check_url.search(data)
                if result is None:
                    g.write(f'{url},{data}\n')
                    good = False
        if good:
            f.write(url)
            f.write('\n')
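One thing to watch for with a substring regex: relative links such as `/news/` contain no domain at all, so they get flagged, while a link to another site that merely mentions `gamespot.com` in its query string passes. If that matters for your pages, a possible alternative is to compare host names instead. This is only a sketch under my own assumptions: `link_in_domains` and `ALLOWED_DOMAINS` are hypothetical names, and treating a relative link as belonging to the page's own domain is a choice you may not want.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical set of accepted domains, mirroring the regex used above.
ALLOWED_DOMAINS = {'gamespot.com', 'pcgamer.com'}

def link_in_domains(page_url, href, domains=ALLOWED_DOMAINS):
    """Return True if href, resolved against page_url, points at an allowed domain."""
    absolute = urljoin(page_url, href)           # turns '/news/' into a full URL
    host = urlparse(absolute).hostname or ''     # hostname is None for mailto: and similar
    # accept the domain itself or any subdomain such as www.gamespot.com
    return any(host == d or host.endswith('.' + d) for d in domains)

page = 'https://www.gamespot.com'
print(link_in_domains(page, '/news/'))                                # True
print(link_in_domains(page, 'https://example.org/?ref=gamespot.com')) # False
print(link_in_domains(page, 'mailto:someone@example.org'))            # False
```

In the scripts above, this would replace the `check_url.search(data)` test, with `url` passed as `page_url` and `data` as `href`.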