Script to extract links from page and check the domain


I'm trying to write a script that iterates through a list of web pages, extracts the links from each page and checks each link to see if they are in a given set of domains. I have the script set up to write two files: pages whose links are in the given domains are written to one file, while the rest are written to the other. I'm essentially trying to sort the pages based on the links they contain. Below is my script, but it doesn't look right to me. I'd appreciate any pointers on how to achieve this (I'm new at this, can you tell?).

import requests
from bs4 import BeautifulSoup
import re


urls = ['https://www.rose.com', 'https://www.pink.com']
for i in range(len(urls)):
    grab = requests.get(urls[i])
    soup = BeautifulSoup(grab.text, 'html.parser')
    f = open('links_good.txt', 'w')
    g = open('links_need_update.txt', 'w')
    for link in soup.find_all('a'):
        data = link.get('href')
        check_url = re.compile(r'(www.x.com)  | (www.y.com)')
        invalid = check_url.search(data)
        if invalid == None
            g.write(urls[i])
            g.write('\n')
        else:
            f.write(urls[i])
            f.write('\n')

CodePudding user response:

There are some very basic problems with your code:

  • if invalid == None is missing a : at the end, and should be written as if invalid is None: anyway
  • not all <a> elements will have an href, so you need to deal with those, or your script will fail
  • the regex has some issues: the stray spaces around | become part of the pattern, the unescaped dots match any character, and the parentheses serve no purpose (see the short sketch after this list)
  • you write the page URL to the file for every single link you check, but you only need to write it once, based on whether the page has a problem at all; or perhaps you wanted a full list of all the problematic links?
  • you reopen (and therefore truncate) the output files on every iteration of your for loop, so only the results of the last page survive
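
To make the regex point concrete, here is a minimal sketch (keeping the placeholder domains www.x.com and www.y.com from the question) of why the original pattern misbehaves and what a cleaner check looks like:

import re

# Original style: the two spaces before | and the one after it are matched
# literally, and the unescaped dots match any character at all
bad = re.compile(r'(www.x.com)  | (www.y.com)')
print(bad.search('https://www.x.com/page'))   # None: the first branch needs two trailing spaces, the second a leading space

# Escaped dots, no stray whitespace, no pointless groups
good = re.compile(r'www\.x\.com|www\.y\.com')
print(good.search('https://www.x.com/page'))  # matches, as intended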

Fixing all that (and using a few arbitrary URLs that work):

import requests
from bs4 import BeautifulSoup
import re


urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
f = open('links_good.txt', 'w')
g = open('links_need_update.txt', 'w')
for i in range(len(urls)):
    grab = requests.get(urls[i])
    soup = BeautifulSoup(grab.text, 'html.parser')
    for link in soup.find_all('a'):
        data = link.get('href')
        if data is not None:
            check_url = re.compile('gamespot.com|pcgamer.com')
            result = check_url.search(data)
            if result is None:
                # no match: this link isn't in one of the wanted domains, so write the page URL and stop checking this page
                g.write(urls[i])
                g.write('\n')
                break
    else:
        f.write(urls[i])
        f.write('\n')
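
One thing worth calling out in the code above is the for ... else: in Python, the else branch of a for loop runs only if the loop finished without hitting break, which is what writes the URL to links_good.txt exactly once when no bad link was found. A tiny standalone sketch of that behaviour (the numbers are arbitrary):

for n in [2, 4, 6]:
    if n % 2:
        print('found an odd number')
        break
else:
    # runs only because the loop above completed without break
    print('all numbers are even')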

However, there are still some issues:

  • you open file handles but never close them; use with instead
  • you loop over the list using an index, which isn't needed; loop over urls directly
  • you compile the regex for efficiency, but do so on every iteration, which defeats the purpose

The same code with those problems fixed:

import requests
from bs4 import BeautifulSoup
import re


urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
with open('links_good.txt', 'w') as f, open('links_need_update.txt', 'w') as g:
    check_url = re.compile('gamespot.com|pcgamer.com')
    for url in urls:
        grab = requests.get(url)
        soup = BeautifulSoup(grab.text, 'html.parser')
        for link in soup.find_all('a'):
            data = link.get('href')
            if data is not None:
                result = check_url.search(data)
                if result is None:
                    # no match: this link isn't in one of the wanted domains, so write the page URL and stop checking this page
                    g.write(url)
                    g.write('\n')
                    break
        else:
            f.write(url)
            f.write('\n')

Or, if you want to list all the problematic URLs on the sites:

import requests
from bs4 import BeautifulSoup
import re


urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
with open('links_good.txt', 'w') as f, open('links_need_update.txt', 'w') as g:
    check_url = re.compile('gamespot.com|pcgamer.com')
    for url in urls:
        grab = requests.get(url)
        soup = BeautifulSoup(grab.text, 'html.parser')
        good = True  # assume the page is good until a link outside the wanted domains shows up
        for link in soup.find_all('a'):
            data = link.get('href')
            if data is not None:
                result = check_url.search(data)
                if result is None:
                    g.write(f'{url},{data}\n')
                    good = False
        if good:
            f.write(url)
            f.write('\n')
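
With that last version, links_need_update.txt ends up holding one url,href pair per line, so you could read it back later with something like this (a minimal sketch, assuming the file was produced by the code above):

with open('links_need_update.txt') as fh:
    for line in fh:
        # split on the first comma only, in case the href itself contains commas
        page_url, bad_link = line.rstrip('\n').split(',', 1)
        print(f'{page_url} links to {bad_link}, which needs updating')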