How can I narrow down the results of a scrape with beautifulsoup requests?

Time:06-03

My code reads a list of URLs from an xlsx sheet (e.g. stackoverflow.com).

For each URL, it requests the home page and checks whether an Instagram account is linked there. If so, it writes that link to the adjacent column.

However, some sites link to it in several places (header, footer) or embed a feed, so multiple results are written to the cell.

Is there a way to return just a single result?

for cell in sheet[col][1:]:
    try:
        url = cell.value
        r = requests.get(url)
        ig_domains = ['instagram.com']
        ig_get_present = []
        soup = BeautifulSoup(r.content, 'html5lib')
        all_links = soup.find_all('a', href=True)
        print(cell.value)
        for ig_domain in ig_domains:
            for link in all_links:
                if ig_domain in link.attrs['href']:
                    # Every match is appended, so repeated links pile up in the cell.
                    ig_get_present.append(link.attrs['href'])
                    ig_got = str(ig_get_present)
                    print(ig_got)
                    sheet.cell(cell.row, col2).value = ig_got
    except requests.exceptions.ConnectionError:
        pass
    except requests.exceptions.TooManyRedirects:
        pass
    except requests.exceptions.MissingSchema:
        pass

Edit for clarity:

Some domains have multiple links to their social media pages (e.g. one in the header, one in the footer, one in the navigation bar) or a mirror of their social media feed. In these cases, the same link is written to the cell several times:

['https://instagram.com/xxx', 'https://instagram.com/xxx', 'https://instagram.com/xxx']

I would only want one of these, not all of them.

Answer:

If you only want the first match written to the cell, all you need is a break statement placed immediately after that match is handled.

For example:

...
...
url = cell.value
res = requests.get(url)
domain = 'instagram.com'
soup = BeautifulSoup(res.content, 'html5lib')
all_links = soup.find_all('a', href=True)
for link in all_links:
    if domain in link['href']:
        # Write only the first matching link, then stop scanning this page.
        sheet.cell(cell.row, col2).value = link['href']
        break
...
...
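
The same first-match logic can also be pulled into a small helper, which may be easier to test on its own. This is only a sketch: the function name first_link_for_domain and the sample hrefs are hypothetical, and it assumes the href strings have already been collected from soup.find_all('a', href=True). It uses next() with a default instead of an explicit loop with break, but the effect should be the same.

```python
from typing import Iterable, Optional


def first_link_for_domain(hrefs: Iterable[str], domain: str) -> Optional[str]:
    """Return the first href containing `domain`, or None if there is none."""
    # next() stops consuming the generator at the first match,
    # which has the same effect as a for loop that ends with break.
    return next((href for href in hrefs if domain in href), None)


# Hypothetical hrefs, standing in for the values taken from
# soup.find_all('a', href=True) on a page with repeated links.
hrefs = [
    "https://example.com/about",
    "https://instagram.com/xxx",
    "https://twitter.com/xxx",
    "https://instagram.com/xxx",
]

print(first_link_for_domain(hrefs, "instagram.com"))  # https://instagram.com/xxx
print(first_link_for_domain(hrefs, "facebook.com"))   # None
```

Returning None when nothing matches lets the caller decide whether to leave the cell empty or write a placeholder.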

The break statement in Python is a control flow statement that immediately exits the innermost enclosing for or while loop.
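
To illustrate that only the innermost loop is exited, here is a small sketch with made-up data. The pages dictionary and its URLs are hypothetical stand-ins for the spreadsheet rows and the links found on each page. The inner loop stops at the first Instagram link, while the outer loop still goes on to the next page.

```python
# Hypothetical data: each list stands in for the hrefs found on one page.
pages = {
    "site-a": ["https://site-a.example/", "https://instagram.com/a", "https://instagram.com/a"],
    "site-b": ["https://site-b.example/contact"],
    "site-c": ["https://instagram.com/c", "https://site-c.example/"],
}

first_matches = {}
for site, hrefs in pages.items():      # outer loop: one iteration per page
    for href in hrefs:                 # inner loop: links on that page
        if "instagram.com" in href:
            first_matches[site] = href
            break                      # leaves only the inner loop
    # Execution resumes here, and the outer loop moves on to the next page.

print(first_matches)
# {'site-a': 'https://instagram.com/a', 'site-c': 'https://instagram.com/c'}
```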

You can read more about it in the Python docs: https://docs.python.org/3/tutorial/controlflow.html#break-and-continue-statements-and-else-clauses-on-loops
