Beautiful Soup finds only one match

I want this code to find all the links in the file "input.html", but it returns only the first one. Here is the code:

import codecs
from bs4 import BeautifulSoup
fd = codecs.open('input.html', 'r')

def clean(html): 
    soup = BeautifulSoup(html, "lxml")
    for link in soup.find_all('a'):
        link.extract()
        text = link.get('href')
        return text

CodePudding user response:

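Build the list inside the function and return it only after the loop has finished:

from bs4 import BeautifulSoup

def clean(html):
    soup = BeautifulSoup(html, "lxml")
    text = []  # local list, so repeated calls do not accumulate results
    for link in soup.find_all('a'):
        link.extract()  # detach the <a> tag from the parsed tree
        text.append(link.get('href'))
    return text  # outside the loop: runs once every link has been visited

# Read the file and pass its contents to clean()
with open('input.html', 'r') as fd:
    print(clean(fd.read()))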

CodePudding user response:

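The return statement is inside the loop body, so the function exits during the first iteration and only the first link is returned. Collect the links and return after the loop instead: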

def clean(html): 
    soup = BeautifulSoup(html, "lxml")
    links = []
    for link in soup.find_all('a'):
        link.extract()
        text = link.get('href')
        links.append(text)
    return links

Also, instead of a function, you can use a simple list comprehension:

soup = BeautifulSoup(html, "lxml")
links = [link.extract().get('href') for link in soup.find_all('a')]
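
To see the difference on a small input, here is a minimal sketch that runs both versions on an inline HTML string. The names SAMPLE, first_only and all_links are made up for illustration, and it uses the built-in "html.parser" so it runs without lxml installed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document with three links
SAMPLE = '<p><a href="/a">A</a> <a href="/b">B</a> <a href="/c">C</a></p>'

def first_only(html):
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a"):
        return link.get("href")  # return inside the loop: exits on the first link

def all_links(html):
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for link in soup.find_all("a"):
        hrefs.append(link.get("href"))
    return hrefs  # return after the loop: every link was visited

early = first_only(SAMPLE)
collected = all_links(SAMPLE)
print(early)      # /a
print(collected)  # ['/a', '/b', '/c']
```

The only change between the two functions is where the return sits relative to the loop.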

CodePudding user response:

The function returns during the first pass through the loop, so only one link comes back. You could use this instead:

def clean(html):
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for tag in soup.find_all('a'):
        tag.extract()  # detach the <a> tag from the parsed tree
        links.append(tag.get('href'))
    # Return at function level so an empty list, not None, comes back when there are no links
    return links
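
One more thing that may matter: link.get('href') gives None for an <a> tag that has no href attribute (a named anchor, for example). If those should be skipped, find_all accepts href=True to match only tags that carry the attribute. A minimal sketch, where collect_hrefs and the sample string are hypothetical names chosen for illustration:

```python
from bs4 import BeautifulSoup

def collect_hrefs(html):
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only <a> tags that actually have an href attribute
    return [a["href"] for a in soup.find_all("a", href=True)]

# Hypothetical input: one named anchor without href, one real link
sample = '<a name="top">anchor</a><a href="/x">x</a>'
result = collect_hrefs(sample)
print(result)  # ['/x']
```

Without the filter, the same input would produce [None, '/x'].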