I want this code to find all the links in the "input.html" file, but it finds and shows only the first link. Here is the code:
import codecs
from bs4 import BeautifulSoup
fd = codecs.open('input.html', 'r')
def clean(html):
    soup = BeautifulSoup(html, "lxml")
    for link in soup.find_all('a'):
        link.extract()
        text = link.get('href')
        return text
CodePudding user response:
You could collect the links in a list and return the list after the loop:
import codecs
from bs4 import BeautifulSoup
fd = codecs.open('input.html', 'r')
def clean(html):
    soup = BeautifulSoup(html, "lxml")
    text = []  # created per call, so repeated calls do not accumulate links
    for link in soup.find_all('a'):
        link.extract()
        text.append(link.get('href'))
    return text  # return only after the loop has visited every link
CodePudding user response:
You're returning text inside the loop body, so the function exits after the first iteration. Collect the values in a list and return it after the loop:
def clean(html):
    soup = BeautifulSoup(html, "lxml")
    links = []
    for link in soup.find_all('a'):
        link.extract()
        text = link.get('href')
        links.append(text)
    return links
Also, instead of a function, you can use a simple list comprehension:
soup = BeautifulSoup(html, "lxml")
links = [link.extract().get('href') for link in soup.find_all('a')]
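To make the effect of the return placement concrete, here is a small side-by-side sketch. The sample markup and the names SAMPLE_HTML, first_link_only and all_links are made up for illustration, and it uses the built-in "html.parser" so that lxml is not required:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
SAMPLE_HTML = '<p><a href="/a">A</a> <a href="/b">B</a> <a href="/c">C</a></p>'


def first_link_only(html):
    """Mirrors the question: `return` sits inside the loop body."""
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a"):
        return link.get("href")  # exits the function on the first iteration
    return None  # reached only when there are no <a> tags


def all_links(html):
    """Collects every href, then returns once the loop has finished."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for link in soup.find_all("a"):
        links.append(link.get("href"))
    return links


print(first_link_only(SAMPLE_HTML))  # /a
print(all_links(SAMPLE_HTML))        # ['/a', '/b', '/c']
```

The only difference between the two functions is where the return statement sits relative to the loop.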
CodePudding user response:
It seems you are returning a link at the end of the first loop iteration. You may use this:
def clean(html):
    soup = BeautifulSoup(html, 'html.parser')
    hrefs = soup.find_all('a')
    links = []
    if hrefs:
        for href in hrefs:
            href.extract()
            link = href.get('href')
            links.append(link)
    return links
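As a side note, extract() removes each tag from the parsed tree, which is not needed if you only want to read the href values. A minimal sketch of a variant that leaves the tree untouched and skips anchors without an href might look like this. The names extract_hrefs and extract_hrefs_from_file are hypothetical, and the UTF-8 encoding of the file is an assumption:

```python
from bs4 import BeautifulSoup


def extract_hrefs(html):
    """Return the href of every <a> tag that has one, leaving the tree intact."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True matches only tags that carry an href attribute,
    # so the result contains no None entries.
    return [a["href"] for a in soup.find_all("a", href=True)]


def extract_hrefs_from_file(path):
    """Hypothetical helper: read an HTML file (assumed UTF-8) and extract its links."""
    with open(path, encoding="utf-8") as fd:
        return extract_hrefs(fd.read())


sample = '<a href="/one">1</a><a name="anchor">no href</a><a href="/two">2</a>'
print(extract_hrefs(sample))  # ['/one', '/two']
```

For the file from the question, you would call extract_hrefs_from_file('input.html'), adjusting the encoding if the file is not UTF-8.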