I am writing code that is supposed to open a URL, identify the 3rd link, and repeat this process 3 times (each time with the new URL).
I wrote a loop (below), but it seems to start over with the original URL each time.
Can someone help me fix my code?
import urllib.request, urllib.parse, urllib.error
from urllib.parse import urljoin
from bs4 import BeautifulSoup
#blank list
l = []
#starting url
url = input('Enter URL: ')
if len(url) < 1:
    url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
#loop
for _ in range(4):
    html = urllib.request.urlopen(url).read() #open url
    soup = BeautifulSoup(html, 'html.parser') #parse through BeautifulSoup
    tags = soup('a') #extract tags
    for tag in tags:
        url = tag.get('href', None) #extract links from tags
        l.append(url) #add the links to a list
    url = l[2:3] #slice the list to extract the 3rd url
    url = ' '.join(str(e) for e in url) #change the type to string
    print(url)
Current Output:
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
Desired output:
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Mhairade.html
http://py4e-data.dr-chuck.net/known_by_Butchi.html
http://py4e-data.dr-chuck.net/known_by_Anayah.html
CodePudding user response:
You need to reset the empty list inside the loop, not just create it once at the top. Because l keeps growing across iterations, l[2:3] always picks the 3rd link collected from the very first page, so the same URL is printed every time. The following code works:
import urllib.request, urllib.parse, urllib.error
from urllib.parse import urljoin
from bs4 import BeautifulSoup
#blank list
# l = []
#starting url
url = input('Enter URL: ')
if len(url) < 1:
    url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
#loop
for _ in range(4):
    l = [] #reset the list for each page
    html = urllib.request.urlopen(url).read() #open url
    soup = BeautifulSoup(html, 'html.parser') #parse through BeautifulSoup
    tags = soup('a') #extract tags
    for tag in tags:
        url = tag.get('href', None) #extract links from tags
        l.append(url) #add the links to a list
    url = l[2:3] #slice the list to extract the 3rd url
    url = ' '.join(str(e) for e in url) #change the type to string
    print(url)
Result in terminal:
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Mhairade.html
http://py4e-data.dr-chuck.net/known_by_Butchi.html
http://py4e-data.dr-chuck.net/known_by_Anayah.html
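As a side note, urljoin is imported in your script but never used. If a page ever returned relative hrefs, you could resolve them against the current page's URL before following them. A minimal sketch of that idea, replacing the inner for loop above (it assumes url still holds the page you just fetched):

    for tag in tags:
        href = tag.get('href', None) #raw href, which may be relative
        if href is not None:
            l.append(urljoin(url, href)) #absolute URLs pass through unchanged

For the py4e-data pages this is not strictly needed, since the hrefs there are already absolute.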
CodePudding user response:
To keep it simple and close to your approach, you only need one loop:
for _ in range(4):
    html = urllib.request.urlopen(url).read() #open url
    soup = BeautifulSoup(html, 'html.parser') #parse through BeautifulSoup
    tag = soup('a')[2] #take the 3rd anchor tag directly
    url = tag.get('href', None) #follow its href on the next iteration
    print(url)
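For completeness, here is how that simplified loop might look as a full script. This is just a sketch that assumes every page has at least three anchor tags with an href; the input prompt, the default URL and the count of 4 are taken from the question:

import urllib.request
from bs4 import BeautifulSoup

url = input('Enter URL: ')
if len(url) < 1:
    url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'

for _ in range(4):
    html = urllib.request.urlopen(url).read() #open the current url
    soup = BeautifulSoup(html, 'html.parser') #parse it
    url = soup('a')[2].get('href', None) #href of the 3rd link becomes the next url
    print(url)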