So I started with Python a few days ago and tried to make a function that gives me all subpages of a website. I know it may not be the most elegant function, but I was pretty proud to see it working. But for some reason unknown to me, my function does not work anymore. I could've sworn I haven't changed it since it last worked, but after hours of attempted debugging I am slowly doubting myself. Can you maybe take a look at why my function does not output to a .txt file anymore? I just get handed an empty text file, though if I delete it, it at least creates a new (empty) one.
I tried to move the string-saving part out of the try block, which didn't work. I also tried all_urls.flush()
to maybe force everything to be saved. I restarted the PC in the hope that something in the background was accessing the file and preventing me from writing to it. I also renamed the file it is supposed to save to, so as to generate something truly fresh. Still the same problem. I also checked that the link
from the loop is passed as a string, so that shouldn't be the problem. I also tried:
print(link, file=all_urls, end='\n')
as a replacement for
all_urls.write(link)
all_urls.write('\n')
with no result.
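For reference, as far as I understand it, a bare write-then-close like this should already produce a non-empty file (a minimal sketch with a throwaway filename, just to illustrate what I expect to happen):
# minimal write-then-close sketch; 'write_test.txt' is a throwaway name
with open('write_test.txt', 'w') as test_file:
    test_file.write('https://example.com\n')
# leaving the with block closes (and flushes) the file, so the line should be on disk
That is why the empty file confuses me so much.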
My full function:
def get_subpages(url):
    # gets all subpage links from a website that start with the given url
    from urllib.request import urlopen, Request
    from bs4 import BeautifulSoup
    links = [url]
    tested_links = []
    to_test_links = links
    # open a .txt file to save results into
    all_urls = open('all_urls.txt', 'w')
    problematic_pages = open('problematic_pages.txt', 'w')
    while len(to_test_links)>0:
        for link in to_test_links:
            print('the link we are testing right now:', link)
            # add the current link to the tested list
            tested_links.append(link)
            try:
                print(type(link))
                all_urls.write(link)
                all_urls.write('\n')
                # Save it to the -txt file and make an abstract
                # get the link ready to be accessed
                req = Request(link)
                html_page = urlopen(req)
                soup = BeautifulSoup(html_page, features="html.parser")
                # empty previous temporary links
                templinks = []
                # move the links on the subpage link to templinks
                for sublink in soup.findAll('a'):
                    templinks.append(sublink.get('href'))
                # clean off accidental 'None' values
                templinks = list(filter(lambda item: item is not None, templinks))
                for templink in templinks:
                    # make sure we have still the correct website and don't accidentally crawl instagram etc.
                    # also avoid duplicates
                    if templink.find(url) == 0 and templink not in links:
                        links.append(templink)
                # and lastly refresh the to_test_links list with the newly found links before going back into the loop
                to_test_links = (list(set(links) ^ set(tested_links)))
            except:
                # Save it to the ERROR -txt file and make an abstract
                problematic_pages.write(link)
                problematic_pages.write('\n')
                print('ERROR: All links on', link, 'not retrieved. If need be check for new subpages manually.')
    all_urls.close()
    problematic_pages.close()
    return links
CodePudding user response:
I can't reproduce this, but I've had inexplicable [to me at least] errors with file handling that were resolved when I wrote from inside a with.
[Just make sure to first remove the lines involving all_urls in your current code, just in case - or try this with a different filename while checking whether it works.]
Since you're appending all the urls to tested_links anyway, you could just write them all at once after the while loop:
with open('all_urls.txt', 'w') as f:
    f.write('\n'.join(tested_links) + '\n')
or, if you have to write link by link, you can append by opening with mode='a':
# before the while, if you're not sure the file exists
# [and/or to clear previous data from file]
# with open('all_urls.txt', 'w') as f: f.write('')
# and inside the try block:
with open('all_urls.txt', 'a') as f:
    f.write(f'{link}\n')
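To make that concrete, here is a tiny self-contained sketch of the per-link append pattern (the save_line helper and the example links are placeholders I made up, not part of your code):
def save_line(path, text):
    # open in append mode, write one line, and close again right away,
    # so the data is flushed even if a later iteration raises
    with open(path, 'a') as f:
        f.write(f'{text}\n')

# truncate/create the file once up front
with open('all_urls.txt', 'w'):
    pass

for link in ['https://example.com/', 'https://example.com/about']:
    save_line('all_urls.txt', link)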
CodePudding user response:
Not a direct answer, but in my early days this happened to me too. The requests module of Python sends requests with headers that identify Python; websites can quickly detect that, your IP can get blocked, and you start getting unusual responses - that's why your previously working function isn't working now.
Solution:
Use natural-looking request headers - see the code below:
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
r = requests.get(URL, headers=headers)
Use a proxy in case your IP got blocked - this is highly recommended.
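Since your function uses urllib rather than requests, the same idea there would look roughly like this (just a sketch; the user-agent string and the example URL are placeholders):
from urllib.request import urlopen, Request

# browser-like user agent (example value only)
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
           'AppleWebKit/537.36 (KHTML, like Gecko) '
           'Chrome/75.0.3770.142 Safari/537.36'}

link = 'https://example.com/'  # placeholder URL
req = Request(link, headers=headers)
html_page = urlopen(req)
print(html_page.status)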
CodePudding user response:
Here is your slightly changed script with the changes marked (*****************):
def get_subpages(url):
    # gets all subpage links from a website that start with the given url
    from urllib.request import urlopen, Request
    from bs4 import BeautifulSoup
    links = [url]
    # ******************* added sublinks_list variable ******************
    sublinks_list = []
    tested_links = []
    to_test_links = links
    # open a .txt file to save results into
    all_urls = open('all_urls.txt', 'w')
    problematic_pages = open('problematic_pages.txt', 'w')
    while len(to_test_links)>0:
        for link in to_test_links:
            print('the link we are testing right now:', link)
            # add the current link to the tested list
            tested_links.append(link)
            try:
                all_urls.write(link)
                all_urls.write('\n')
                # Save it to the -txt file and make an abstract
                # get the link ready to be accessed
                req = Request(link)
                html_page = urlopen(req)
                soup = BeautifulSoup(html_page, features="html.parser")
                # empty previous temporary links
                templinks = []
                # move the links on the subpage link to templinks
                sublinks = soup.findAll('a')
                for sublink in sublinks:
                    #templinks.append(sublink.get('href')) ***************** changed the line with next row *****************
                    templinks.append(sublink['href'])
                # clean off accidental 'None' values
                templinks = list(filter(lambda item: item is not None, templinks))
                for templink in templinks:
                    # make sure we have still the correct website and don't accidentally crawl instagram etc.
                    # also avoid duplicates
                    #if templink.find(url) == 0 and templink not in links: ******************* changed the line with next row *****************
                    if templink not in sublinks_list:
                        #links.append(templink) ******************* changed the line with next row *****************
                        sublinks_list.append(templink)
                        all_urls.write(templink + '\n')  # ******************* added this line *****************
                # and lastly refresh the to_test_links list with the newly found links before going back into the loop
                to_test_links = (list(set(links) ^ set(tested_links)))
            except:
                # Save it to the ERROR -txt file and make an abstract
                problematic_pages.write(link)
                problematic_pages.write('\n')
                print('ERROR: All links on', link, 'not retrieved. If need be check for new subpages manually.')
    all_urls.close()
    problematic_pages.close()
    return links

lnks = get_subpages('https://www.jhanley.com/blog/pyscript-creating-installable-offline-applications/')  # ******************* url used for testing *****************
It works and there are over 180 links in the file. Please test it yourself. There are still some misfits and questionable syntax, so you should test your code thoroughly again - but the part that writes links into the file works.
Regards...