So I started with Python a few days ago and tried to make a function that gives me all subpages of a website. I know it may not be the most elegant function, but I was pretty proud to see it working. But for some reason unknown to me, my function does not work anymore. I could've sworn I haven't changed it since it last worked, but after hours of attempted debugging I am slowly doubting myself. Can you maybe take a look at why my function does not output to a .txt file anymore? I just get handed an empty text file, though if I delete it, it at least creates a new (empty) one.
I tried to move the string-saving part out of the try block, which didn't work. I also tried all_urls.flush()
to maybe force everything to be saved. I restarted the PC in the hope that something in the background was accessing the file and preventing me from writing to it. I also renamed the file it is supposed to save to, so as to generate something truly fresh. Still the same problem. I also checked that the link
from the loop is passed as a string, so that shouldn't be the problem. I also tried:
print(link, file=all_urls, end='\n')
as a replacement for
all_urls.write(link)
all_urls.write('\n')
with no result.
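For reference, as far as I understand it, a bare write-then-close like this should already produce a non-empty file (a minimal sketch with a throwaway filename, just to illustrate what I expect to happen):
# minimal write-then-close sketch; 'write_test.txt' is a throwaway name
with open('write_test.txt', 'w') as test_file:
    test_file.write('https://example.com\n')
# leaving the with block closes (and flushes) the file, so the line should be on disk
That is why the empty file confuses me so much.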
My full function:
def get_subpages(url):
    # gets all subpage links from a website that start with the given url
    from urllib.request import urlopen, Request
    from bs4 import BeautifulSoup
    links = [url]
    tested_links = []
    to_test_links = links
    # open a .txt file to save results into
    all_urls = open('all_urls.txt', 'w')
    problematic_pages = open('problematic_pages.txt', 'w')
    while len(to_test_links)>0:
        for link in to_test_links:
            print('the link we are testing right now:', link)
            # add the current link to the tested list
            tested_links.append(link)
            try:
                print(type(link))
                all_urls.write(link)
                all_urls.write('\n')
                # Save it to the -txt file and make an abstract
                # get the link ready to be accessed
                req = Request(link)
                html_page = urlopen(req)
                soup = BeautifulSoup(html_page, features="html.parser")
                # empty previous temporary links
                templinks = []
                # move the links on the subpage link to templinks
                for sublink in soup.findAll('a'):
                    templinks.append(sublink.get('href'))
                # clean off accidental 'None' values
                templinks = list(filter(lambda item: item is not None, templinks))
                for templink in templinks:
                    # make sure we have still the correct website and don't accidentally crawl instagram etc.
                    # also avoid duplicates
                    if templink.find(url) == 0 and templink not in links:
                        links.append(templink)
                # and lastly refresh the to_test_links list with the newly found links before going back into the loop
                to_test_links = (list(set(links) ^ set(tested_links)))
            except:
                # Save it to the ERROR -txt file and make an abstract
                problematic_pages.write(link)
                problematic_pages.write('\n')
                print('ERROR: All links on', link, 'not retrieved. If need be check for new subpages manually.')
    all_urls.close()
    problematic_pages.close()
    return links
CodePudding user response:
I can't reproduce this, but I've had inexplicable [to me at least] errors with file handling that were resolved when I wrote from inside a with.
[Just make sure to first remove the lines involving all_urls in your current code, just in case - or try this with a different filename while checking whether it works.]
Since you're appending all the urls to tested_links anyway, you could just write them all at once after the while loop:
with open('all_urls.txt', 'w') as f:
    f.write('\n'.join(tested_links) + '\n')
or, if you have to write link by link, you can append by opening with mode='a':
# before the while, if you're not sure the file exists
# [and/or to clear previous data from file]
# with open('all_urls.txt', 'w') as f: f.write('')
# and inside the try block:
with open('all_urls.txt', 'a') as f:
    f.write(f'{link}\n')
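To make that concrete, here is a tiny self-contained sketch of the per-link append pattern (the save_line helper and the example links are placeholders I made up, not part of your code):
def save_line(path, text):
    # open in append mode, write one line, and close again right away,
    # so the data is flushed even if a later iteration raises
    with open(path, 'a') as f:
        f.write(f'{text}\n')

# truncate/create the file once up front
with open('all_urls.txt', 'w'):
    pass

for link in ['https://example.com/', 'https://example.com/about']:
    save_line('all_urls.txt', link)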
CodePudding user response:
Not a direct answer, but in my early days this happened to me too. The requests module of Python sends requests with headers that identify Python; websites can quickly detect that, your IP can get blocked, and you start getting unusual responses - that's why your previously working function isn't working now.
Solution:
Use natural-looking request headers - see the code below:
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
r = requests.get(URL, headers=headers)
Use a proxy in case your IP got blocked - this is highly recommended.
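Since your function uses urllib rather than requests, the same idea there would look roughly like this (just a sketch; the user-agent string and the example URL are placeholders):
from urllib.request import urlopen, Request

# browser-like user agent (example value only)
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
           'AppleWebKit/537.36 (KHTML, like Gecko) '
           'Chrome/75.0.3770.142 Safari/537.36'}

link = 'https://example.com/'  # placeholder URL
req = Request(link, headers=headers)
html_page = urlopen(req)
print(html_page.status)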
CodePudding user response:
Here is your slightly changed script with the changes marked (*****************):
def get_subpages(url):
    # gets all subpage links from a website that start with the given url
    from urllib.request import urlopen, Request
    from bs4 import BeautifulSoup
    links = [url]
    # ******************* added sublinks_list variable ******************
    sublinks_list = []
    tested_links = []
    to_test_links = links
    # open a .txt file to save results into
    all_urls = open('all_urls.txt', 'w')
    problematic_pages = open('problematic_pages.txt', 'w')
    while len(to_test_links)>0:
        for link in to_test_links:
            print('the link we are testing right now:', link)
            # add the current link to the tested list
            tested_links.append(link)
            try:
                all_urls.write(link)
                all_urls.write('\n')
                # Save it to the -txt file and make an abstract
                # get the link ready to be accessed
                req = Request(link)
                html_page = urlopen(req)
                soup = BeautifulSoup(html_page, features="html.parser")
                # empty previous temporary links
                templinks = []
                # move the links on the subpage link to templinks
                sublinks = soup.findAll('a')
                for sublink in sublinks:
                    #templinks.append(sublink.get('href')) ***************** changed the line with next row *****************
                    templinks.append(sublink['href'])
                # clean off accidental 'None' values
                templinks = list(filter(lambda item: item is not None, templinks))
                for templink in templinks:
                    # make sure we have still the correct website and don't accidentally crawl instagram etc.
                    # also avoid duplicates
                    #if templink.find(url) == 0 and templink not in links: ******************* changed the line with next row *****************
                    if templink not in sublinks_list:
                        #links.append(templink) ******************* changed the line with next row *****************
                        sublinks_list.append(templink)
                        all_urls.write(templink + '\n')  # ******************* added this line *****************
                # and lastly refresh the to_test_links list with the newly found links before going back into the loop
                to_test_links = (list(set(links) ^ set(tested_links)))
            except:
                # Save it to the ERROR -txt file and make an abstract
                problematic_pages.write(link)
                problematic_pages.write('\n')
                print('ERROR: All links on', link, 'not retrieved. If need be check for new subpages manually.')
    all_urls.close()
    problematic_pages.close()
    return links

lnks = get_subpages('https://www.jhanley.com/blog/pyscript-creating-installable-offline-applications/')  # ******************* url used for testing *****************
It works and there are over 180 links in the file. Please test it yourself. There are still some misfits and questionable syntax, so you should test your code thoroughly again - but the part that writes links into the file works.
Regards...