I'm trying to create a web-scraping script in Python that follows a bunch of links and inserts them into a .txt file. However, I want to do this only if the website doesn't already exist in the file.
So far I have written this code to insert the given website link into the file, but it is not working:
def writeSite(site):
    file = open("websites.txt", 'a+')
    # print(site)
    if site in file.read():
        return
    file.write(site + "\n")
    file.close()
Thanks in advance.
CodePudding user response:
You were pretty close, but because you open the file to append to it, it starts with the file pointer at the end. You need to seek to the start to read its contents again:
def writeSite(site):
    file = open("websites.txt", 'a+')
    file.seek(0)
    # print(site)
    if site in file.read():
        return
    file.write(site + "\n")
    file.close()
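To see the file pointer behaviour for yourself, here is a minimal sketch. It assumes a throwaway file in a temporary directory (the path is made up for the demonstration, it is not your real websites.txt) and simply reads before and after the seek:

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "websites.txt")
with open(path, "w") as f:
    f.write("http://somesite.com/\n")

with open(path, "a+") as f:
    before_seek = f.read()  # position starts at the end, so nothing is read
    f.seek(0)               # jump back to the start of the file
    after_seek = f.read()   # now the whole file comes back

print(repr(before_seek))
print(repr(after_seek))
```

The first read comes back empty, which is why the original check never found anything.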
However, keep in mind that the check site in file.read() is very crude. For example, imagine you already have 'http://somesite.com/page/' in the file but now you want to add 'http://somesite.com/' - the URL as a whole is not in the file, but your test will still find it, because it only looks for a substring.
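A small sketch of that difference, using the two example URLs from above and a string standing in for the file contents (no real file is involved here):

```python
# Stand-in for what file.read() would return.
contents = "http://somesite.com/page/\n"
site = "http://somesite.com/"

# Substring test: matches because the shorter URL is a prefix of the stored one.
substring_match = site in contents

# Whole-line test: compares against complete lines only.
line_match = site in contents.splitlines()

print(substring_match, line_match)
```

The substring test reports a match even though the URL was never stored as its own line, while the whole-line test does not.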
If you want to check whole lines (and be sure you deal with the file nicely), this would be better:
def writeSite(site):
    site += '\n'
    with open("websites.txt", 'a+') as f:
        f.seek(0)
        if site in f.readlines():
            return
        f.write(site)
It adds a newline to the name of the site to separate the URLs in the file and uses readlines to make use of that fact to check for the whole URL. Using with ensures the file always gets closed.
And since you want to read before writing anyway, you could use 'r+' as a mode, and skip the seek - but only if you can be sure the file already exists. I assume you chose 'a+' because that isn't the case.
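For completeness, a rough sketch of the 'r+' variant. The function name write_site_existing, the path parameter and the True/False return value are my own additions for illustration, and it assumes the file already exists (otherwise open raises FileNotFoundError):

```python
def write_site_existing(path, site):
    """Append site to the file at path unless that exact line is already there.

    Assumes the file exists; 'r+' does not create it.
    Returns True if the site was written, False if it was already present.
    """
    line = site + "\n"
    with open(path, "r+") as f:
        # 'r+' starts at the beginning, so no seek is needed before reading.
        if line in f.readlines():
            return False
        # readlines() left the position at the end, so this appends.
        f.write(line)
        return True
```

Calling it twice with the same URL should write the line the first time and skip it the second time.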
(In case you worry that this changes the value of site - that's only true for the parameter inside the function. Whatever value you passed in outside the function will remain unaffected.)
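If you want to convince yourself of that, here is a tiny sketch. The helper add_newline is made up for the demonstration and does the same site += '\n' as the function above:

```python
def add_newline(site):
    # Strings are immutable, so += rebinds the local name to a new string.
    site += "\n"
    return site

url = "http://somesite.com/"
result = add_newline(url)

print(repr(url))
print(repr(result))
```

The variable url outside the function keeps its original value; only the returned string has the newline.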
CodePudding user response:
I'm new to Python but I think this will help you. Steps for writing to text files