I have a file named urls.tmp that contains these 3 URLs:
https://site1.com.br/wp-content/uploads/2020/06/?SD
https://site2.com.br/wp-content/uploads/tp-datademo/home-4/data/tp-hotel-booking/?SD
https://site3.com.br/wp-content/uploads/revslider/hotel-home/?MD
I want to remove everything after "com.br/" in each URL.
I tried this code:
# open the file
sys.stdout = open("urls.tmp", "w")
# start remove
for i in "urls.tmp":
    url_parts = urllib.parse.urlparse(i)
    result = '{uri.scheme}://{uri.netloc}/'.format(uri=url_parts)
    print(result) #overwrite the file
# close the file
sys.stdout.close()
But the output was this strange result:
:///
:///
:///
:///
:///
:///
:///
:///
I'm a beginner. What am I doing wrong?
CodePudding user response:
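You're iterating over the string "urls.tmp" itself, character by character, but you want to go through the opened file object, line by line. Also, opening the file with "w" before reading it truncates it, so the URLs are lost.
So try this instead: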
import urllib.parse

with open("urls.tmp", "r") as urls_file:
    for line in urls_file:
        # strip() removes the trailing newline before parsing
        url_parts = urllib.parse.urlparse(line.strip())
        result = "{uri.scheme}://{uri.netloc}/".format(uri=url_parts)
        print(result)
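To see why the original loop printed ":///" eight times, here is a small sketch that repeats the same iteration without touching any file. The helper name origin is only for illustration; the point is that looping over a str yields its characters, and a single character has neither a scheme nor a host.

```python
from urllib.parse import urlparse

filename = "urls.tmp"

# Iterating over a str yields its characters, not the lines of the file.
chars = list(filename)
print(chars)

def origin(text):
    """Format scheme and host the same way the question's code does."""
    parts = urlparse(text)
    return f"{parts.scheme}://{parts.netloc}/"

# One result per character of the file name.
results = [origin(ch) for ch in filename]
print(len(results))
print(set(results))
```

The file name has eight characters, which matches the eight lines of output in the question.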
Edit: the author updated the original question to say that the source file should be overwritten with the processed URLs. Here's an example:
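import urllib.parse

new_urls = []
# read all lines first, so the file can be reopened for writing afterwards
with open("urls.tmp", "r") as urls_file:
    old_urls = urls_file.readlines()
for line in old_urls:
    url_parts = urllib.parse.urlparse(line.strip())
    proc_url = "{uri.scheme}://{uri.netloc}/\n".format(uri=url_parts)
    new_urls.append(proc_url)
# "w" truncates the file, then the processed URLs are written back
with open("urls.tmp", "w") as urls_file:
    urls_file.writelines(new_urls)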
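If the file might contain blank lines or lines that are not URLs, the loop above would write ":///" for them. Below is a minimal sketch of one way to guard against that, assuming such lines should simply be dropped. The names to_origin and rewrite_urls are hypothetical, and the demo runs on a temporary copy rather than the real urls.tmp.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlsplit

def to_origin(line):
    """Return 'scheme://host/' for one line, or None if it is not a URL."""
    parts = urlsplit(line.strip())
    if not parts.scheme or not parts.netloc:
        return None
    return f"{parts.scheme}://{parts.netloc}/"

def rewrite_urls(path):
    """Replace each URL in the file with its origin, dropping non-URL lines."""
    path = Path(path)
    origins = [to_origin(line) for line in path.read_text().splitlines()]
    path.write_text("".join(f"{o}\n" for o in origins if o is not None))

# Demo on a throwaway file so the real urls.tmp is not modified.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "urls.tmp"
    demo.write_text(
        "https://site1.com.br/wp-content/uploads/2020/06/?SD\n"
        "\n"
        "https://site3.com.br/wp-content/uploads/revslider/hotel-home/?MD\n"
    )
    rewrite_urls(demo)
    demo_result = demo.read_text()

print(demo_result)
```

Reading the whole file before writing keeps the two steps separate, which avoids the truncation problem from the question.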
CodePudding user response:
See Savva Surenkov's answer to solve your issue.
You can use the split method of strings like:
url = r"https://site1.com.br/wp-content/uploads/2020/06/?SD"
split_by = "com.br/"
new_url = url.split(split_by)[0] + split_by
# split() gives you the part before <split_by>, and then we attach split_by again
new_url == r"https://site1.com.br/"  # True
If you want to add some additional checks, you might look into regular expressions.
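As a rough idea of what a regular-expression check could look like, here is a small sketch. The pattern and the name origin_or_none are my own assumptions, not something from the question; it only accepts http and https URLs and returns None for anything else.

```python
import re

# Scheme, "://", then the host up to (not including) the first "/", "?", "#"
# or whitespace character.
ORIGIN_RE = re.compile(r"^(https?://[^/?#\s]+)")

def origin_or_none(url):
    """Return 'scheme://host/' if url looks like an http(s) URL, else None."""
    match = ORIGIN_RE.match(url.strip())
    return match.group(1) + "/" if match else None

print(origin_or_none("https://site1.com.br/wp-content/uploads/2020/06/?SD"))
print(origin_or_none("not a url"))
```

Unlike splitting on "com.br/", this does not depend on the domain ending, so it would also work for hosts under other top-level domains.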
Some things you did not ask for, but which might help you as a beginner: I recommend using
with open("urls.tmp", "w") as f:
    ...  # do something with f
or
import pathlib
urls = pathlib.Path("urls.tmp").read_text()
# which gives you all lines in a single string
over plain open. If you want to know more about that, I recommend looking into context managers.
Also, there are f-strings since Python 3.6, which are in my opinion easier to read than "{}".format.
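To make the f-string comparison concrete, here is a short sketch that builds the same result both ways. It uses urlsplit from the standard library and one of the URLs from the question; the variable names are only for illustration.

```python
from urllib.parse import urlsplit

url = "https://site2.com.br/wp-content/uploads/tp-datademo/home-4/data/tp-hotel-booking/?SD"
parts = urlsplit(url)

# str.format with attribute access, as in the question
with_format = "{uri.scheme}://{uri.netloc}/".format(uri=parts)

# the equivalent f-string (Python 3.6+)
with_fstring = f"{parts.scheme}://{parts.netloc}/"

print(with_format)
print(with_fstring)
```

Both lines print the same URL; the f-string just keeps the values next to the place where they are used.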
CodePudding user response:
You can use the find() method of strings.
urllist = [
    'https://site1.com.br/wp-content/uploads/2020/06/?SD',
    'https://site2.com.br/wp-content/uploads/tp-datademo/home-4/data/tp-hotel-booking/?SD',
    'https://site3.com.br/wp-content/uploads/revslider/hotel-home/?MD',
]
newlist = []
breaktext = 'com.br/'
for item in urllist:
    # find() returns the index where breaktext starts
    position = item.find(breaktext)
    # keep everything up to and including breaktext
    newlist.append(item[:position + len(breaktext)])
print(newlist)
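One thing to watch with find(): it returns -1 when the text is not present, so the slice above would silently cut such a URL down to its first six characters. A possible alternative, sketched here with a hypothetical helper named cut_after, is str.partition, which leaves the string unchanged when the marker is missing. The example.org URL is made up for the demonstration.

```python
def cut_after(url, marker="com.br/"):
    """Keep everything up to and including marker; unchanged if marker is absent."""
    head, sep, _tail = url.partition(marker)
    # When marker is not found, partition returns (url, "", ""),
    # so head + sep is simply the original string.
    return head + sep

print(cut_after("https://site1.com.br/wp-content/uploads/2020/06/?SD"))
print(cut_after("https://example.org/path"))
```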