I just got a Python routine working that scrapes links from many webpages based on the server name. It runs, but the output is not in the expected format:
Desired output:
https://www.someserver.com/files/1
https://www.someserver.com/files/2
https://www.someserver.com/files/3....
Actual output:
[None, '//server.org', '//server.org', '//server.org/recent', '//server.org/popular', '//server.org/trolls', 'https://server.org/software/', 'https://www.serverstore.com', '//server.org/submission', '//server.org/my/login', '//server.org/my/newuser', '//devices.server.org', '//build.server.org', '//entertainment.server.org', '//technology.server.org', '//server.org/?fhfilter=somefilter', '//science.server.org', '//yro.server.org', 'http://rss.server.org/server/serverMain', 'http://www.facebook.com/server', 'https://server.org', '#', '//server.org/blog', '#', '#', '#', '//server.org']
So how can I customize the concatenation to get the intended format instead of //server.org, or how should I change the soup.findAll and append calls?
Thanks so much.
CODE
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req = Request("https://somepagewithlinks.com")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")

# Collect the raw href attribute of every anchor tag
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
print(links)

file = open("lk", "w")
file.write(str(links))
file.close()  # note the parentheses: file.close without () does not close the file
UPDATE
Thanks to uingtea, but I got lost: after changing the link/links instructions the script fails and prints something related to
file.close
<built-in method close of _io.TextIOWrapper object at 0x7ffe8ec74b40>
And when I use file.close()
it produces an empty file. I understand that a list (links) must be defined first and only then referenced as links.method(). What am I missing?
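For reference, a minimal sketch of writing one link per line. The filename "lk" is kept from the question; the list here is a hypothetical stand-in for what the scraper collects:

```python
# Hypothetical stand-in list; in the real script this comes from soup.findAll('a')
links = [
    "https://www.someserver.com/files/1",
    "https://www.someserver.com/files/2",
]

# The with-statement closes the file automatically when the block ends,
# so a forgotten or mistyped file.close() can never leave data unflushed.
with open("lk", "w") as f:
    for link in links:
        f.write(link + "\n")
```

Writing inside `with` also avoids the empty-file symptom: the file is guaranteed to be flushed and closed even if an exception is raised mid-loop.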
CodePudding user response:
Check how the string starts:

for link in soup.findAll('a'):
    link = link.get('href')
    if link is None:  # some <a> tags have no href at all
        continue
    if link.startswith('//'):
        link = 'https:' + link
    elif link.startswith('#'):
        link = 'https://domainname/' + link  # replace domainname with the real host
    links.append(link)
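As an alternative sketch, the standard library's urllib.parse.urljoin handles scheme-relative (//...), root-relative (/...), and absolute hrefs in one call, so no manual prefix checks are needed. The base URL below is an assumption; substitute the page you actually fetched:

```python
from urllib.parse import urljoin

base = "https://server.org/"  # assumed base; use the URL you scraped

# Hypothetical sample of raw href values like those in the question's output
raw = [None, '//server.org/recent', '/blog', '#', 'https://www.serverstore.com']

links = []
for href in raw:
    if not href or href == '#':  # skip missing hrefs and bare fragment anchors
        continue
    links.append(urljoin(base, href))

print(links)
```

urljoin resolves each href against the base exactly the way a browser would, so absolute URLs pass through unchanged while relative ones are completed.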