How to format the string selection?

Time:04-01

I just got a Python routine working that scrapes links from a lot of web pages based on the name of the server. It runs, but the output is not in the expected format:

Desired output:

https://www.someserver.com/files/1
https://www.someserver.com/files/2
https://www.someserver.com/files/3....  

Actual output:

[None, '//server.org', '//server.org', '//server.org/recent', '//server.org/popular', '//server.org/trolls', 'https://server.org/software/', 'https://www.serverstore.com', '//server.org/submission', '//server.org/my/login', '//server.org/my/newuser', '//devices.server.org', '//build.server.org', '//entertainment.server.org', '//technology.server.org', '//server.org/?fhfilter=somefilter', '//science.server.org', '//yro.server.org', 'http://rss.server.org/server/serverMain', 'http://www.facebook.com/server', 'https://server.org', '#', '//server.org/blog', '#', '#', '#', '//server.org']   

So how can I customize the concatenation to get the intended format instead of //server.org, or how should I format the soup.findAll loop and the append?

Thanks so much.

CODE

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

req = Request("https://somepagewithlinks.com")
html_page = urlopen(req)

soup = BeautifulSoup(html_page, "lxml")

links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))

print(links)

file = open("lk", "w")
lista = repr(links)
file.write(str(links))
file.close

UPDATE
Thanks to uingtea, but I got lost: changing the link/links instructions fails and shows an error related to

 file.close
<built-in method close of _io.TextIOWrapper object at 0x7ffe8ec74b40>

And when I use file.close() it produces an empty file. I understand that a list (links) must be defined first and then referenced as links.instruction(). What am I missing?

CodePudding user response:

Check how each string starts:

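for link in soup.findAll('a'):
    link = link.get('href')
    if link is None:
        # <a> tags without an href return None; skip them
        continue
    if link.startswith('//'):
        # protocol-relative URL: add the scheme
        link = 'https:' + link
    elif link.startswith('#'):
        # fragment-only link: prepend the site address (placeholder)
        link = 'https://domainname/' + link

    links.append(link)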
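
As an alternative to checking prefixes by hand, the standard library's urllib.parse.urljoin can resolve each href against the page address. The sketch below is only one possible approach: the helper names normalize_links and write_links are made up for illustration, the base URL and sample list are assumptions modelled on the output shown in the question, and dropping '#' links (rather than prefixing them) is a design choice you may not want. It also writes one URL per line inside a with block, so the file is closed without an explicit file.close() call.

```python
from urllib.parse import urljoin


def normalize_links(hrefs, base_url):
    """Turn raw href values into absolute URLs, skipping unusable entries."""
    result = []
    for href in hrefs:
        # link.get('href') yields None for <a> tags that have no href;
        # fragment-only links ('#') are dropped here by choice
        if not href or href.startswith('#'):
            continue
        # urljoin resolves '//host/path' and '/path' against base_url
        result.append(urljoin(base_url, href))
    return result


def write_links(path, links):
    """Write one URL per line; the with block closes the file on exit."""
    with open(path, "w") as f:
        f.write("\n".join(links) + "\n")


# Hypothetical sample input, shaped like the actual output in the question
sample = [None, '//server.org', '//server.org/recent', '#',
          'https://server.org/software/']
links = normalize_links(sample, "https://server.org")
print("\n".join(links))
```

In the original script this would replace print(links) and the open/write/close lines: collect the raw values with link.get('href') as before, pass the list through normalize_links, then call write_links("lk", links).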