I have a list of URLs. How do I remove the URLs that do not point to a valid HTML page (for example, I want to remove .pdf/.jpg files) and also remove duplicated domains?
example_list = [
    'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf',
    'https://the1955club.com/',
    'https://the1955club.com/aboutus']
so the new list should contain only the below:
new_list = ['https://the1955club.com/']
CodePudding user response:
First: in Python it is better to create a new list with the elements you want to keep, and later assign it back to the old variable.

You can use urllib.parse.urlparse(url) to split the URL and check whether its .path is "" or "/".
import urllib.parse

example_list = [
    'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf',
    'https://the1955club.com/',
    'https://the1955club.com/aboutus'
]

new_list = []
for url in example_list:
    parts = urllib.parse.urlparse(url)
    # keep only URLs whose path is empty or just "/" (i.e. bare domains)
    if parts.path in ("", "/"):
        new_list.append(url)

example_list = new_list
print(example_list)  # ['https://the1955club.com/']
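
This keeps only bare-domain URLs, but it does not yet handle the second requirement, duplicated domains. A minimal sketch that extends the same idea with a set of already-seen domains (the keep_unique_homepages and seen_domains names are just for illustration):

import urllib.parse

def keep_unique_homepages(urls):
    seen_domains = set()
    result = []
    for url in urls:
        parts = urllib.parse.urlparse(url)
        if parts.path not in ("", "/"):
            continue  # skip URLs that point at a file or subpage
        if parts.netloc in seen_domains:
            continue  # skip domains that were already kept
        seen_domains.add(parts.netloc)
        result.append(url)
    return result

print(keep_unique_homepages(example_list))  # ['https://the1955club.com/']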
EDIT:
But if you want to filter out URLs which really are not valid, then you may need to use requests and try to fetch each URL. If you don't get a result, then the URL is wrong.
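
A minimal sketch of that idea, assuming the third-party requests package is installed (the is_html_url name and the 5-second timeout are illustrative choices):

import requests

def is_html_url(url):
    """Return True if the server answers and reports an HTML content type."""
    try:
        # HEAD avoids downloading the body; follow redirects so the final
        # response is the one whose headers we inspect
        response = requests.head(url, allow_redirects=True, timeout=5)
        return response.ok and 'text/html' in response.headers.get('Content-Type', '')
    except requests.RequestException:
        return False  # connection error, timeout, malformed URL, etc.

new_list = [url for url in example_list if is_html_url(url)]

Note that some servers reject HEAD requests, so falling back to requests.get when HEAD fails is a common refinement.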
CodePudding user response:
This will check both the domain and the file path; if it finds a repeated domain or a bad file path (ending with .pdf, .jpeg, etc.) it will ignore that URL.
import re

valid_domains = set()
example_list = [
    'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf',
    'https://the1955club.com/',
    'https://the1955club.com/aboutus',
    'http://the1955club.com/aboutus']

for example in example_list:
    # group(1) is the domain, group(2) is the path
    res = re.search(r'https?://(.+?)(/.*)', example)
    # collect a match if the path ends with a non-HTML file extension
    extension = re.findall(r'\.(jpg|jpeg|gif|doc|pdf)$', res.group(2), re.IGNORECASE)
    if res.group(1) in valid_domains or extension:
        continue
    valid_domains.add(res.group(1))

print(valid_domains)  # {'the1955club.com'}