How do I remove duplicated domains and invalid HTML files from a list in python?

Time:09-13

I have a list of URLs. How do I remove the URLs that are not valid HTML pages? For example, I want to remove PDF/JPG files, and I also want to remove duplicated domains.

example_list = [
'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf', 
'https://the1955club.com/', 
'https://the1955club.com/aboutus']

So the new list should be:

new_list = ['https://the1955club.com/']

CodePudding user response:

First: in Python it is better to build a new list with the elements you want to keep and then assign it back to the old variable.
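A small illustration of why a new list is the safer choice, using hypothetical file names that are not part of the question. Removing items from a list while looping over it makes the loop skip the element that follows each removal, so some unwanted items can survive.

```python
urls = ["a.pdf", "b.pdf", "c.html"]  # hypothetical values for illustration

# Removing while iterating: after "a.pdf" is removed, "b.pdf" shifts into
# its slot and the loop moves past it without ever checking it.
mutated = list(urls)
for url in mutated:
    if url.endswith(".pdf"):
        mutated.remove(url)

# Building a new list: every element is examined exactly once.
kept = [url for url in urls if not url.endswith(".pdf")]

print(mutated)
print(kept)
```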


You can use urllib.parse.urlparse(url) to split the URL and check whether .path is "" or "/".

import urllib.parse

example_list = [
    'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf', 
    'https://the1955club.com/', 
    'https://the1955club.com/aboutus'
]

new_list = []

for url in example_list:
    parts = urllib.parse.urlparse(url)
    if parts.path in ("", "/"):
        new_list.append(url)
        
example_list = new_list

print(example_list)

EDIT:

But if you want to filter out URLs that really are not valid, you may need to use requests and try to fetch each URL. If you don't get a response, the URL is wrong.
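One possible sketch of that idea. It goes a step further than "did I get a result" and also looks at the Content-Type header, which is an assumption about what "valid HTML" should mean here. To stay dependency-free it uses the standard library's urllib.request instead of requests; the function names and the opener parameter are hypothetical helpers introduced for this example.

```python
import urllib.error
import urllib.request


def is_html_content_type(content_type):
    """Return True if a Content-Type header value denotes an HTML document."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and compare case-insensitively.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def url_serves_html(url, opener=urllib.request.urlopen, timeout=5):
    """Send a HEAD request and report whether the server answers with HTML."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return is_html_content_type(response.headers.get("Content-Type"))
    except (urllib.error.URLError, ValueError, OSError):
        # Unreachable host, HTTP error status, or malformed URL:
        # treat the URL as not valid.
        return False
```

Usage would be something like new_list = [url for url in example_list if url_serves_html(url)]. Some servers reject HEAD requests, so a real script may need to fall back to GET.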

CodePudding user response:

This checks both the domain and the file path of each URL: if it finds a repeated domain or a bad file path (ending with .pdf, .jpg, etc.), it skips that URL.

import re

valid_domains = set()

example_list = [
'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf', 
'https://the1955club.com/', 
'https://the1955club.com/aboutus',
'http://the1955club.com/aboutus']

print()

for example in example_list:
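    # group(1) is the domain, group(2) is the path starting at the first "/"
    res = re.search(r'https?://(.+?)(/.*)', example)
    # non-empty when the path ends with one of the unwanted file extensions
    extension = re.findall(r'^.*\.(jpg|JPG|gif|GIF|doc|DOC|pdf|PDF|jpy|JPY)$', res.group(2))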
    if res.group(1) in valid_domains or len(extension):
        continue

    valid_domains.add(res.group(1))
    
print(valid_domains)
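The code above collects domains, while the question asks for a list of URLs. A minimal sketch that combines both answers and returns the first acceptable URL for each domain might look like this. The function name filter_urls and the extension set are assumptions made for this example, not something from the question.

```python
import posixpath
import urllib.parse

# Assumed set of extensions treated as "not HTML"; extend as needed.
NON_HTML_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".gif", ".doc"}


def filter_urls(urls):
    """Keep the first URL seen for each domain, skipping non-HTML file links."""
    seen_domains = set()
    kept = []
    for url in urls:
        parts = urllib.parse.urlparse(url)
        domain = parts.netloc.lower()
        # splitext works on the URL path only, so query strings are ignored.
        extension = posixpath.splitext(parts.path)[1].lower()
        if not domain or extension in NON_HTML_EXTENSIONS or domain in seen_domains:
            continue
        seen_domains.add(domain)
        kept.append(url)
    return kept


example_list = [
    'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf',
    'https://the1955club.com/',
    'https://the1955club.com/aboutus',
]

print(filter_urls(example_list))  # ['https://the1955club.com/']
```

Note that this compares the host name as written, so www.example.com and example.com would count as two different domains.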
    