How do I hierarchically sort URLs in Python?


I've been banging my head on this problem for a week, and I can't figure it out.

Given an initial list of URLs crawled from a site:

https://somesite.com/
https://somesite.com/advertise
https://somesite.com/articles
https://somesite.com/articles/read
https://somesite.com/articles/read/1154
https://somesite.com/articles/read/1155
https://somesite.com/articles/read/1156
https://somesite.com/articles/read/1157
https://somesite.com/articles/read/1158
https://somesite.com/blogs

I am trying to turn the list into a tab-indented tree hierarchy:

https://somesite.com
    /advertise
    /articles
        /read
            /1154
            /1155
            /1156
            /1157
            /1158
    /blogs

I've tried using lists, tuples, and dictionaries. So far I have figured out two flawed ways to output the content.

Method 1 will miss elements if they have the same name and position in the hierarchy:

Input:
https://somesite.com
https://somesite.com/missions
https://somesite.com/missions/playit
https://somesite.com/missions/playit/extbasic
https://somesite.com/missions/playit/extbasic/0
https://somesite.com/missions/playit/stego
https://somesite.com/missions/playit/stego/0
Output:
https://somesite.com/
    /missions
        /playit
            /extbasic
                /0
            /stego

----------------^ Missing expected output "/0"

Method 2 will not miss any elements, but it will print redundant content:

Input:
https://somesite.com
https://somesite.com/missions
https://somesite.com/missions/playit
https://somesite.com/missions/playit/extbasic
https://somesite.com/missions/playit/extbasic/0
https://somesite.com/missions/playit/stego
https://somesite.com/missions/playit/stego/0
Output:
https://somesite.com/
    /missions
        /playit
            /extbasic
                /0
    /missions       <- Redundant content
        /playit     <- Redundant content
            /stego      
                /0

I'm at my wit's end on how to do this properly, and my googling has only turned up references to urllib that don't seem to be what I need.

I'm sure there is a much better approach, but I have been unable to find it.
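
For what it's worth, urllib.parse does handle the URL-splitting half of the problem; it just doesn't build the hierarchy:

from urllib.parse import urlparse

parts = urlparse("https://somesite.com/articles/read/1154")
print(parts.scheme)  # https
print(parts.netloc)  # somesite.com
print(parts.path)    # /articles/read/1154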

My code for getting the content into a usable list:

#!/usr/bin/python3

import re

# Read the original list of URLs from file
with open("sitelist.raw", "r") as f:
    raw_site_list = f.readlines()

# Extract the prefix and domain from the first line
first_line = raw_site_list[0]
prefix, domain = re.match(r"(https?://)([^/]+)", first_line).group(1, 2)

# Remove the prefix and domain plus any trailing newline, and drop
# lines that are empty or end in a slash
# (str.strip(chars) removes a character set rather than a literal
# prefix, so slice off the known-length prefix instead)
clean_site_list = []
for line in raw_site_list:
    clean_line = line.strip()[len(prefix) + len(domain):]
    if clean_line and not clean_line.endswith("/"):
        clean_site_list += [clean_line]

# Split the resulting relative paths into their component parts and filter out empty strings
split_site_list = []
for site in clean_site_list:
    split_site_list += [list(filter(None, site.split("/")))]

This gives a list to manipulate, but I've run out of ideas on how to output it without losing elements or outputting redundant elements.
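
For the original sample list, that leaves a list of path-component lists to work with:

[['advertise'],
 ['articles'],
 ['articles', 'read'],
 ['articles', 'read', '1154'],
 ['articles', 'read', '1155'],
 ['articles', 'read', '1156'],
 ['articles', 'read', '1157'],
 ['articles', 'read', '1158'],
 ['blogs']]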

Thanks

CodePudding user response:

This works with your sample data:

urls = ['https://somesite.com',
        'https://somesite.com/missions',
        'https://somesite.com/missions/playit',
        'https://somesite.com/missions/playit/extbasic',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/stego',
        'https://somesite.com/missions/playit/stego/0']


base = urls[0]
print(base)
tabdepth = 0
tlen = len(base.split('/'))

for url in urls[1:]:
    t = url.split('/')
    lt = len(t)
    if lt != tlen:
        # one level deeper or shallower than the previous URL
        tabdepth += 1 if lt > tlen else -1
        tlen = lt
    pad = '    ' * tabdepth
    print(f'{pad}/{t[-1]}')
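
With the sample data this prints:

https://somesite.com
    /missions
        /playit
            /extbasic
                /0
            /stego
                /0

Note that tabdepth only moves one step per line, so this assumes the sorted input never jumps more than one level between consecutive URLs. That holds for this sample, but in the original list, where /articles/read/1158 is followed directly by /blogs, the /blogs line would come out over-indented.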

CodePudding user response:

This code will help you with your task. I agree it is a bit long and contains some redundant code and checks, but it builds a dictionary holding the hierarchy of the URLs, and you can use that dictionary however you like: print it or store it.

Moreover, it will also handle URLs from different domains and build a separate tree for each of them (see code and output).

EDIT: This will also take care of redundant URLs.

Code:

from json import dumps  # handy for dumping the finished tree as JSON


def process_urls(urls: list):
    tree = {}

    for url in urls:
        url_components = url.split("/")
        # The first three components are the protocol,
        # an empty entry (from the "//"),
        # and the base domain
        base_domain = url_components[:3]
        base_domain = base_domain[0] + "//" + "".join(base_domain[1:])
        # Add the base domain to the tree if it is not there yet
        if base_domain not in tree:
            tree[base_domain] = {}

        structure = url_components[3:]

        for i in range(len(structure)):
            if i == 0:
                # add the first path component directly under the domain
                if "/" + structure[i] not in tree[base_domain]:
                    tree[base_domain]["/" + structure[i]] = {}
            else:
                # walk down to the parent node of the current component
                base = tree[base_domain]["/" + structure[0]]
                for j in range(1, i):
                    base = base["/" + structure[j]]

                if "/" + structure[i] not in base:
                    base["/" + structure[i]] = {}

    return tree


def print_tree(tree: dict, depth=0):
    for key in tree.keys():
        print("\t" * depth + key)

        # only recurse into dictionary values
        if type(tree[key]) == dict:
            # if the dictionary is empty do nothing,
            # else call this function recursively
            # with the depth increased by 1
            if tree[key]:
                print_tree(tree[key], depth + 1)


if __name__ == "__main__":
    urls = [
        'https://somesite.com',
        'https://somesite.com/missions',
        'https://somesite.com/missions/playit',
        'https://somesite.com/missions/playit/extbasic',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/stego',
        'https://somesite.com/missions/playit/stego/0',
        'https://somesite2.com/missions/playit',
        'https://somesite2.com/missions/playit/extbasic',
        'https://somesite2.com/missions/playit/extbasic/0',
        'https://somesite2.com/missions/playit/stego',
        'https://somesite2.com/missions/playit/stego/0'
    ]
    tree = process_urls(urls)
    print_tree(tree)

Output:

https://somesite.com
    /missions
            /playit
                    /extbasic
                            /0
                    /stego
                            /0
https://somesite2.com
    /missions
            /playit
                    /extbasic
                            /0
                    /stego
                            /0
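
Since the tree is just a plain nested dictionary, you can also pretty-print it or save it with the dumps import at the top, for example (the filename here is only illustrative):

print(dumps(tree, indent=4))

with open("sitetree.json", "w") as f:
    f.write(dumps(tree, indent=4))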