Home > Software design >  How to scrape out all of the items in dictionary - Python
How to scrape out all of the items in dictionary - Python

Time:06-23

I'm working on a 'bookmarks' related app. I have a dictionary data that contains all of the bookmarks from the browser. The input looks like this:

data = {
    "bookmarks_tab": {
        'children': [
            # This parent folder contains multiple nested folders, eg:
            {
                'name': 'nested folder 1',
                'type': 'folder',
                'children': [
                    # Now this nested folder can have multiple nested folders!
                    {
                        'name': 'nested subfolder',
                        'type': 'folder',
                        'children': [
                            # So on and on
                        ]
                    }
                ]
            }

        ],
        'type': 'folder',
        'name': 'bookmarks_tab'
    }
}

What approach should I take to find out how many folders (including nested sub folders) does the input have, including its name. Remember it can literally have any amount of nested folders.

CodePudding user response:

You want a list of the names of all folders including sub-folders? If so, the approach to this is recursively looping through all items, checking if they are a folder and if they are, append their name to a list. For example:

def find_folders(parent: list) -> list:
    # List of names of all folders found
    folders = []

    for folder in parent:
        if folder['type'] == 'folder':
            folders.append(folder.get('name'))
            folders.extend(find_folders(folder.get('children', [])))

    return folders

folders = find_folders(data['bookmarks_tab']['children'])

# Names of all folders
print(folders)

# Number of folders
print(len(folders))

CodePudding user response:

You want something like this. Recursively look through the bookmarks - if you're dealing with a folder then pass it back into the same function, counting the results as you go. Presumably in this use case you won't run into an input-too-large issue.

def process_bookmarks(bookmarks: dict) -> list:
    if bookmarks.get("type") == "folder":
        return 1   sum(process_bookmarks(child) for child in bookmarks.get("children", []))
    else:
        return 0

If you want a list of all the 'folder' items, you can do something similarly:

def process_bookmarks(bookmarks: dict) -> list:
    folders = []
    children = bookmarks.pop("children", [])
    if bookmarks.get("type") == "folder":
        for child in children:
            folders.extend(process_bookmarks(child)
        folders.append(bookmarks)  # Note 'children' key has been removed!
    return folders

This will not return any of the child bookmarks - only the folders. It's easy enough to add an else clause though that appends a child to the output list.

If you're just interested in returning one element of the folder, you can do it one of two ways. Either this on the output of above:

names = [b.get("name") for b in process_bookmarks(bookmarks)]

Which reprocesses the entire list of returned folders, or:

def process_bookmarks(bookmarks: dict) -> List[str]:
    folders = []
    children = bookmarks.pop("children", [])
    if bookmarks.get("type") == "folder":
        for child in children:
            folders.extend(process_bookmarks(child)
        folders.append(bookmarks.get("name"))
    return folders

result = process_bookmarks(data.get("bookmarks_tab", dict()))

Which drops all the data besides the folder 'name'.

Some important notes:

  • Python has a maximum recursion depth. If your bookmarks are nested past this depth, this approach doesn't work.
  • extend() copies the original and extended list into a new list. This means if you start with a list of length 10 and extend it by 1 it takes 11 operations. Similarly, if you start with a list of length 10 and extend it by 10, it takes 20 operations. If you have a structure that has three nested folders, it will take 1 (1 1) (1 2) = 6 operations. For this application this probably doesn't matter - but it might if you are processing gigantic results.
  • If you just return dictionaries, note that your dictionaries will still include the original data. So if I have three nested folders, and I recurse one layer down and return in my result that dictionary appended to a list of the dictionary two layers down, the total data is three dictionaries worth. I mention this not for performance, but because the output looks like this:
dx = [
    {"a": [{"b": ["c": []]]},
    {"b": ["c": []]]},
    {"c": []},
]

Not this:

dx = [{"a": []},
    {"b": []},
    {"c": []},
]

If you go on to use this list for processing (that is, a list of dictionaries describing folders instead of a list of names), be cautious that this extra data doesn't confuse whatever algorithm you're running it through.

  • Related