I am attempting to filter a very long list of Python dictionaries. What I'm looking for is to append dicts from this list to a new list based on the values of their keys. An example of what I have is:
A list of 1000 dictionaries structured like this:
{'regions': ['north', 'south'],
 'age': 35,
 'name': 'john',
 'cars': ['ford', 'kia']}
I want to search through this list using almost any combination of keys and append the matching dicts to a new list. Sometimes I might be searching with only age, whereas other times I will be searching with regions and name, all the way up to searching with every key (age, name, regions, and cars), because all of the search parameters are optional.
I currently use for loops to filter it, but as I add more and more optional parameters, the code gets slower and more complex. Is there an easier way to accomplish what I am doing?
An example of what the user would send is:
/find regions:north age:10
And it would return a list of all dictionaries with north as a region and an age of 10.
CodePudding user response:
I made some code to do this:
test = [
    {
        'regions': ['south'],
        'age': 35,
        'name': 'john',
        'cars': ['ford']
    },
    {
        'regions': ['north'],
        'age': 15,
        'name': 'michael',
        'cars': ['kia']
    },
    {
        'regions': ['north', 'south'],
        'age': 20,
        'name': 'terry',
        'cars': ['ford', 'kia']
    },
    {
        'regions': ['East', 'south'],
        'age': 35,
        'name': 'user',
        'cars': ['other', 'kia']
    },
    {
        'regions': ['East', 'south'],
        'age': 75,
        'name': 'john',
        'cars': ['other']
    }
]
def Finder(inputs: list, regions: list = None, age: int = None, name: str = None, cars: list = None) -> list:
    output = []
    for record in inputs:  # 'record' avoids shadowing the built-in 'input'
        valid = True
        if regions is not None and valid:
            valid = all(i in record["regions"] for i in regions)
        if age is not None and valid:
            valid = record["age"] == age
        if name is not None and valid:
            valid = record["name"] == name
        if cars is not None and valid:
            valid = all(i in record["cars"] for i in cars)
        if valid:
            output.append(record)
    return output
print(Finder(test))
print(Finder(test, age = 25))
print(Finder(test, regions = ["East"]))
print(Finder(test, cars = ["ford","kia"]))
print(Finder(test, name = "john", regions = ["south"]))
This function just checks all parameters, determines whether each input is valid, and puts all the valid inputs in an output list.
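If you want the search to stay generic as you add keys, one option is to accept the criteria as keyword arguments instead of named parameters. A minimal sketch of that idea (the find name and the "list fields match by containment" rule are my assumptions, mirroring Finder above):

def find(records: list, **criteria) -> list:
    """Return records matching every criterion; list-valued fields
    match when they contain all of the requested values."""
    def matches(record: dict) -> bool:
        for key, wanted in criteria.items():
            have = record.get(key)
            if isinstance(have, list):
                wanted_items = wanted if isinstance(wanted, list) else [wanted]
                if not all(w in have for w in wanted_items):
                    return False
            elif have != wanted:
                return False
        return True
    return [record for record in records if matches(record)]

print(find(test, regions="north", age=20))  # -> terry's record

New keys then work automatically, with no change to the function signature.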
CodePudding user response:
I think this one is pretty open-ended, especially because you suggest that you want this to be extensible as you add more keys etc., and you haven't really discussed your operational requirements. But here are a few thoughts:
Third party modules
Are these dictionaries going to get any more nested? Or is it going to always be 'key -> value' or 'key -> [list, of, values]'?
If you can accept a chunky dependency, you might consider something like pandas, which we normally think of as representing tables, but which can certainly manage nesting to some degree.
For example:
from functools import partial
from typing import Dict

import pandas as pd

def matcher(comparator, target=None) -> bool:
    """
    matcher

    Checks whether a value matches or contains a target
    (won't match if the target is a substring of the value)
    """
    if target == comparator:  # simple case, return True immediately
        return True
    if isinstance(comparator, str):
        return False  # doesn't match exactly and is a string => no match
    try:  # handle looking in collections
        return target in comparator
    except TypeError:  # if it fails, you know there's no match
        return False

def search(data: pd.DataFrame, query: Dict) -> pd.DataFrame:
    """
    search

    Pass in a DataFrame and a query in the form of a dictionary
    of keys and values to match, for example:

        {"age": 42, "regions": "north", "cars": "ford"}

    Returns a matching subset of the data
    """
    # each element of the resulting list is a boolean series
    # corresponding to a dictionary key
    masks = [
        data[key].map(partial(matcher, target=value)) for key, value in query.items()
    ]
    # collapse the masks down to a single boolean series indicating
    # whether ALL conditions are met for each record
    mask = pd.concat(masks, axis="columns").all(axis="columns")
    return data.loc[mask]

if __name__ == "__main__":
    data = pd.DataFrame(your_big_list)  # your original list of dicts
    query = {"age": 35, "regions": "north", "cars": "ford"}
    results = search(data, query)
    list_results = results.to_dict(orient="records")
Here list_results would restore the filtered data to the original format, if that's important to you.
I found that the matcher function had to be surprisingly complicated; I kept thinking of edge cases (like: we need to support searching in a collection, but in can also find substrings, which isn't what we want ... unless it is, of course!).
But at least all that logic is walled off in there. You could write a series of unit tests for it, and if you extend your schema in future you can then alter the function accordingly and check the tests still pass.
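For instance, a few pytest-style cases covering those edge cases (the specific values here are just illustrative):

def test_matcher():
    assert matcher(35, target=35)                        # exact scalar match
    assert matcher(["north", "south"], target="north")   # membership in a list
    assert not matcher("north", target="nor")            # substrings must NOT match
    assert not matcher(["kia"], target="ford")           # absent from the collection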
The search function is then purely for nudging pandas into doing what you want with the matcher.
match case
In Python 3.10 the new match ... case statement (structural pattern matching) might allow you to encapsulate the matching logic very cleanly.
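For example, a minimal sketch (assuming Python 3.10+) of how a mapping pattern with a guard could express the /find regions:north age:10 query from the question:

def matches_query(record: dict) -> bool:
    # mapping patterns match on the presence of keys; the guard
    # handles membership in the list-valued 'regions' field
    match record:
        case {"regions": [*regions], "age": 10} if "north" in regions:
            return True
        case _:
            return False

# matching = [record for record in your_big_list if matches_query(record)]

Building such patterns dynamically from optional parameters is awkward, though, so this suits fixed, hard-coded queries best.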
Performance
A pretty fundamental issue here is that if you care about performance (and I got the sense that this was secondary to maintainability for you) then
- the bigger the data get, the slower things will be
- Python is already not fast, generally speaking
You could possibly improve things by building some sort of index for your data. Ultimately, however, it's always going to be more reliable to use a specialist tool. That's going to be some sort of database.
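By way of illustration, the index idea might look something like this minimal inverted-index sketch (the helper names and the one-entry-per-list-element convention are my assumptions):

from collections import defaultdict

def build_index(records: list) -> dict:
    """Map (key, value) pairs to the set of record positions holding them;
    list-valued fields contribute one entry per element."""
    index = defaultdict(set)
    for pos, record in enumerate(records):
        for key, value in record.items():
            values = value if isinstance(value, list) else [value]
            for v in values:
                index[(key, v)].add(pos)
    return index

def lookup(records: list, index: dict, **criteria) -> list:
    # intersect the position sets for every (key, value) criterion
    sets = [index[(key, value)] for key, value in criteria.items()]
    positions = set.intersection(*sets) if sets else set(range(len(records)))
    return [records[pos] for pos in sorted(positions)]

# index = build_index(your_big_list)
# lookup(your_big_list, index, regions="north", age=10)

The index costs memory and must be rebuilt whenever the data change, but each lookup then avoids a full scan.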
The precise details will depend on your requirements. e.g. are these data going to become horribly unstructured? Are any fields going to be text that will need to be properly indexed in something like Elasticsearch/Solr?
A really light-touch solution that you could implement in the short term with Python would be to
- chuck that data into SQLite
- rely on SQL for the searching
I am suggesting SQLite since it runs out of the box, using just a single local file:
from sqlalchemy import create_engine
engine = create_engine("sqlite:///mydb.sql")
# ... that's it, we can now connect to a SQLite DB/the 'mydb.sql' file that will be created
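To make the searching concrete, here is a sketch (assuming pandas and SQLAlchemy 1.4+, and storing only the scalar columns, since the list-valued ones run into the limitation below):

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///mydb.sql")

# load the scalar fields; 'regions' and 'cars' would need normalising first
pd.DataFrame(your_big_list)[["name", "age"]].to_sql(
    "customers", engine, if_exists="replace", index=False
)

with engine.connect() as conn:
    rows = conn.execute(
        text("SELECT name, age FROM customers WHERE age = :age"),
        {"age": 10},
    ).fetchall()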
The drawback, however, is that SQLite won't support array-like data. Your options are:
- use PostgreSQL instead and take the hit of running a DB with more firepower
- normalise those data
I don't think option 2 would be too difficult. Something like:
REGIONS
id | name
----------
1 | north
2 | south
3 | east
4 | west
CUSTOMERS
id | age | name
---------------
...
REGION_LINKS
customer_id | region_id
-----------------------
1 | 1
1 | 2
I've called the main data table 'customers' but you haven't mentioned what these data really represent, so that's more by way of example.
Then your SQL queries could get built and executed using sqlalchemy's ORM capabilities.
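For instance, a minimal sketch of that schema in sqlalchemy's declarative style (assuming SQLAlchemy 1.4+; the class and column names just mirror the example tables above):

from sqlalchemy import Column, ForeignKey, Integer, String, Table, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

# association table linking customers to regions
region_links = Table(
    "region_links",
    Base.metadata,
    Column("customer_id", ForeignKey("customers.id"), primary_key=True),
    Column("region_id", ForeignKey("regions.id"), primary_key=True),
)

class Region(Base):
    __tablename__ = "regions"
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True)

class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    age = Column(Integer)
    regions = relationship(Region, secondary=region_links)

engine = create_engine("sqlite:///mydb.sql")
Base.metadata.create_all(engine)

# e.g. the '/find regions:north age:10' query from the question
with Session(engine) as session:
    results = (
        session.query(Customer)
        .join(Customer.regions)
        .filter(Region.name == "north", Customer.age == 10)
        .all()
    )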