I am attempting to filter a very long list of Python dictionaries. What I'm looking for is to append dicts from this list to a new list based on the values of their keys. An example of what I have is:
A list of 1000 dictionaries structured like this:
{'regions': ['north', 'south'],
 'age': 35,
 'name': 'john',
 'cars': ['ford', 'kia']}
I want to search through this list using almost any combination of keys and append the matching dicts to a new list. Sometimes I might be searching with only age, whereas other times I will be searching with regions and name, all the way up to searching with every key (age, name, regions, and cars), because all of the search parameters are optional.
I currently use for loops to filter it, but as I add more and more optional parameters, the code gets slower and more complex. Is there an easier way to accomplish what I am doing?
An example of what the user would send is:
/find regions:north age:10
And it would return a list of all dictionaries with north as a region and an age of 10.
CodePudding user response:
I made some code to do this:
test = [
    {
        'regions': ['south'],
        'age': 35,
        'name': 'john',
        'cars': ['ford']
    },
    {
        'regions': ['north'],
        'age': 15,
        'name': 'michael',
        'cars': ['kia']
    },
    {
        'regions': ['north', 'south'],
        'age': 20,
        'name': 'terry',
        'cars': ['ford', 'kia']
    },
    {
        'regions': ['East', 'south'],
        'age': 35,
        'name': 'user',
        'cars': ['other', 'kia']
    },
    {
        'regions': ['East', 'south'],
        'age': 75,
        'name': 'john',
        'cars': ['other']
    }
]
def Finder(inputs: list, regions: list = None, age: int = None, name: str = None, cars: list = None) -> list:
    output = []
    for record in inputs:  # 'record' avoids shadowing the built-in 'input'
        valid = True
        if regions is not None and valid:
            valid = all(i in record["regions"] for i in regions)
        if age is not None and valid:
            valid = record["age"] == age
        if name is not None and valid:
            valid = record["name"] == name
        if cars is not None and valid:
            valid = all(i in record["cars"] for i in cars)
        if valid:
            output.append(record)
    return output
print(Finder(test))
print(Finder(test, age = 25))
print(Finder(test, regions = ["East"]))
print(Finder(test, cars = ["ford","kia"]))
print(Finder(test, name = "john", regions = ["south"]))
This function just checks all parameters, determines whether each input is valid, and puts all the valid inputs in an output list.
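If you want the search to stay generic as you add keys, one option is to accept the criteria as keyword arguments instead of named parameters. A minimal sketch of that idea (the find name and the "list fields match by containment" rule are my assumptions, mirroring Finder above):

def find(records: list, **criteria) -> list:
    """Return records matching every criterion; list-valued fields
    match when they contain all of the requested values."""
    def matches(record: dict) -> bool:
        for key, wanted in criteria.items():
            have = record.get(key)
            if isinstance(have, list):
                wanted_items = wanted if isinstance(wanted, list) else [wanted]
                if not all(w in have for w in wanted_items):
                    return False
            elif have != wanted:
                return False
        return True
    return [record for record in records if matches(record)]

print(find(test, regions="north", age=20))  # -> terry's record

New keys then work automatically, with no change to the function signature.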
CodePudding user response:
I think this one is pretty open-ended, especially because you suggest that you want this to be extensible as you add more keys etc., and you haven't really discussed your operational requirements. But here are a few thoughts:
Third party modules
Are these dictionaries going to get any more nested? Or is it going to always be 'key -> value' or 'key -> [list, of, values]'?
If you can accept a chunky dependency, you might consider something like pandas, which we normally think of as representing tables, but which can certainly manage nesting to some degree.
For example:
from functools import partial
from typing import Dict

import pandas as pd

def matcher(comparator, target=None) -> bool:
    """
    matcher

    Checks whether a value matches or contains a target
    (won't match if the target is a substring of the value)
    """
    if target == comparator:  # simple case, return True immediately
        return True
    if isinstance(comparator, str):
        return False  # doesn't match exactly and is a string => no match
    try:  # handle looking in collections
        return target in comparator
    except TypeError:  # if it fails, you know there's no match
        return False

def search(data: pd.DataFrame, query: Dict) -> pd.DataFrame:
    """
    search

    Pass in a DataFrame and a query in the form of a dictionary
    of keys and values to match, for example:

        {"age": 42, "regions": "north", "cars": "ford"}

    Returns a matching subset of the data
    """
    # each element of the resulting list is a boolean series
    # corresponding to a dictionary key
    masks = [
        data[key].map(partial(matcher, target=value)) for key, value in query.items()
    ]
    # collapse the masks down to a single boolean series indicating
    # whether ALL conditions are met for each record
    mask = pd.concat(masks, axis="columns").all(axis="columns")
    return data.loc[mask]

if __name__ == "__main__":
    data = pd.DataFrame(your_big_list)  # your original list of dicts
    query = {"age": 35, "regions": "north", "cars": "ford"}
    results = search(data, query)
    list_results = results.to_dict(orient="records")
Here list_results would restore the filtered data to the original format, if that's important to you.
I found that the matcher function had to be surprisingly complicated; I kept thinking of edge cases (like: we need to support searching in a collection, but in can also find substrings, which isn't what we want ... unless it is, of course!).
But at least all that logic is walled off in there. You could write a series of unit tests for it, and if you extend your schema in future you can then alter the function accordingly and check the tests still pass.
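For instance, a few pytest-style cases covering those edge cases (the specific values here are just illustrative):

def test_matcher():
    assert matcher(35, target=35)                        # exact scalar match
    assert matcher(["north", "south"], target="north")   # membership in a list
    assert not matcher("north", target="nor")            # substrings must NOT match
    assert not matcher(["kia"], target="ford")           # absent from the collection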
The search function is then purely for nudging pandas into doing what you want with the matcher.
match case
In Python 3.10 the new match ... case statement (structural pattern matching) might allow you to encapsulate the matching logic very cleanly.
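For example, a minimal sketch (assuming Python 3.10+) of how a mapping pattern with a guard could express the /find regions:north age:10 query from the question:

def matches_query(record: dict) -> bool:
    # mapping patterns match on the presence of keys; the guard
    # handles membership in the list-valued 'regions' field
    match record:
        case {"regions": [*regions], "age": 10} if "north" in regions:
            return True
        case _:
            return False

# matching = [record for record in your_big_list if matches_query(record)]

Building such patterns dynamically from optional parameters is awkward, though, so this suits fixed, hard-coded queries best.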
Performance
A pretty fundamental issue here is that if you care about performance (and I got the sense that this was secondary to maintainability for you) then
- the bigger the data get, the slower things will be
- Python is already not fast, generally speaking
You could possibly improve things by building some sort of index for your data. Ultimately, however, it's always going to be more reliable to use a specialist tool. That's going to be some sort of database.
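By way of illustration, the index idea might look something like this minimal inverted-index sketch (the helper names and the one-entry-per-list-element convention are my assumptions):

from collections import defaultdict

def build_index(records: list) -> dict:
    """Map (key, value) pairs to the set of record positions holding them;
    list-valued fields contribute one entry per element."""
    index = defaultdict(set)
    for pos, record in enumerate(records):
        for key, value in record.items():
            values = value if isinstance(value, list) else [value]
            for v in values:
                index[(key, v)].add(pos)
    return index

def lookup(records: list, index: dict, **criteria) -> list:
    # intersect the position sets for every (key, value) criterion
    sets = [index[(key, value)] for key, value in criteria.items()]
    positions = set.intersection(*sets) if sets else set(range(len(records)))
    return [records[pos] for pos in sorted(positions)]

# index = build_index(your_big_list)
# lookup(your_big_list, index, regions="north", age=10)

The index costs memory and must be rebuilt whenever the data change, but each lookup then avoids a full scan.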
The precise details will depend on your requirements. e.g. are these data going to become horribly unstructured? Are any fields going to be text that will need to be properly indexed in something like Elasticsearch/Solr?
A really light-touch solution that you could implement in the short term with Python would be to
- chuck that data into SQLite
- rely on SQL for the searching
I am suggesting SQLite since it runs out of the box, using just a single local file:
from sqlalchemy import create_engine
engine = create_engine("sqlite:///mydb.sql")
# ... that's it, we can now connect to a SQLite DB/the 'mydb.sql' file that will be created
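To make the searching concrete, here is a sketch (assuming pandas and SQLAlchemy 1.4+, and storing only the scalar columns, since the list-valued ones run into the limitation below):

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///mydb.sql")

# load the scalar fields; 'regions' and 'cars' would need normalising first
pd.DataFrame(your_big_list)[["name", "age"]].to_sql(
    "customers", engine, if_exists="replace", index=False
)

with engine.connect() as conn:
    rows = conn.execute(
        text("SELECT name, age FROM customers WHERE age = :age"),
        {"age": 10},
    ).fetchall()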
The drawback, however, is that SQLite won't support array-like data. Your options are:
- use PostgreSQL instead and take the hit of running a DB with more firepower
- normalise those data
I don't think option 2 would be too difficult. Something like:
REGIONS
id | name
----------
1 | north
2 | south
3 | east
4 | west
CUSTOMERS
id | age | name
---------------
...
REGION_LINKS
customer_id | region_id
-----------------------
1 | 1
1 | 2
I've called the main data table 'customers' but you haven't mentioned what these data really represent, so that's more by way of example.
Then your SQL queries could get built and executed using sqlalchemy's ORM capabilities.
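For instance, a minimal sketch of that schema in sqlalchemy's declarative style (assuming SQLAlchemy 1.4+; the class and column names just mirror the example tables above):

from sqlalchemy import Column, ForeignKey, Integer, String, Table, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

# association table linking customers to regions
region_links = Table(
    "region_links",
    Base.metadata,
    Column("customer_id", ForeignKey("customers.id"), primary_key=True),
    Column("region_id", ForeignKey("regions.id"), primary_key=True),
)

class Region(Base):
    __tablename__ = "regions"
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True)

class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    age = Column(Integer)
    regions = relationship(Region, secondary=region_links)

engine = create_engine("sqlite:///mydb.sql")
Base.metadata.create_all(engine)

# e.g. the '/find regions:north age:10' query from the question
with Session(engine) as session:
    results = (
        session.query(Customer)
        .join(Customer.regions)
        .filter(Region.name == "north", Customer.age == 10)
        .all()
    )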