I have a list of dictionaries, like this:
[{'user': '123456', 'db': 'db1', 'size': '8628'},
 {'user': '123456', 'db': 'db1', 'size': '7168'},
 {'user': '123456', 'db': 'db1', 'size': '38160'},
 {'user': '222345', 'db': 'db3', 'size': '8628'},
 {'user': '222345', 'db': 'db3', 'size': '8628'},
 {'user': '222345', 'db': 'db5', 'size': '840'},
 {'user': '34521', 'db': 'db6', 'size': '12288'},
 {'user': '34521', 'db': 'db6', 'size': '476'},
 {'user': '2345156', 'db': 'db7', 'size': '5120'}, ...]
This list contains millions of entries. Each user can appear in multiple dbs, and each user can have multiple entries in the same db. I want to sum up the size occupied by each user, per db. I don't want to use pandas. At the moment I do it this way:
- I create 2 lists of unique users and unique dbs
- Use those lists to iterate through the big list and sum up where user and db are the same
result = []
for user in unique_users:
    for db in unique_dbs:
        total_size = 0
        for i in big_list:
            if i['user'] == user and i['db'] == db:
                total_size += float(i['size'])
        if total_size > 0:
            row = {}
            row['user'] = user
            row['db'] = db
            row['size'] = total_size
            result.append(row)
The problem is that this triple for loop turns into something very large (hundreds of billions of iterations), which takes forever to sum up the result. If big_list is small, this works very well.
How should I approach this in order to keep it fast and simple? Thanks a lot!
CodePudding user response:
Try mapping user to db to the total size in a nested dictionary. It requires additional memory, but it is faster to access and needs only a single pass through the data:
user_to_db_to_size = {}

for entry in big_list:
    user = entry['user']
    db = entry['db']
    size = int(entry['size'])
    if user not in user_to_db_to_size:
        user_to_db_to_size[user] = {}
    if db not in user_to_db_to_size[user]:
        user_to_db_to_size[user][db] = 0
    user_to_db_to_size[user][db] += size

print(user_to_db_to_size)
For your sample data it produces:
{'123456': {'db1': 53956}, '222345': {'db3': 17256, 'db5': 840}, '34521': {'db6': 12764}, '2345156': {'db7': 5120}}
And now you can access the total size per user/db with:
print(user_to_db_to_size['123456']['db1']) # 53956
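If you need the result in the same list-of-rows shape as your original result, the nested dictionary can be flattened afterwards. A minimal sketch, assuming user_to_db_to_size was built as above:

result = []
for user, db_sizes in user_to_db_to_size.items():
    for db, size in db_sizes.items():
        result.append({'user': user, 'db': db, 'size': size})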
CodePudding user response:
There are two main issues with the current approach: an inefficient algorithm and an inefficient data structure.
The first is that the algorithm is clearly inefficient, as it iterates many times over the big list. There is no need to scan the whole list for every unique user and db. You can iterate over the big list once and aggregate the data using a dictionary. The key of the target dictionary is simply a (user, db) tuple, and the value is total_size. Here is an untested example:
# Aggregation part
# Note: a defaultdict can be used instead to make the code possibly simpler
aggregate_dict = dict()

for i in big_list:
    key = (i['user'], i['db'])
    value = float(i['size'])
    if key in aggregate_dict:
        aggregate_dict[key] += value
    else:
        aggregate_dict[key] = value

# Fast creation of `result`
result = []

for user in unique_users:
    for db in unique_dbs:
        total_size = aggregate_dict.get((user, db))
        if total_size is not None and total_size > 0:
            result.append({'user': user, 'db': db, 'size': total_size})
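As the comment above hints, the aggregation can be written more compactly with collections.defaultdict. A minimal sketch of that variant, assuming the same big_list:

from collections import defaultdict

aggregate_dict = defaultdict(float)

for i in big_list:
    aggregate_dict[(i['user'], i['db'])] += float(i['size'])

result = [{'user': user, 'db': db, 'size': size}
          for (user, db), size in aggregate_dict.items()]

This variant also avoids building the unique_users and unique_dbs lists entirely, since the keys of aggregate_dict already enumerate every (user, db) pair that actually occurs.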
The other issue is the inefficient data structure: the keys are replicated in every row, while tuples could be used instead. In fact, a better data structure is a dictionary of (column, items) key-value pairs, where items is the list of values for that column. This way of storing data is called a dataframe. This is roughly what Pandas uses internally (except it uses a Numpy array, which is even better, as it is more compact and generally more efficient than a list for most operations). Using this data structure for both the input and the output should result in a significant speed-up (if combined with Numpy) and a lower memory footprint. A rough sketch of that columnar layout is shown below.
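For illustration, a minimal sketch of the columnar layout and the same single-pass aggregation over it. This is an assumption about how the row-oriented big_list would be converted, not code from the answer; plain lists are used here, and a Numpy array per column would be the next step:

# Columnar layout: one list per column instead of one dict per row
columns = {
    'user': [row['user'] for row in big_list],
    'db':   [row['db'] for row in big_list],
    'size': [float(row['size']) for row in big_list],
}

# Same single-pass aggregation, reading the columns in parallel
totals = {}
for user, db, size in zip(columns['user'], columns['db'], columns['size']):
    totals[(user, db)] = totals.get((user, db), 0.0) + size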