python: Dictionary counter items. How to optimize performance-CodePudding

I need to keep track of occurrences of items in parsed files in a structure as follows:

data= {'tomate': {'John': 2, 'mike': 2},
 'pasta': {'mike': 1},
 'wine': {'mike': 1, 'alex': 2}}

the dictionary starts as empty one. data = {} and its is being populated with values when they are not there and adding up users 1 if the user is there or =1 if the user is not yet there.

This is my code:

def modify(data, key, client):
try:
    # if both keys are there
    data[key][client] = data[key][client]  1
    #print(f"{key} is there. {client} is there.")
except:
    try: 
        data[key][client] = 1
        #print(f"{client} not there, but {key}")
    except:
        data[key]={}
        data[key][client] = 1
        #print(f"{client} and {key} added")
return data

It works:

key="wine"
client = "alex"
modify(d, key, client)

giving:

{'tomate': {'John': 2, 'mike': 2},
 'pasta': {'mike': 1},
 'wine': {'mike': 1, 'alex': 3}}

The question is if using try/except is not the way to go, i.e. not pythonic, because I am relying on exceptions for the code to work properly which I feel is a little bit weird.

Is there any other option to keep track of a counter like this that I might have a look to? Peformance is very important of course since I have to count hundreds of millions of times.

EDIT 1: Evaluating the speed of a proposed solution I have this comparison that strikes me since my solution is faster (and I would not expect that):

CodePudding user response：

One step to a more pythonic solution is to catch exception that you want to catch (instead of any exception):

try:
   ...
except KeyError:
   ...

Second, you can check whether the element is in the dictionary instead of trying:

if key not in data:
    data[key] = {}
if client not in data[key]:
    data[key][client] = 0
data[ket][client]  = 1

You can also use defaultdict that does these checks for you. In my experience and experiments all these solutions a very close in terms of execution time so choose which one looks better for you.

CodePudding user response：

You can rewrite the function using if-else or using dict.setdefault and get rid of Exceptions:

def modify(data, key, client):
    data.setdefault(key, {}).setdefault(client, 0)
    data[key][client]  = 1


data = {}

modify(data, "tomate", "John")
modify(data, "tomate", "John")
modify(data, "tomate", "mike")
modify(data, "tomate", "mike")
modify(data, "pasta", "mike")
modify(data, "wike", "mike")
modify(data, "wike", "alex")
modify(data, "wike", "alex")

print(data)

Prints:

{
    "tomate": {"John": 2, "mike": 2},
    "pasta": {"mike": 1},
    "wike": {"mike": 1, "alex": 2},
}

Using if-else:

def modify(data, key, client):
    if key not in data:
        data[key] = {}

    if client in data[key]:
        data[key][client]  = 1
    else:
        data[key][client] = 1

EDIT: One-liner version:

def modify(data, key, client):
    x[client] = (x := data.setdefault(key, {})).get(client, 0)   1

CodePudding user response：

With try/except version, the search for index is done only once (most of the time). It serves both the purpose of checking the presence of index, and altering the associated value. That is why your version is the fastest so far. Last one-liner from Andrej is very fast also. Especially if dealing with lot of new values (for which your try/except will have several indexation trials, whereas his use 3 indexation each times).

One notable, seemingly minor but important improvement to your version could be:

def modify(data, key, client):
try:
    # if both keys are there
    data[key][client]  = 1
    #print(f"{key} is there. {client} is there.")
except:
    try: 
        data[key][client] = 1
        #print(f"{client} not there, but {key}")
    except:
        data[key]={}
        data[key][client] = 1
        #print(f"{client} and {key} added")
return data

Yes, just that = changes quite a lot. Especially with lot of "updates" (I mean, not new index). Because x=x 1 means that python needs to figure out what is x twice. Whereas x =1 finds the l-value only once. Which is roughly the same usually. Except when finding the place itself is a non-negligible part of the cost. As it is here, with indexation