Home > other >  Dictionary Comprehension Runtime Question
Dictionary Comprehension Runtime Question

Time:06-08

I have a question about how to possibly optimize a dictionary comprehension.

Setup:

I'm dealing with looping over XML minidom objects. Along with importing xml.dom.minidom I've defined a convenience function, getXMLDatum, in a custom package like so:

def getXMLDatum(container, elName):
        return container.getElementsByTagName(elName)[0].firstChild.nodeValue

getXMLDatum simply "returns the value between the brackets" of a given (parent, child) couple in the XML document.

I have a list of distinct accounts that's around 300 items in size

In [1] len(accountNums)
Out[1] 285

I have a list of XML minidom objects that are denoted as "charge details", each of which contain a reference to exactly one account in accountNums

In [2] getXMLDatum(chargeDetails[0], "SECONDARY_ACCT")
Out[2] '00018608'

There are around 1500 charge details:

In [3] len(chargeDetails)
Out[3] 1426

I'm trying to create a mapping where I can plug in an account and get all the charge details associated with it:

In [4] accountMap = { account : [cd for cd in chargeDetails if getXMLDatum(cd, "SECONDARY_ACCT") == account] for account in accountNums }

It is heinously slow; it takes around 30 seconds in iPython in powershell. Any ideas for how to speed it up?

CodePudding user response:

Almost certainly, you want something like:

account_map = {}
for cd in chargeDetails:
    account = getXMLDatum(cd, "SECONDARY_ACCT")
    if account not in account_map:
        account_map[account] = [cd]
    else:
        account_map[account].append(cd)

Which is the most naive way to group things using a dictionary. You could also use .setdefault or a collections.defaultdict, but the algorithm is fundamentally the same. But, for example:

account_map = {}
for cd in chargeDetails:
    account_map.setdefault(getXMLDatum(cd, "SECONDARY_ACCT"), []).append(cd)

would be another way to do it. But again, it's basically the same approach.

To contrast, here is the equivalent of what you were doing before with your dict/list comprehension:

accountMap = {}
for account in accountNums:
    temp_result = []
    for cd in chargeDetails:
        if getXMLDatum(cd, "SECONDARY_ACCT") == account:
            result.append(cd)
    accountMap[account] = temp_result

Notice, for every iteration that you do while you iterate over accountNums, you iterate over the entirety of chargeDetails, making it polynomial time. That is the issue.

Note, these will not be exactly equivalent if there are accounts that could get returned by getXMLDatum that aren't in accountNums, but you could just do something like:

account_map = {k:account_map[k] for k in (account_map.keys() & set(accountNums))}
  • Related