I have a question about how to possibly optimize a dictionary comprehension.
Setup:
I'm dealing with looping over XML minidom objects. Along with importing xml.dom.minidom I've defined a convenience function, getXMLDatum, in a custom package like so:
def getXMLDatum(container, elName):
return container.getElementsByTagName(elName)[0].firstChild.nodeValue
getXMLDatum simply "returns the value between the brackets" of a given (parent, child) couple in the XML document.
I have a list of distinct accounts that's around 300 items in size
In [1] len(accountNums)
Out[1] 285
I have a list of XML minidom objects that are denoted as "charge details", each of which contain a reference to exactly one account in accountNums
In [2] getXMLDatum(chargeDetails[0], "SECONDARY_ACCT")
Out[2] '00018608'
There are around 1500 charge details:
In [3] len(chargeDetails)
Out[3] 1426
I'm trying to create a mapping where I can plug in an account and get all the charge details associated with it:
In [4] accountMap = { account : [cd for cd in chargeDetails if getXMLDatum(cd, "SECONDARY_ACCT") == account] for account in accountNums }
It is heinously slow; it takes around 30 seconds in iPython in powershell. Any ideas for how to speed it up?
CodePudding user response:
Almost certainly, you want something like:
account_map = {}
for cd in chargeDetails:
account = getXMLDatum(cd, "SECONDARY_ACCT")
if account not in account_map:
account_map[account] = [cd]
else:
account_map[account].append(cd)
Which is the most naive way to group things using a dictionary. You could also use .setdefault
or a collections.defaultdict
, but the algorithm is fundamentally the same. But, for example:
account_map = {}
for cd in chargeDetails:
account_map.setdefault(getXMLDatum(cd, "SECONDARY_ACCT"), []).append(cd)
would be another way to do it. But again, it's basically the same approach.
To contrast, here is the equivalent of what you were doing before with your dict
/list
comprehension:
accountMap = {}
for account in accountNums:
temp_result = []
for cd in chargeDetails:
if getXMLDatum(cd, "SECONDARY_ACCT") == account:
result.append(cd)
accountMap[account] = temp_result
Notice, for every iteration that you do while you iterate over accountNums
, you iterate over the entirety of chargeDetails
, making it polynomial time. That is the issue.
Note, these will not be exactly equivalent if there are accounts that could get returned by getXMLDatum
that aren't in accountNums
, but you could just do something like:
account_map = {k:account_map[k] for k in (account_map.keys() & set(accountNums))}