Map a dictionary onto a list of dictionaries: how to optimize this without for loops?

I have a list of objects with attributes; they can be treated as dictionaries.

For example, category_objects:

category_objects = [{"power": 10, "speed": 2, "control": 3}, {"power": 2, "control": 3, "speed": 10}, {"power": 5, "control": 3, "speed": -10}]

And a dictionary mapping each category name (str) to its column index (int):

CAT_TO_IDX = {"power":0, "speed":1, "control":2}

In reality the dictionaries have more keys, and the list is long.

Currently it's done in this way:

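categories = []
for category_object in category_objects:
    cats = []
    # Relies on the insertion order of CAT_TO_IDX matching its index values.
    for c in CAT_TO_IDX.keys():
        # The real objects expose these as attributes; use category_object[c] for plain dicts.
        cats.append(getattr(category_object, c))
    categories.append(cats)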

Desired result:

categories = [[10, 2, 3], [2, 10, 3], [5, -10, 3]]

That is, the values of each category object, ordered according to CAT_TO_IDX.

I have been trying to speed up the slow parts of our operations by vectorizing them, but I cannot figure this one out. I ended up with the code below, which is basically the same loop and gives little performance increase. I would like to find a way to vectorize it or do it quickly with NumPy operations. Any idea how?

The ugly replacement code I tried:

def _num_key_apply(self, category_object, keys):
    return np.vectorize(category_object.__dict__.get)(keys)

# dtype=str: the np.str alias was removed in NumPy 1.24.
keys = np.array([*CAT_TO_IDX.keys()], dtype=str)
categories = [self._num_key_apply(category_object, keys) for category_object in category_objects]

Thank you! Any guidance is appreciated

CodePudding user response：

Since you mention that you are already using Pandas to put these data into a dataframe, you might as well take advantage of Pandas in the first place. You likely won't beat the performance of your loop by much, but it's fewer steps, which is itself an improvement in maintainability:

import pandas as pd

desired_column_order = ["power", "speed", "control"]  # etc.
df = pd.DataFrame(category_objects, columns=desired_column_order)

Demos:

>>> pd.DataFrame(category_objects, columns=["control", "power", "speed"])
   control  power  speed
0        3     10      2
1        3      2     10
2        3      5    -10

>>> pd.DataFrame(category_objects, columns=["speed", "control", "power"])
   speed  control  power
0      2        3     10
1     10        3      2
2    -10        3      5

>>> pd.DataFrame(category_objects, columns=["power", "speed", "control"])
   power  speed  control
0     10      2        3
1      2     10        3
2      5    -10        3

From here, you can "vectorize" operations on the numerical data, since Pandas is backed by NumPy arrays. There is no "vectorization" for the loading of your data, though. Vectorization depends on something called SIMD -- single instruction, multiple data -- which works by loading multiple values into a register and then performing the same operation (like "multiply by 2" or "negate") on the entire register at once. That only works on contiguous, uniformly typed data such as an array; a list of Python dictionaries or objects has to be walked one element at a time before it gets into that form.
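
As a rough sketch of what working on the numerical data can look like (assuming the plain-dict sample data from the question, and ordering the columns by the index values in CAT_TO_IDX instead of relying on dict order), the frame can be converted to a single 2-D array and then operated on as a whole:

```python
import pandas as pd

# Sample data copied from the question.
category_objects = [
    {"power": 10, "speed": 2, "control": 3},
    {"power": 2, "control": 3, "speed": 10},
    {"power": 5, "control": 3, "speed": -10},
]
CAT_TO_IDX = {"power": 0, "speed": 1, "control": 2}

# Sort the category names by their target index.
columns = sorted(CAT_TO_IDX, key=CAT_TO_IDX.get)
df = pd.DataFrame(category_objects, columns=columns)

# One row per object, columns in CAT_TO_IDX order.
categories = df.to_numpy()

# An element-wise operation applied to the whole array at once.
doubled = categories * 2
```

Building the frame still visits every dictionary once; only the arithmetic afterwards runs on the array.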

np.vectorize() is a glorified for loop: the NumPy documentation itself notes that it is provided primarily for convenience, not for performance, so it offers no speed benefit over a standard Python for loop.
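
If you would rather stay in pure Python, one option is operator.itemgetter, which fetches all the keys of a row in one call. This is only a sketch, assuming the items really are plain dicts as in the sample; for attribute access, operator.attrgetter would be the analogue. It is still a loop over the list, just with less Python-level work per item:

```python
from operator import itemgetter

# Sample data copied from the question.
category_objects = [
    {"power": 10, "speed": 2, "control": 3},
    {"power": 2, "control": 3, "speed": 10},
    {"power": 5, "control": 3, "speed": -10},
]
CAT_TO_IDX = {"power": 0, "speed": 1, "control": 2}

# Category names sorted by their target index.
keys = sorted(CAT_TO_IDX, key=CAT_TO_IDX.get)

# With two or more keys, itemgetter returns a tuple of the looked-up values.
get_row = itemgetter(*keys)

categories = [list(get_row(obj)) for obj in category_objects]
```

Whether this is measurably faster on your data is something to check with timeit.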

CodePudding user response：

A defaultdict can streamline the regrouping by key:

In [44]: from collections import defaultdict
In [45]: dd=defaultdict(list)

In [47]: for d in category_objects:
    ...:     for k,v in d.items():
    ...:         dd[k].append(v)
    ...:         

In [48]: dd
Out[48]: 
defaultdict(list,
            {'power': [10, 2, 5], 'speed': [2, 10, -10], 'control': [3, 3, 3]})
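
The grouping above is column-wise (one list per key), whereas the question asks for one row per object. A possible way to get from one to the other, sketched here on the sample data from the question, is to pick the columns in CAT_TO_IDX order and transpose them with zip:

```python
from collections import defaultdict

# Sample data copied from the question.
category_objects = [
    {"power": 10, "speed": 2, "control": 3},
    {"power": 2, "control": 3, "speed": 10},
    {"power": 5, "control": 3, "speed": -10},
]
CAT_TO_IDX = {"power": 0, "speed": 1, "control": 2}

# Regroup the values by key, as in the session above.
dd = defaultdict(list)
for d in category_objects:
    for k, v in d.items():
        dd[k].append(v)

# Select the per-key lists in index order, then transpose columns into rows.
columns = [dd[k] for k in sorted(CAT_TO_IDX, key=CAT_TO_IDX.get)]
categories = [list(row) for row in zip(*columns)]
```

This assumes every dictionary contains every key; if some are missing, the per-key lists end up with different lengths and zip silently truncates to the shortest.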