Pandas lookup table using two arrays (index from the first array, column from the second array)-CodePudding

I'm struggling to vectorize a code that uses pandas lookup table, where the index is selected by the values from the first array and the column is selected by the values from the second array.

Let's say I have two numpy arrays a and b (they have the same shape):

codes = np.random.randint(1000, size=(4))
a_idx = np.random.randint(4, size=(6, 6))
a = codes[a_idx]
a[2, 1] = -999
a
Out[267]: 
array([[ 310,  310,   52,  310,  218,  310],
       [ 687,  310,  218,  310,  687,  687],
       [ 218, -999,  310,  218,   52,  687],
       [ 218,  218,  687,   52,  687,  310],
       [  52,  687,  687,   52,  687,  218],
       [  52,  218,   52,  687,  310,   52]])

b = np.random.randint(5, size=(6, 6))
b
Out[269]: 
array([[2, 4, 3, 2, 0, 4],
       [2, 4, 4, 2, 1, 0],
       [0, 0, 1, 1, 2, 0],
       [2, 2, 2, 2, 2, 1],
       [4, 1, 3, 1, 1, 2],
       [0, 3, 2, 2, 3, 0]])

I also have a pandas lookup table:

lookup = pd.DataFrame({'A': np.arange(1, 5),
                      'B': np.arange(11, 15),
                      'C': np.arange(21, 25)}, index=codes)
lookup.loc[-999] = 0
lookup
Out[275]: 
      A   B   C
 310  1  11  21
 687  2  12  22
 218  3  13  23
 52   4  14  24
-999  0   0   0

I have created a dictionary for pandas column names (different numbers can have the same letter):

b_dict = {0: 'A', 1: 'B', 2: 'C', 3: 'B', 4:'A'}

I want to create a third array from the lookup table, where the index is selected by the value from array a and the column is selected from the array b (with the help of b_dict). This is how it would be done by nested for loops:

res = np.empty_like(a)
for i, (row_a, row_b) in enumerate(zip(a, b)):
    for j, (aij, bij) in enumerate(zip(row_a, row_b)):
        res[i, j] = lookup.loc[aij, b_dict[bij]]

This would be the desired result:

res
Out[276]: 
array([[21,  1, 14, 21,  3,  1],
       [22,  1,  3, 21, 12,  2],
       [ 3,  0, 11, 13, 24,  2],
       [23, 23, 22, 24, 22, 11],
       [ 4, 12, 12, 14, 12, 23],
       [ 4, 13, 24, 22, 11,  4]])

Is there a faster (vectorized) way of doing this for large arrays using numpy or pandas, I would like to avoid nested loops?

EDIT: I changed the example to be more closer to the real problem.

CodePudding user response：

As the comments indicate, lookup.values[a, b] yields the exact same results as res is after the loop fills it.

So essentially instead of this:

res = np.empty_like(a)
for i, (row_a, row_b) in enumerate(zip(a, b)):
    for j, (aij, bij) in enumerate(zip(row_a, row_b)):
        res[i, j] = lookup.loc[aij, b_dict[bij]]

You can do this:

res = lookup.values[a, b]

Output:

>>> res
array([[ 4, 21,  4],
       [ 3,  4, 23],
       [23,  2, 11],
       [ 1, 13,  4]])

>>> lookup.values[a, b]
array([[ 4, 21,  4],
       [ 3,  4, 23],
       [23,  2, 11],
       [ 1, 13,  4]])

>>> res == lookup.values[a, b]
array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

CodePudding user response：

You can flatten both arrays and use DataFrame.lookup to perform a label-based lookup, and reshape the result to the original shape of a and b

row_labels = a.ravel()
col_labels = pd.Series(b_dict)[b.ravel()].to_numpy()   
res = lookup.lookup(row_labels, col_labels).reshape(a.shape)

Using the same setup that your example

a = np.array([[ 310,  310,   52,  310,  218,  310],
              [ 687,  310,  218,  310,  687,  687],
              [ 218, -999,  310,  218,   52,  687],
              [ 218,  218,  687,   52,  687,  310],
              [  52,  687,  687,   52,  687,  218],
              [  52,  218,   52,  687,  310,   52]])

b = np.array([[2, 4, 3, 2, 0, 4],
              [2, 4, 4, 2, 1, 0],
              [0, 0, 1, 1, 2, 0],
              [2, 2, 2, 2, 2, 1],
              [4, 1, 3, 1, 1, 2],
              [0, 3, 2, 2, 3, 0]])

b_dict = {0: 'A', 1: 'B', 2: 'C', 3: 'B', 4:'A'}

lookup = pd.DataFrame({'A': [1, 2, 3, 4, 0], 
                       'B': [11, 12, 13, 14, 0], 
                       'C': [21, 22, 23, 24, 0]},
                       index=[310, 687, 218, 52, -999])

Output

>>> row_labels

array([ 310,  310,   52,  310,  218,  310,  687,  310,  218,  310,  687,
        687,  218, -999,  310,  218,   52,  687,  218,  218,  687,   52,
        687,  310,   52,  687,  687,   52,  687,  218,   52,  218,   52,
        687,  310,   52])

>>> col_labels

array(['C', 'A', 'B', 'C', 'A', 'A', 'C', 'A', 'A', 'C', 'B', 'A', 'A',
       'A', 'B', 'B', 'C', 'A', 'C', 'C', 'C', 'C', 'C', 'B', 'A', 'B',
       'B', 'B', 'B', 'C', 'A', 'B', 'C', 'C', 'B', 'A'], dtype=object)

>>> res

array([[21,  1, 14, 21,  3,  1],
       [22,  1,  3, 21, 12,  2],
       [ 3,  0, 11, 13, 24,  2],
       [23, 23, 22, 24, 22, 11],
       [ 4, 12, 12, 14, 12, 23],
       [ 4, 13, 24, 22, 11,  4]])