One-hot encode a column of integers into a NumPy matrix, including missing indices

Time:10-09

From the following NumPy array:

[5, 2, 4, 6, 3]

I'd like to get to the following matrix:

[
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0, 0]
]

Using Pandas get_dummies appears very simple:

pd.get_dummies(original_array).values

But it has one drawback: indices that never occur in the array (0 and 1 in this example) get no column in the final matrix.
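To make the drawback concrete, here is a small check, assuming the array from above (the variable names are only for illustration). Only the values that actually occur become columns:

```python
import numpy as np
import pandas as pd

original_array = np.array([5, 2, 4, 6, 3])

# get_dummies only creates columns for values present in the data
dummies = pd.get_dummies(original_array)
present_columns = list(dummies.columns)  # 0 and 1 are absent
matrix = dummies.values                  # 5 columns instead of the desired 7
```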

If we assume that the exact names/indices of the desired "columns" are known in advance (here, all integers from 0 to 6 inclusive), what would be the most efficient way to get to the matrix shown above, starting from the initial array?

CodePudding user response:

You can create a matrix of zeros and then use advanced indexing to assign 1 to the correct column in each row:

import numpy as np

a = [5, 2, 4, 6, 3]

# one row per value, one column per index from 0 to max(a)
ohe = np.zeros((len(a), max(a) + 1), dtype=int)
ohe[np.arange(len(a)), a] = 1

ohe
array([[0, 0, 0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0, 0, 0]])
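Note that max(a) + 1 only gives the right width when the largest index actually occurs in the data. Since the question says the columns are known in advance, a possible variant passes the column count explicitly. This is just a sketch: the helper name one_hot and its num_classes parameter are made up for illustration, and it assumes a 1-D input of non-negative integers.

```python
import numpy as np

def one_hot(indices, num_classes):
    """Return a (len(indices), num_classes) matrix of 0/1 values.

    num_classes is supplied by the caller rather than inferred from the data.
    """
    indices = np.asarray(indices)
    # reject values that would not fit into the requested columns
    if indices.size and (indices.min() < 0 or indices.max() >= num_classes):
        raise ValueError("index out of range for num_classes")
    out = np.zeros((indices.size, num_classes), dtype=int)
    # row i gets a 1 in column indices[i]
    out[np.arange(indices.size), indices] = 1
    return out

a = [5, 2, 4]        # 6 never appears, so max(a) + 1 would give only 6 columns
ohe = one_hot(a, 7)  # columns 0..6 are all present
```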

CodePudding user response:

Advanced indexing is your answer! Assuming you know your desired final shape (here, (5, 7)):

In [2]: import numpy as np

In [3]: desired_shape = (5, 7)

In [4]: z = np.zeros(desired_shape, dtype="uint8")

In [5]: z
Out[5]:
array([[0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]], dtype=uint8)

In [6]: idxs = [5, 2, 4, 6, 3]

In [7]: z[range(len(z)), idxs] = 1

In [8]: z
Out[8]:
array([[0, 0, 0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0, 0, 0]], dtype=uint8)
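If you would rather stay with get_dummies from the question, one option that should also work is to declare the full set of columns up front with a pandas Categorical, so that unused categories still get a column. A minimal sketch, assuming the known columns are the integers 0 to 6 as stated in the question:

```python
import pandas as pd

a = [5, 2, 4, 6, 3]

# declare every expected column, including ones that never occur (0 and 1)
cat = pd.Categorical(a, categories=list(range(7)))

# get_dummies emits one column per declared category
ohe = pd.get_dummies(cat).to_numpy(dtype=int)
```

For large arrays the plain NumPy indexing above is likely the lighter option, since it avoids building a DataFrame.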