How to convert a pandas DataFrame into one-hot encoded?

Time:02-26

Assume I have a DataFrame with a million rows. Each row represents one shopper, and the number in each cell is an item code. There are approximately 250 items in the database. A toy table looks like the following:

import pandas as pd
import numpy as np 

df = pd.DataFrame({'item1':[10, 10, 22, 89],
                   'item2':[15, 35, 33, 103],
                   'item3':[np.nan, 65, 47, 41],
                   'item4':[np.nan, np.nan, 10, 22]})
df
item1  item2  item3  item4
   10     15    NaN    NaN
   10     35     65    NaN
   22     33     47     10
   89    103     41     22

The goal is to convert the above table into a one-hot encoded table/DataFrame (each row still represents one shopper) such as

1 ... 10 ... 15 ... 250
0 0 1 ... 1 ... 0
0 0 1 ... 0 ... 0

Thus, the final data frame shape is something like (1000000, 250). Is there a way to convert a DataFrame into a one-hot encoded table quickly?

CodePudding user response:

Use sklearn's OneHotEncoder:

  • Set sparse_output=False since you want dense 2D output (the argument was called sparse before scikit-learn 1.2)
  • fillna with some numeric value (e.g., -1) and drop that column afterwards
  • transpose, groupby(level=0).sum() and transpose back to aggregate the duplicate columns (thanks to @enke)

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
e = encoder.fit_transform(df.fillna(-1))

out = (
    pd.DataFrame(e,
        columns=np.hstack(encoder.categories_).astype(int),
        index=df.index,  # rows are shoppers, not the item1..item4 columns
        dtype=int)
    .drop(columns=[-1])           # remove the placeholder used for NaN
    .T.groupby(level=0).sum().T   # merge columns that share an item code
)

Output (one row per shopper):

   10   15   22   33   35   41   47   65   89   103
0   1    1    0    0    0    0    0    0    0     0
1   1    0    0    0    1    0    0    1    0     0
2   1    0    1    1    0    0    1    0    0     0
3   0    0    1    0    0    1    0    0    1     1

CodePudding user response:

The melt method might be useful.

Code:

# Solution 1

import numpy as np 
import pandas as pd

# Create the sample dataframe
df = pd.DataFrame({'item1':[10, 10, 22, 89], 'item2':[15, 35, 33, 103], 'item3':[np.nan, 65, 47, 41], 'item4':[np.nan, np.nan, 10, 22]})

# Transform the df into one-hot-encoding
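df = (
    df.melt(ignore_index=False)   # long format: one (shopper, item code) pair per row
      .reset_index()
      .pivot_table(values='variable', index='index', columns='value', aggfunc='count')
      .fillna(0)
      .astype(int)                # integer counts instead of floats
)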

print(df)

# Solution 2

import numpy as np 
import pandas as pd

# Create the sample dataframe
df = pd.DataFrame({'item1':[10, 10, 22, 89], 'item2':[15, 35, 33, 103], 'item3':[np.nan, 65, 47, 41], 'item4':[np.nan, np.nan, 10, 22]})

# Transform the df into one-hot-encoding
# dtype=int gives 0/1 instead of booleans (the default in pandas 2.0 and later)
df = pd.get_dummies(df.melt(ignore_index=False).value, dtype=int).groupby(level=0).max()

print(df)

Output:

10.0  15.0  22.0  33.0  35.0  41.0  47.0  65.0  89.0  103.0
   1     1     0     0     0     0     0     0     0      0
   1     0     0     0     1     0     0     1     0      0
   1     0     1     1     0     0     1     0     0      0
   0     0     1     0     0     1     0     0     1      1
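
The outputs above only contain columns for item codes that actually occur, whereas the question asks for a fixed width of 250. A minimal sketch of one way to pad the result, assuming (this is not stated in the thread) that item codes are the integers 1 to 250; `N_ITEMS` and `pairs` are names chosen here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'item1': [10, 10, 22, 89],
                   'item2': [15, 35, 33, 103],
                   'item3': [np.nan, 65, 47, 41],
                   'item4': [np.nan, np.nan, 10, 22]})

# Long format: one item code per row, indexed by shopper; NaN cells dropped.
pairs = df.melt(ignore_index=False)['value'].dropna().astype(int)

# One indicator column per observed code, collapsed back to one row per shopper.
onehot = pd.get_dummies(pairs, dtype=int).groupby(level=0).max()

# Assumption: valid item codes are exactly 1..N_ITEMS.
N_ITEMS = 250
onehot = onehot.reindex(columns=range(1, N_ITEMS + 1), fill_value=0)

print(onehot.shape)
```

Codes outside the assumed range would be silently dropped by `reindex`, so it may be worth checking `pairs.min()` and `pairs.max()` first.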

CodePudding user response:

IIUC, in the original DataFrame, the rows already represent shoppers, right? Then we could convert each entry in df to strings and use pd.get_dummies; then sum across to get a single column for each item:

out = pd.get_dummies(df.astype(str))
out.columns = out.columns.str.split('_').str[1].str.split('.').str[0]
# groupby(axis=1) is deprecated in recent pandas, so group on the transposed frame instead
out = out.drop(columns='nan').T.groupby(level=0).sum().T

Output:

   10  103  15  22  33  35  41  47  65  89
0   1    0   1   0   0   0   0   0   0   0
1   1    0   0   0   0   1   0   0   1   0
2   1    0   0   1   1   0   0   1   0   0
3   0    1   0   1   0   0   1   0   0   1
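
For the million-row case in the question, it may also be worth trying plain NumPy indexing, which avoids building intermediate string or long-format frames. This is a sketch that does not come from any of the answers; `one_hot_numpy` is a hypothetical helper, and it assumes item codes are integers in 1..250:

```python
import numpy as np
import pandas as pd

def one_hot_numpy(df, n_items=250):
    """Hypothetical helper: assumes integer item codes in 1..n_items."""
    values = df.to_numpy(dtype=float)
    # Positions of the cells that hold an item code (i.e. are not NaN).
    rows, cols = np.nonzero(~np.isnan(values))
    codes = values[rows, cols].astype(int)
    # uint8 keeps a (1_000_000, 250) result at roughly 250 MB.
    out = np.zeros((len(df), n_items), dtype=np.uint8)
    out[rows, codes - 1] = 1  # column k-1 holds item code k
    return pd.DataFrame(out, index=df.index, columns=np.arange(1, n_items + 1))

df = pd.DataFrame({'item1': [10, 10, 22, 89],
                   'item2': [15, 35, 33, 103],
                   'item3': [np.nan, 65, 47, 41],
                   'item4': [np.nan, np.nan, 10, 22]})

result = one_hot_numpy(df)
print(result.shape)
```

Whether this is faster than the pandas and sklearn versions on real data is something to measure rather than assume.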