Assume I have a DataFrame with a million rows. Each row represents one shopper, and each number in a cell denotes an item code. There are approximately 250 items in the database. A toy table looks like the following:
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'item1': [10, 10, 22, 89],
                   'item2': [15, 35, 33, 103],
                   'item3': [np.nan, 65, 47, 41],
                   'item4': [np.nan, np.nan, 10, 22]})
df
```
item1 | item2 | item3 | item4 |
---|---|---|---|
10 | 15 | NaN | NaN |
10 | 35 | 65 | NaN |
22 | 33 | 47 | 10 |
89 | 103 | 41 | 22 |
The goal is to convert the table above into a one-hot-encoded table/DataFrame (each row still represents one shopper), such as:
1 | ... | 10 | ... | 15 | ... | 250 |
---|---|---|---|---|---|---|
0 | 0 | 1 | ... | 1 | ... | 0 |
0 | 0 | 1 | ... | 0 | ... | 0 |
Thus, the final data frame shape is something like (1000000, 250). Is there a way to convert a DataFrame into a one-hot encoded table quickly?
CodePudding user response:
Use sklearn's `OneHotEncoder`:

- Set `sparse=False`, since you want a dense 2D output.
- `fillna` with some numeric value (e.g., -1) and drop that column afterwards. Note: in pandas 1.4+, you can skip this step, since it's now easy to drop NaN column names.
- Use `groupby.sum` to aggregate the duplicate columns (thanks to @enke).
```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse=False)  # scikit-learn >= 1.2: use sparse_output=False instead
e = encoder.fit_transform(df.fillna(-1))
out = pd.DataFrame(e,
                   columns=np.hstack(encoder.categories_).astype(int),
                   index=df.index,  # rows are shoppers, as in the input
                   dtype=int,
                   ).drop(columns=[-1]).groupby(level=0, axis=1).sum()
```

Output:

```
   10  15  22  33  35  41  47  65  89  103
0   1   1   0   0   0   0   0   0   0    0
1   1   0   0   0   1   0   0   1   0    0
2   1   0   1   1   0   0   1   0   0    0
3   0   0   1   0   0   1   0   0   1    1
```
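Since the question mentions a million rows, it may also be worth keeping the result sparse instead of materializing a dense (1000000, 250) array. A minimal sketch that builds the shopper-by-item indicator matrix directly with `scipy.sparse` (this bypasses `OneHotEncoder` entirely; the variable names are illustrative):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# toy data from the question
df = pd.DataFrame({'item1': [10, 10, 22, 89],
                   'item2': [15, 35, 33, 103],
                   'item3': [np.nan, 65, 47, 41],
                   'item4': [np.nan, np.nan, 10, 22]})

vals = df.to_numpy().ravel()                       # flatten row-major
rows = np.repeat(np.arange(len(df)), df.shape[1])  # shopper index per entry
mask = ~np.isnan(vals)                             # keep non-missing entries
items, cols = np.unique(vals[mask], return_inverse=True)  # item codes -> column ids

# duplicate (row, col) pairs are summed by the constructor ...
m = sparse.csr_matrix((np.ones(mask.sum(), dtype=np.int8),
                       (rows[mask], cols)),
                      shape=(len(df), len(items)))
m.data = np.minimum(m.data, 1)  # ... so clip: a repeated item still encodes as 1

dense = pd.DataFrame(m.toarray(), columns=items.astype(int))  # only for inspection
```

On the toy data this reproduces the table above; on real data you would keep `m` sparse and only record `items` as the column labels.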
CodePudding user response:
The `melt` method might be useful.

Code:
```python
# Solution 1
import numpy as np
import pandas as pd

# Create the sample dataframe
df = pd.DataFrame({'item1': [10, 10, 22, 89],
                   'item2': [15, 35, 33, 103],
                   'item3': [np.nan, 65, 47, 41],
                   'item4': [np.nan, np.nan, 10, 22]})

# Transform the df into one-hot encoding
df = (df.melt(ignore_index=False)
        .reset_index()
        .pivot_table('variable', 'index', 'value', aggfunc='count')
        .fillna(0))
print(df)
```

```python
# Solution 2
import numpy as np
import pandas as pd

# Create the sample dataframe
df = pd.DataFrame({'item1': [10, 10, 22, 89],
                   'item2': [15, 35, 33, 103],
                   'item3': [np.nan, 65, 47, 41],
                   'item4': [np.nan, np.nan, 10, 22]})

# Transform the df into one-hot encoding
df = pd.get_dummies(df.melt(ignore_index=False).value).groupby(level=0).max()
print(df)
```
Output:
10.0 | 15.0 | 22.0 | 33.0 | 35.0 | 41.0 | 47.0 | 65.0 | 89.0 | 103.0 |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
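The same long format produced by `melt` also feeds `pd.crosstab` directly; a sketch along those lines (with a `clip` so that a shopper who repeats an item still encodes as 1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'item1': [10, 10, 22, 89],
                   'item2': [15, 35, 33, 103],
                   'item3': [np.nan, 65, 47, 41],
                   'item4': [np.nan, np.nan, 10, 22]})

long = df.melt(ignore_index=False)['value'].dropna()  # one row per (shopper, item)
out = pd.crosstab(long.index, long).clip(upper=1)     # counts -> 0/1 indicators
```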
CodePudding user response:
IIUC, in the original DataFrame, the rows already represent shoppers, right? Then we could convert each entry in `df` to strings, use `pd.get_dummies`, and then sum across to get a single column for each item:

```python
out = pd.get_dummies(df.astype(str))
out.columns = out.columns.str.split('_').str[1].str.split('.').str[0]
out = out.drop(columns='nan').groupby(level=0, axis=1).sum()
```
Output:

```
   10  103  15  22  33  35  41  47  65  89
0   1    0   1   0   0   0   0   0   0   0
1   1    0   0   0   0   1   0   0   1   0
2   1    0   0   1   1   0   0   1   0   0
3   0    1   0   1   0   0   1   0   0   1
```
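A variant of the same idea that sidesteps the string manipulation: `stack` discards the NaN entries for us, so no `'nan'` column appears and the columns stay numeric (a sketch, not necessarily faster):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'item1': [10, 10, 22, 89],
                   'item2': [15, 35, 33, 103],
                   'item3': [np.nan, 65, 47, 41],
                   'item4': [np.nan, np.nan, 10, 22]})

# stack() yields one (shopper, column) entry per purchased item;
# the explicit dropna() keeps this working under the newer stack implementation
out = (pd.get_dummies(df.stack().dropna().astype(int))
         .groupby(level=0).max()  # collapse back to one row per shopper
         .astype(int))            # get_dummies may return bool in pandas >= 2.0
```

Here the columns come out as integers in sorted order rather than lexically sorted strings.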