Assume I have a DataFrame with a million rows. Each row represents one shopper, and each number in a cell denotes an item code. There are approximately 250 items in the database. A toy table looks like the following:
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'item1': [10, 10, 22, 89],
                   'item2': [15, 35, 33, 103],
                   'item3': [np.nan, 65, 47, 41],
                   'item4': [np.nan, np.nan, 10, 22]})
df
```
item1 | item2 | item3 | item4 |
---|---|---|---|
10 | 15 | NaN | NaN |
10 | 35 | 65 | NaN |
22 | 33 | 47 | 10 |
89 | 103 | 41 | 22 |
The goal is to convert the table above into a one-hot-encoded table/DataFrame (each row still represents one shopper), such as:
1 | ... | 10 | ... | 15 | ... | 250 |
---|---|---|---|---|---|---|
0 | 0 | 1 | ... | 1 | ... | 0 |
0 | 0 | 1 | ... | 0 | ... | 0 |
Thus, the final data frame shape is something like (1000000, 250). Is there a way to convert a DataFrame into a one-hot encoded table quickly?
CodePudding user response:
Use sklearn's `OneHotEncoder`:

- Set `sparse=False`, since you want a dense 2D output.
- `fillna` with some numeric value (e.g., -1) and drop that column afterwards. Note: in pandas 1.4+, you can skip this step, since it's now easy to drop NaN column names.
- Use `groupby.sum` to aggregate the duplicate columns (thanks to @enke).
```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse=False)  # scikit-learn >= 1.2: use sparse_output=False instead
e = encoder.fit_transform(df.fillna(-1))
out = pd.DataFrame(e,
                   columns=np.hstack(encoder.categories_).astype(int),
                   index=df.index,  # rows are shoppers, as in the input
                   dtype=int,
                   ).drop(columns=[-1]).groupby(level=0, axis=1).sum()
```

Output:

```
   10  15  22  33  35  41  47  65  89  103
0   1   1   0   0   0   0   0   0   0    0
1   1   0   0   0   1   0   0   1   0    0
2   1   0   1   1   0   0   1   0   0    0
3   0   0   1   0   0   1   0   0   1    1
```
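Since the question mentions a million rows, it may also be worth keeping the result sparse instead of materializing a dense (1000000, 250) array. A minimal sketch that builds the shopper-by-item indicator matrix directly with `scipy.sparse` (this bypasses `OneHotEncoder` entirely; the variable names are illustrative):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# toy data from the question
df = pd.DataFrame({'item1': [10, 10, 22, 89],
                   'item2': [15, 35, 33, 103],
                   'item3': [np.nan, 65, 47, 41],
                   'item4': [np.nan, np.nan, 10, 22]})

vals = df.to_numpy().ravel()                       # flatten row-major
rows = np.repeat(np.arange(len(df)), df.shape[1])  # shopper index per entry
mask = ~np.isnan(vals)                             # keep non-missing entries
items, cols = np.unique(vals[mask], return_inverse=True)  # item codes -> column ids

# duplicate (row, col) pairs are summed by the constructor ...
m = sparse.csr_matrix((np.ones(mask.sum(), dtype=np.int8),
                       (rows[mask], cols)),
                      shape=(len(df), len(items)))
m.data = np.minimum(m.data, 1)  # ... so clip: a repeated item still encodes as 1

dense = pd.DataFrame(m.toarray(), columns=items.astype(int))  # only for inspection
```

On the toy data this reproduces the table above; on real data you would keep `m` sparse and only record `items` as the column labels.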
CodePudding user response:
The `melt` method might be useful.

Code:
```python
# Solution 1
import numpy as np
import pandas as pd

# Create the sample dataframe
df = pd.DataFrame({'item1': [10, 10, 22, 89],
                   'item2': [15, 35, 33, 103],
                   'item3': [np.nan, 65, 47, 41],
                   'item4': [np.nan, np.nan, 10, 22]})

# Transform the df into one-hot encoding
df = (df.melt(ignore_index=False)
        .reset_index()
        .pivot_table('variable', 'index', 'value', aggfunc='count')
        .fillna(0))
print(df)
```

```python
# Solution 2
import numpy as np
import pandas as pd

# Create the sample dataframe
df = pd.DataFrame({'item1': [10, 10, 22, 89],
                   'item2': [15, 35, 33, 103],
                   'item3': [np.nan, 65, 47, 41],
                   'item4': [np.nan, np.nan, 10, 22]})

# Transform the df into one-hot encoding
df = pd.get_dummies(df.melt(ignore_index=False).value).groupby(level=0).max()
print(df)
```
Output:
10.0 | 15.0 | 22.0 | 33.0 | 35.0 | 41.0 | 47.0 | 65.0 | 89.0 | 103.0 |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
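The same long format produced by `melt` also feeds `pd.crosstab` directly; a sketch along those lines (with a `clip` so that a shopper who repeats an item still encodes as 1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'item1': [10, 10, 22, 89],
                   'item2': [15, 35, 33, 103],
                   'item3': [np.nan, 65, 47, 41],
                   'item4': [np.nan, np.nan, 10, 22]})

long = df.melt(ignore_index=False)['value'].dropna()  # one row per (shopper, item)
out = pd.crosstab(long.index, long).clip(upper=1)     # counts -> 0/1 indicators
```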
CodePudding user response:
IIUC, in the original DataFrame, the rows already represent shoppers, right? Then we could convert each entry in `df` to strings, use `pd.get_dummies`, and then sum across to get a single column for each item:

```python
out = pd.get_dummies(df.astype(str))
out.columns = out.columns.str.split('_').str[1].str.split('.').str[0]
out = out.drop(columns='nan').groupby(level=0, axis=1).sum()
```
Output:

```
   10  103  15  22  33  35  41  47  65  89
0   1    0   1   0   0   0   0   0   0   0
1   1    0   0   0   0   1   0   0   1   0
2   1    0   0   1   1   0   0   1   0   0
3   0    1   0   1   0   0   1   0   0   1
```
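A variant of the same idea that sidesteps the string manipulation: `stack` discards the NaN entries for us, so no `'nan'` column appears and the columns stay numeric (a sketch, not necessarily faster):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'item1': [10, 10, 22, 89],
                   'item2': [15, 35, 33, 103],
                   'item3': [np.nan, 65, 47, 41],
                   'item4': [np.nan, np.nan, 10, 22]})

# stack() yields one (shopper, column) entry per purchased item;
# the explicit dropna() keeps this working under the newer stack implementation
out = (pd.get_dummies(df.stack().dropna().astype(int))
         .groupby(level=0).max()  # collapse back to one row per shopper
         .astype(int))            # get_dummies may return bool in pandas >= 2.0
```

Here the columns come out as integers in sorted order rather than lexically sorted strings.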