I have a non-uniform list as follows:
[['E', 'A', 'P'],
['E', 'A', 'X', 'P'],
['E', 'A', 'P'],
['P'],
['E', 'A', 'X', 'P'],
['E', 'A', 'P'],
['A', 'X', 'P'],
['E', 'A', 'P'],
['E', 'A', 'P'],
['E', 'A', 'X', 'P'],
['E', 'A', 'P'],
['E', 'A', 'P'],
['A', 'X', 'P']]
I would like to create a DataFrame from this, where each column represents one of the four possible letters "E", "A", "X" and "P" in a one-hot encoded manner. What is the most efficient way to go about this?
CodePudding user response:
I would recommend MultiLabelBinarizer from sklearn:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(l), columns=mlb.classes_)
Out[170]:
    A  E  P  X
0   1  1  1  0
1   1  1  1  1
2   1  1  1  0
3   0  0  1  0
4   1  1  1  1
5   1  1  1  0
6   1  0  1  1
7   1  1  1  0
8   1  1  1  0
9   1  1  1  1
10  1  1  1  0
11  1  1  1  0
12  1  0  1  1
Or we can do it the pandas way, with explode and str.get_dummies:
df = pd.Series(l).explode().str.get_dummies().groupby(level=0).sum()
Out[176]:
    A  E  P  X
0   1  1  1  0
1   1  1  1  1
2   1  1  1  0
3   0  0  1  0
4   1  1  1  1
5   1  1  1  0
6   1  0  1  1
7   1  1  1  0
8   1  1  1  0
9   1  1  1  1
10  1  1  1  0
11  1  1  1  0
12  1  0  1  1
Note that l is your list here.
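One caveat with the explode approach: groupby(level=0).sum() counts occurrences, so a letter repeated inside one inner list would give 2 instead of 1. The question's data has no repeats, so this does not affect the output above. The sketch below uses a small hypothetical input (not the question's list) with a repeated "A" to show the difference, and swaps sum() for max() to keep the result strictly 0/1.

```python
import pandas as pd

# Hypothetical input: the first row repeats "A" on purpose
rows = [["E", "A", "A"], ["P"], ["A", "X", "P"]]

# One indicator row per letter, still indexed by the original row number
dummies = pd.Series(rows).explode().str.get_dummies()

counts = dummies.groupby(level=0).sum()  # counts repeats: "A" becomes 2 in row 0
onehot = dummies.groupby(level=0).max()  # caps every cell at 1

print(counts)
print(onehot)
```

If the inner lists are known to contain no duplicates, sum() and max() give the same result.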
CodePudding user response:
Try:
import pandas as pd
lst = [
["E", "A", "P"],
["E", "A", "X", "P"],
["E", "A", "P"],
["P"],
["E", "A", "X", "P"],
["E", "A", "P"],
["A", "X", "P"],
["E", "A", "P"],
["E", "A", "P"],
["E", "A", "X", "P"],
["E", "A", "P"],
["E", "A", "P"],
["A", "X", "P"],
]
df = pd.DataFrame({v: 1 for v in row} for row in lst).notna().astype(int)
print(df)
Prints:
    E  A  P  X
0   1  1  1  0
1   1  1  1  1
2   1  1  1  0
3   0  0  1  0
4   1  1  1  1
5   1  1  1  0
6   0  1  1  1
7   1  1  1  0
8   1  1  1  0
9   1  1  1  1
10  1  1  1  0
11  1  1  1  0
12  0  1  1  1
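The column order differs between the approaches: alphabetical in one case, first-seen order in the other. If the columns should come out in the order named in the question (E, A, X, P), one option is to reindex afterwards. This is a minimal sketch on a shortened, hypothetical input; the name wanted is made up for illustration.

```python
import pandas as pd

# Shortened hypothetical input, in the same shape as the question's list
lst = [["E", "A", "P"], ["P"], ["A", "X", "P"]]

# Same idea as above: one dict per row, then missing values become 0
df = pd.DataFrame([{v: 1 for v in row} for row in lst]).notna().astype(int)

# Force a fixed column order; a letter that never occurs becomes an all-zero column
wanted = ["E", "A", "X", "P"]
df = df.reindex(columns=wanted, fill_value=0)

print(df)
```

The fill_value=0 argument only matters when a letter in wanted appears in none of the rows.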