Ragged list to dataframe

I have a non-uniform list as follows:

[['E', 'A', 'P'],
 ['E', 'A', 'X', 'P'],
 ['E', 'A', 'P'],
 ['P'],
 ['E', 'A', 'X', 'P'],
 ['E', 'A', 'P'],
 ['A', 'X', 'P'],
 ['E', 'A', 'P'],
 ['E', 'A', 'P'],
 ['E', 'A', 'X', 'P'],
 ['E', 'A', 'P'],
 ['E', 'A', 'P'],
 ['A', 'X', 'P']]

I would like to create a data frame from this in which each column represents one of the four possible letters ("E", "A", "X" and "P") in a one-hot encoded manner. What is the most efficient way to go about this?

CodePudding user response：

I would recommend MultiLabelBinarizer from sklearn:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(l), columns=mlb.classes_)
Out[170]: 
    A  E  P  X
0   1  1  1  0
1   1  1  1  1
2   1  1  1  0
3   0  0  1  0
4   1  1  1  1
5   1  1  1  0
6   1  0  1  1
7   1  1  1  0
8   1  1  1  0
9   1  1  1  1
10  1  1  1  0
11  1  1  1  0
12  1  0  1  1

Or stay within pandas and use explode with str.get_dummies:

df = pd.Series(l).explode().str.get_dummies().groupby(level=0).sum()
Out[176]: 
    A  E  P  X
0   1  1  1  0
1   1  1  1  1
2   1  1  1  0
3   0  0  1  0
4   1  1  1  1
5   1  1  1  0
6   1  0  1  1
7   1  1  1  0
8   1  1  1  0
9   1  1  1  1
10  1  1  1  0
11  1  1  1  0
12  1  0  1  1

Note that l is your list here.
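
To see what each step of the explode / str.get_dummies chain contributes, here is a minimal sketch on a shortened sample of the list. The three-row sample and the intermediate variable names (s, dummies) are illustrative assumptions, not part of the original answer:

```python
import pandas as pd

# Shortened sample of the list from the question (assumption for illustration)
l = [["E", "A", "P"], ["P"], ["A", "X", "P"]]

# Step 1: one row per letter; the original row index is repeated (0, 0, 0, 1, 2, 2, 2)
s = pd.Series(l).explode()

# Step 2: one indicator column per distinct letter, still one row per letter
dummies = s.str.get_dummies()

# Step 3: collapse back to one row per original list by summing within each index
df = dummies.groupby(level=0).sum()
print(df)
```

If a row could contain the same letter twice, sum() would produce counts above 1; replacing .sum() with .max() should keep the result strictly 0/1.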

CodePudding user response：

Try:

import pandas as pd

lst = [
    ["E", "A", "P"],
    ["E", "A", "X", "P"],
    ["E", "A", "P"],
    ["P"],
    ["E", "A", "X", "P"],
    ["E", "A", "P"],
    ["A", "X", "P"],
    ["E", "A", "P"],
    ["E", "A", "P"],
    ["E", "A", "X", "P"],
    ["E", "A", "P"],
    ["E", "A", "P"],
    ["A", "X", "P"],
]

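# One dict per row (letter -> 1); letters absent from a row become NaN, which notna() maps to 0
df = pd.DataFrame({v: 1 for v in row} for row in lst).notna().astype(int)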
print(df)

Prints:

    E  A  P  X
0   1  1  1  0
1   1  1  1  1
2   1  1  1  0
3   0  0  1  0
4   1  1  1  1
5   1  1  1  0
6   0  1  1  1
7   1  1  1  0
8   1  1  1  0
9   1  1  1  1
10  1  1  1  0
11  1  1  1  0
12  0  1  1  1
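
The approaches in this thread return the columns in different orders (sorted, or in order of first appearance). If the order from the question (E, A, X, P) matters, one option is to reindex the columns afterwards. This is a minimal sketch on a shortened sample; the name wanted and the three-row sample are assumptions made for illustration:

```python
import pandas as pd

# Shortened sample of the list from the question (assumption for illustration)
lst = [["E", "A", "P"], ["P"], ["A", "X", "P"]]

# Desired column order (hypothetical name, taken from the letters listed in the question)
wanted = ["E", "A", "X", "P"]

# Build the indicator frame as in the answer above
df = pd.DataFrame([{v: 1 for v in row} for row in lst]).notna().astype(int)

# Put the columns in the desired order; a letter that never occurs gets an all-zero column
df = df.reindex(columns=wanted, fill_value=0)
print(df)
```

Passing fill_value=0 also means that a letter which never appears in the data still gets its own column, which can be useful when the set of possible letters is known in advance.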