Create one-hot encoding for multi-labels

Given the below data stored in a text file:

fricative     f, s, S, x, v, z, Z, G, h
nasal       n, m, N
lateral   r, l, j, J
labial      p, b, m, f, v
coronal   s, z, n, d, t, r, l, j, J, S, Z
dorsal      g, k, G, x, N, h
frontal   e, i, I, E, E:, E~, j, J,

How can I write a function that puts each individual letter in the first column and, next to each letter, a one-hot encoding of the labels that apply to it? For example:

letters	fricative	lateral	labial	coronal	dorsal	frontal
e	0	0	0	0	0	1
f	1	0	1	0	0	0
g	0	0	0	0	1	0
j	0	1	0	1	0	1

I looked at this link, but is it possible to do this in a custom function like the one below?

def one_hot_labels(df):
    '''
    - for each line, create a dictionary indicating the presence (1)
      or the absence (0) of every label
    - put the dictionaries in the list and convert it to a data frame
    '''
    dict_labels = []
    # n_labels is the number of distinct labels (not defined in this sketch)
    for i in range(len(df)):
        d = dict(zip(range(n_labels), [0]*n_labels))
        ...
        dict_labels.append(d)

    df_labels = pd.DataFrame(dict_labels)
    return df_labels

CodePudding user response：

Try .str.get_dummies():

# assuming the two columns are named `text` (category) and `labels` (comma-separated letters)
(df['labels'].str.replace(' ','')    # remove all spaces
    .str.get_dummies(',')            # get the dummies
    .set_index(df['text'])           # assign the associated text
    .T                               # transpose to match the requirement
)

This is what you get:

text  fricative  nasal  lateral  labial  coronal  dorsal  frontal
E             0      0        0       0        0       0        1
E:            0      0        0       0        0       0        1
E~            0      0        0       0        0       0        1
G             1      0        0       0        0       1        0
I             0      0        0       0        0       0        1
J             0      0        1       0        1       0        1
N             0      1        0       0        0       1        0
S             1      0        0       0        1       0        0
Z             1      0        0       0        1       0        0
b             0      0        0       1        0       0        0
d             0      0        0       0        1       0        0
e             0      0        0       0        0       0        1
f             1      0        0       1        0       0        0
g             0      0        0       0        0       1        0
h             1      0        0       0        0       1        0
i             0      0        0       0        0       0        1
j             0      0        1       0        1       0        1
k             0      0        0       0        0       1        0
l             0      0        1       0        1       0        0
m             0      1        0       1        0       0        0
n             0      1        0       0        1       0        0
p             0      0        0       1        0       0        0
r             0      0        1       0        1       0        0
s             1      0        0       0        1       0        0
t             0      0        0       0        1       0        0
v             1      0        0       1        0       0        0
x             1      0        0       0        0       1        0
z             1      0        0       0        1       0        0
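
The snippet above assumes the file is already loaded into a two-column frame. Here is a minimal sketch of that loading step, assuming the category name is separated from its letters by the first run of whitespace. The inline `raw` string stands in for the text file, and the names `rows` and `one_hot` are only illustrative:

```python
import pandas as pd

# Stand-in for the contents of the text file from the question.
raw = """fricative     f, s, S, x, v, z, Z, G, h
nasal       n, m, N
lateral   r, l, j, J
labial      p, b, m, f, v
coronal   s, z, n, d, t, r, l, j, J, S, Z
dorsal      g, k, G, x, N, h
frontal   e, i, I, E, E:, E~, j, J,"""

# Split each line once, at the first run of whitespace:
# the category name goes to `text`, the letter list to `labels`.
rows = [line.split(maxsplit=1) for line in raw.splitlines() if line.strip()]
df = pd.DataFrame(rows, columns=["text", "labels"])

one_hot = (
    df["labels"].str.replace(" ", "", regex=False)  # remove all spaces
    .str.get_dummies(",")                           # one column per letter
    .set_index(df["text"])                          # rows labelled by category
    .T                                              # letters as rows
)
```

With a real file, the same split could be applied to the lines read from disk instead of `raw.splitlines()`.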

CodePudding user response：

You can read the values into a dict (category to letters) and a set (all letters), then build a DataFrame from the membership conditions:

import io
import pandas as pd

input_str = io.StringIO('''fricative     f, s, S, x, v, z, Z, G, h
nasal       n, m, N
lateral   r, l, j, J
labial      p, b, m, f, v
coronal   s, z, n, d, t, r, l, j, J, S, Z
dorsal      g, k, G, x, N, h
frontal   e, i, I, E, E:, E~, j, J''')

category_to_letters = dict()
letters = set()

for line in input_str:
    _category, *_letters = line.strip().split()
    # each token looks like 'f,' so keep only the part before the comma
    _letters = set(_letter.split(',')[0] for _letter in _letters if _letter.split(',')[0].strip())
    category_to_letters[_category] = _letters
    letters = letters.union(_letters)

# pandas may reject a set as an index, so sort the letters into a list first
df = pd.DataFrame(index=sorted(letters), columns=list(category_to_letters))
for col in df.columns:
    # 1 if the letter belongs to the category, else 0
    df[col] = df.index.isin(category_to_letters[col]).astype(int)
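
For comparison, the dictionary-per-row idea from the question can also be sketched in plain Python. This is only one possible reading of that docstring. The helper name `one_hot_from_mapping` and its input format (a dict mapping each category to its set of letters, like `category_to_letters` above) are assumptions made for illustration:

```python
import pandas as pd

def one_hot_from_mapping(category_to_letters):
    """Build a letters-by-categories 0/1 table from {category: set_of_letters}."""
    # Collect every letter that appears under at least one category.
    all_letters = sorted(set().union(*category_to_letters.values()))
    records = []
    for letter in all_letters:
        # One dictionary per letter: 1 if the letter carries the label, else 0.
        record = {"letters": letter}
        for category, members in category_to_letters.items():
            record[category] = int(letter in members)
        records.append(record)
    # Convert the list of dictionaries to a data frame indexed by letter.
    return pd.DataFrame(records).set_index("letters")

# Trimmed subset of the question's data, just to show the call.
example = {
    "fricative": {"f", "s", "v"},
    "labial": {"p", "b", "f", "v"},
}
table = one_hot_from_mapping(example)
```

Building one dictionary per letter keeps the logic close to what the question describes, at the cost of a Python-level loop. For data of this size that should not matter.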