Given the below data stored in a text file:
fricative f, s, S, x, v, z, Z, G, h
nasal n, m, N
lateral r, l, j, J
labial p, b, m, f, v
coronal s, z, n, d, t, r, l, j, J, S, Z
dorsal g, k, G, x, N, h
frontal e, i, I, E, E:, E~, j, J,
How can I create a one-hot encoding function that puts each individual letter in the first column, followed by a one-hot encoding that shows which labels apply to that letter:
letters | fricative | nasal | lateral | labial | coronal | dorsal | frontal |
---|---|---|---|---|---|---|---|
e | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
f | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
g | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
j | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
I looked at this link, but is it possible to do this with a custom function like the one below?
import pandas as pd

def one_hot_labels(df):
    '''
    - for each line, create a dictionary indicating the presence (1)
      or the absence (0) of every label
    - put the dictionaries in a list and convert it to a data frame
    '''
    dict_labels = []
    for i in range(len(df)):
        # n_labels (the number of labels) is not defined in this snippet
        d = dict(zip(range(n_labels), [0]*n_labels))
        ...
        dict_labels.append(d)
    df_labels = pd.DataFrame(dict_labels)
    return df_labels
CodePudding user response:
Try .str.get_dummies():
# assuming the two columns are named `text` and `letters`
(df['letters'].str.replace(' ','')  # remove all spaces
   .str.get_dummies(',')            # get the dummies
   .set_index(df['text'])           # use the category names as the index
   .T                               # transpose so that the letters become rows
)
This is what you get:
text fricative nasal lateral labial coronal dorsal frontal
E 0 0 0 0 0 0 1
E: 0 0 0 0 0 0 1
E~ 0 0 0 0 0 0 1
G 1 0 0 0 0 1 0
I 0 0 0 0 0 0 1
J 0 0 1 0 1 0 1
N 0 1 0 0 0 1 0
S 1 0 0 0 1 0 0
Z 1 0 0 0 1 0 0
b 0 0 0 1 0 0 0
d 0 0 0 0 1 0 0
e 0 0 0 0 0 0 1
f 1 0 0 1 0 0 0
g 0 0 0 0 0 1 0
h 1 0 0 0 0 1 0
i 0 0 0 0 0 0 1
j 0 0 1 0 1 0 1
k 0 0 0 0 0 1 0
l 0 0 1 0 1 0 0
m 0 1 0 1 0 0 0
n 0 1 0 0 1 0 0
p 0 0 0 1 0 0 0
r 0 0 1 0 1 0 0
s 1 0 0 0 1 0 0
t 0 0 0 0 1 0 0
v 1 0 0 1 0 0 0
x 1 0 0 0 0 1 0
z 1 0 0 0 1 0 0
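The get_dummies snippet assumes the file has already been loaded into a DataFrame with a `text` column (the category name) and a `letters` column (the comma-separated letters). The post does not show that step, so here is a minimal sketch of one way to do it. The `raw` buffer stands in for the text file, and the names `raw`, `rows` and `one_hot` are made up for this illustration:

```python
import io
import pandas as pd

# Stand-in for the text file from the question (assumption: one category
# per line, category name first, then the comma-separated letters).
raw = io.StringIO("""fricative f, s, S, x, v, z, Z, G, h
nasal n, m, N
lateral r, l, j, J
labial p, b, m, f, v
coronal s, z, n, d, t, r, l, j, J, S, Z
dorsal g, k, G, x, N, h
frontal e, i, I, E, E:, E~, j, J,
""")

# Split each non-empty line once, at the first space:
# category name on the left, letter list on the right.
# A trailing comma (as on the `frontal` line) is removed first.
rows = [line.strip().rstrip(",").split(" ", 1) for line in raw if line.strip()]
df = pd.DataFrame(rows, columns=["text", "letters"])

one_hot = (
    df["letters"].str.replace(" ", "", regex=False)  # remove all spaces
    .str.get_dummies(",")                            # one 0/1 column per letter
    .set_index(df["text"])                           # rows = categories
    .T                                               # rows = letters
)
print(one_hot)
```

With a real file, `raw` could be replaced by an open file handle, since the list comprehension only iterates over lines.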
CodePudding user response:
You can read the values into a set and a dict, and then construct a DataFrame with the required conditions:
import io
import pandas as pd

input_str = io.StringIO('''fricative f, s, S, x, v, z, Z, G, h
nasal n, m, N
lateral r, l, j, J
labial p, b, m, f, v
coronal s, z, n, d, t, r, l, j, J, S, Z
dorsal g, k, G, x, N, h
frontal e, i, I, E, E:, E~, j, J''')

category_to_letters = dict()
letters = set()
for line in input_str:  # `line` instead of `input`, to avoid shadowing the built-in
    _category, *_letters = line.strip().split()
    # drop the trailing comma from each token and skip empty tokens
    _letters = set(_letter.split(',')[0] for _letter in _letters if _letter.split(',')[0].strip())
    category_to_letters[_category] = _letters
    letters = letters.union(_letters)

# sort the letters so that the row order is deterministic
df = pd.DataFrame({}, index=sorted(letters), columns=list(category_to_letters))
for col in df.columns:
    # 1 if the letter belongs to the category, 0 otherwise
    df[col] = df.index.isin(category_to_letters[col]).astype(int)
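If you prefer the shape of the custom function from the question (one dictionary per row, collected in a list and converted to a data frame), the same idea can be sketched as below. This is only an illustration under assumptions: the function takes an iterable of lines rather than a DataFrame, and the input is a shortened two-line sample, not the full file.

```python
import io
import pandas as pd

def one_hot_labels(lines):
    """Hypothetical helper: `lines` is any iterable of 'category a, b, c' strings."""
    category_to_letters = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # The first word is the category, the rest is the letter list.
        category, _, rest = line.strip().partition(" ")
        category_to_letters[category] = {
            tok.strip() for tok in rest.split(",") if tok.strip()
        }

    # Every letter seen in any category, in a deterministic order.
    letters = sorted(set().union(*category_to_letters.values()))

    # One dictionary per letter: 1 if the letter carries the label, else 0.
    dict_labels = [
        {cat: int(letter in members) for cat, members in category_to_letters.items()}
        for letter in letters
    ]
    return pd.DataFrame(dict_labels, index=pd.Index(letters, name="letters"))

# Shortened sample input (two of the seven lines from the question).
sample = io.StringIO("nasal n, m, N\nlabial p, b, m, f, v\n")
table = one_hot_labels(sample)
print(table)
```

Passing an open file handle instead of `sample` should work the same way, because the function only iterates over lines.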