I have this string vocab file: https://drive.google.com/file/d/1mL461QGC5KcA3M1r8AESaPjZ3D_ufgPA/view?usp=sharing.
I have this sentences file, built from the words in the vocab file above: https://drive.google.com/file/d/1w5ma4ROjyp6xmZfvnIQjsdH2I_K7lHoo/view?usp=sharing.
I want to map every sentence to its corresponding integers in the vocab file.
What I have tried so far: first, I put all the sentences into a list and then into this DataFrame:
import pandas as pd

f = open(f'./drive/MyDrive/[kepsdataset/train_preprocess.txt', "r")
output = []
dicts = {}
tokens = []
tags = []
for line in f:
    if len(line.strip()) != 0:
        # Non-empty line: "<token>\t<tag>"
        fields = line.split('\t')
        text = fields[0].lower()
        tag = fields[1].strip()
        tokens.append(text)
        tags.append(tag)
    else:
        # A blank line marks the end of a sentence
        dicts['token'] = tokens  # these are the sentences I want to map to integers
        dicts['tag'] = tags
        output.append(dicts)
        dicts = {}
        tokens = []
        tags = []
f.close()
df = pd.DataFrame(output)
df.head(10)
I have also converted the vocabulary list (from the vocab file) into a list of integers:
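import numpy as np

my_file = open("vocab_uncased.txt", "r")
data = my_file.read()
my_file.close()
data_into_list = data.split("\n")
print(data_into_list)
# Each word's ID is its position among the unique vocab entries (first-occurrence order)
unique_words = np.array(list(dict.fromkeys(data_into_list)))
encoded_string = [np.where(unique_words == e)[0][0] for e in data_into_list]
print(encoded_string)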
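For reference, the same word-to-ID assignment can be sketched with a plain dictionary, which avoids scanning the array once per word. The toy `vocab_lines` list is an assumption standing in for the lines of `vocab_uncased.txt`, and `word_to_id` is a hypothetical name:

```python
# Toy stand-in for the lines read from the vocab file (assumed data).
vocab_lines = ["bca", "kartu", "kredit", "bca", "ribet"]

# dict.fromkeys keeps the first occurrence of each word, in order,
# so every unique word gets its position as an integer ID.
word_to_id = {word: idx for idx, word in enumerate(dict.fromkeys(vocab_lines))}

# One ID per line of the vocab list, matching the np.where approach.
encoded_string = [word_to_id[word] for word in vocab_lines]
print(encoded_string)
```

A repeated word (here "bca") gets the ID of its first occurrence, just as in the `np.where` version.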
What I want to do is to put the encoded string into the DataFrame above. How can I do it? Example:
sentence (in token field in DataFrame): ['Setelah', 'melalui', 'proses', 'telepon', 'yang', 'panjang', 'tutup', 'sudah', 'kartu', 'kredit', 'bca', 'ribet']
encoded sentence (using vocab file): [2024, 1317, 1806, 2182, 2400, 1624, 2333, 2107, 1013, 1155, 317, 1853] --> to be put into a new dataframe column
CodePudding user response:
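IIUC, you can build a lookup Series from the vocab (word as index, integer as value), then explode the token lists, map each token through it, and regroup by row:
df = pd.DataFrame(output)
vocab = pd.Series(encoded_string, index=data_into_list)
df['encoded'] = df.explode(df.columns.tolist())['token'] \
    .map(vocab).groupby(level=0).agg(list)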
Output:
>>> df
token tag encoded
0 [setelah, melalui, proses, telepon, yang, panj... [O, B, B, I, O, O, B, O, B, I, I, B] [2024, 1317, 1806, 2182, 2400, 1624, 2333, 210...
1 [@halobca, saya, mencoba, mengakses, menu, m-b... [B, O, O, B, B, I, O, O, O, B, I, O, O, O, O, ... [130, 1917, 1374, 1403, 1470, 1240, 1917, 1545...
2 [hanya, saya, atau, @halobca, klikbca, bisnis,... [O, O, O, B, B, I, O, B] [857, 1917, 249, 130, 1130, 439, 1332, 767]
3 [teller, bank, bca, ini, menanyakan, kabar, sa... [O, O, O, O, O, O, O, B, O] [2190, 288, 317, 918, 1365, 983, 1917, 2081, 1...
4 [bca, senantiasa, menjaga, rahasia, data, cust... [B, O, B, B, B, I] [317, 1983, 1458, 1824, 575, 551]
.. ... ... ...
794 [hi, cs, kenapa, pelayanan, di, bca, kodya, te... [O, B, O, B, O, B, I, I, I, I, I, I, O, B, O, ... [873, 540, 1077, 1657, 598, 317, 1136, 2175, 2...
795 [walau, sudah, prioritas, tetap, saja, antreny... [O, O, B, O, O, B, B, O, O, B, O, O, O] [2374, 2107, 1791, 2281, 1885, 231, 1183, 282,...
796 [selama, menggunakan, layanan, e-channel, bca,... [O, B, O, B, I, O, O, O, B, I, B, I, B, B, B, ... [1966, 1427, 1198, 746, 317, 1520, 2288, 1341,...
797 [mau, menabung, mau, simpan, uang, atau, pun, ... [O, B, O, B, B, O, O, B, B, I, O, O, O, B] [1306, 1361, 1306, 2055, 2335, 249, 1817, 1491...
798 [toko, daring, juga, kebanyakan, pakai, bca, m... [B, I, O, O, B, I, I, O, O, O, B, B, I, I, B, ... [2297, 569, 976, 1037, 1609, 317, 1238, 258, 1...
[799 rows x 3 columns]
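The same idea can be tried on a small self-contained example. This is only a sketch: the toy vocabulary and sentences are made up, and the `-1` placeholder for tokens missing from the vocab is an assumption, not something the original data defines.

```python
import pandas as pd

# Made-up vocabulary; a word's ID is its position in the list (assumption).
vocab_words = ["bca", "kartu", "kredit", "ribet", "sudah"]
vocab = pd.Series(range(len(vocab_words)), index=vocab_words)

# Made-up DataFrame with one list of tokens (and tags) per row.
df = pd.DataFrame({
    "token": [["sudah", "kartu", "kredit"], ["bca", "ribet", "mahal"]],
    "tag": [["O", "B", "I"], ["B", "O", "O"]],
})

# One token per row, look each one up in the vocab, then regroup by original row.
exploded = df["token"].explode()
ids = exploded.map(vocab)          # tokens missing from the vocab become NaN
ids = ids.fillna(-1).astype(int)   # -1 marks out-of-vocabulary tokens (assumption)
df["encoded"] = ids.groupby(level=0).agg(list)

print(df["encoded"].tolist())
```

Here the second row ends in `-1` because "mahal" is not in the toy vocabulary. Without the `fillna` step, that entry would stay `NaN` and the IDs would come back as floats.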