How to map list of string to existing list of integer?

Time:03-27

I have this string vocab file: https://drive.google.com/file/d/1mL461QGC5KcA3M1r8AESaPjZ3D_ufgPA/view?usp=sharing.

I have this sentences file, built from the words in the vocab file above: https://drive.google.com/file/d/1w5ma4ROjyp6xmZfvnIQjsdH2I_K7lHoo/view?usp=sharing.

I want to map every token of each sentence to its corresponding integer in the vocab file.

What I have tried so far: first, I read every sentence into a list and built this DataFrame:

import pandas as pd

f = open('./drive/MyDrive/[kepsdataset/train_preprocess.txt', "r")
output = []
dicts = {}
tokens = []
tags = []

for line in f:
  if len(line.strip()) != 0:
    fields = line.split('\t')
    text = fields[0].lower()
    tag = fields[1].strip()
    tokens.append(text)
    tags.append(tag)
  else:
    dicts['token'] = tokens # this is the sentences I want to map into integer
    dicts['tag'] = tags
    output.append(dicts)
    dicts = {}
    tokens = []
    tags = []
    
df = pd.DataFrame(output)

df.head(10)
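
One thing to watch in the loop above: a sentence is only appended when a blank line follows it, so the last sentence is lost if the file does not end with a blank line. A minimal sketch of the same parsing with a final flush, run on a made-up two-sentence sample instead of the real file (`sample` and `read_sentences` are hypothetical names, and the token<TAB>tag layout is assumed from the code above). Unlike the original loop, it also skips repeated blank lines rather than appending empty rows:

```python
import io
import pandas as pd

# Hypothetical sample in the assumed token<TAB>tag layout, blank line between sentences.
sample = "Kartu\tB\nkredit\tI\n\nbca\tB\nribet\tB\n"

def read_sentences(lines):
    """Group token/tag lines into one dict per blank-line-separated sentence."""
    output, tokens, tags = [], [], []
    for line in lines:
        if line.strip():
            text, tag = line.rstrip("\n").split("\t")[:2]
            tokens.append(text.lower())
            tags.append(tag.strip())
        elif tokens:
            # Blank line: close the current sentence and start a new one.
            output.append({"token": tokens, "tag": tags})
            tokens, tags = [], []
    if tokens:
        # Flush the last sentence when there is no trailing blank line.
        output.append({"token": tokens, "tag": tags})
    return output

df = pd.DataFrame(read_sentences(io.StringIO(sample)))
print(df)
```

With the real data, the `io.StringIO(sample)` argument would be replaced by the opened file object.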

I have converted the vocabulary list (from the vocab file) into a list of integers:

import numpy as np

my_file = open("vocab_uncased.txt", "r")
  
data = my_file.read()
  
data_into_list = data.split("\n")
print(data_into_list)

encoded_string = [np.where(np.array(list(dict.fromkeys(data_into_list))) == e)[0][0] for e in data_into_list]
print(encoded_string)
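
The same encoding (each word's ID is its position among the unique words) can also be sketched with a plain dictionary, which avoids rebuilding and scanning the array for every word. This is only an illustration on a made-up miniature vocabulary; `vocab_words` and `word_to_id` are hypothetical names, not part of the original code:

```python
# Hypothetical miniature vocabulary standing in for the lines of vocab_uncased.txt.
vocab_words = ["bca", "kartu", "kredit", "bca", "ribet"]

# dict.fromkeys keeps first-seen order and drops duplicates,
# so enumerate() gives each unique word its position as an ID.
word_to_id = {word: i for i, word in enumerate(dict.fromkeys(vocab_words))}

# One ID per original line, matching what the np.where expression computes.
encoded_string = [word_to_id[word] for word in vocab_words]
print(encoded_string)
```

A dictionary like `word_to_id` can then be used directly for lookups, for example `[word_to_id[t] for t in tokens]` for one sentence, assuming every token is in the vocabulary.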

What I want to do is to put the encoded string into the DataFrame above. How can I do it? Example:

sentence (in token field in DataFrame): ['Setelah', 'melalui', 'proses', 'telepon', 'yang', 'panjang', 'tutup', 'sudah', 'kartu', 'kredit', 'bca', 'ribet'] 
encoded sentence (using vocab file): [2024, 1317, 1806, 2182, 2400, 1624, 2333, 2107, 1013, 1155, 317, 1853] --> to be put into a new dataframe column

User response:

IIUC:

df = pd.DataFrame(output)
vocab = pd.Series(encoded_string, index=data_into_list)

df['encoded'] = df.explode(df.columns.tolist())['token'] \
                  .map(vocab).groupby(level=0).agg(list)

Output:

>>> df
                                                 token                                                tag                                            encoded
0    [setelah, melalui, proses, telepon, yang, panj...               [O, B, B, I, O, O, B, O, B, I, I, B]  [2024, 1317, 1806, 2182, 2400, 1624, 2333, 210...
1    [@halobca, saya, mencoba, mengakses, menu, m-b...  [B, O, O, B, B, I, O, O, O, B, I, O, O, O, O, ...  [130, 1917, 1374, 1403, 1470, 1240, 1917, 1545...
2    [hanya, saya, atau, @halobca, klikbca, bisnis,...                           [O, O, O, B, B, I, O, B]        [857, 1917, 249, 130, 1130, 439, 1332, 767]
3    [teller, bank, bca, ini, menanyakan, kabar, sa...                        [O, O, O, O, O, O, O, B, O]  [2190, 288, 317, 918, 1365, 983, 1917, 2081, 1...
4    [bca, senantiasa, menjaga, rahasia, data, cust...                                 [B, O, B, B, B, I]                  [317, 1983, 1458, 1824, 575, 551]
..                                                 ...                                                ...                                                ...
794  [hi, cs, kenapa, pelayanan, di, bca, kodya, te...  [O, B, O, B, O, B, I, I, I, I, I, I, O, B, O, ...  [873, 540, 1077, 1657, 598, 317, 1136, 2175, 2...
795  [walau, sudah, prioritas, tetap, saja, antreny...            [O, O, B, O, O, B, B, O, O, B, O, O, O]  [2374, 2107, 1791, 2281, 1885, 231, 1183, 282,...
796  [selama, menggunakan, layanan, e-channel, bca,...  [O, B, O, B, I, O, O, O, B, I, B, I, B, B, B, ...  [1966, 1427, 1198, 746, 317, 1520, 2288, 1341,...
797  [mau, menabung, mau, simpan, uang, atau, pun, ...         [O, B, O, B, B, O, O, B, B, I, O, O, O, B]  [1306, 1361, 1306, 2055, 2335, 249, 1817, 1491...
798  [toko, daring, juga, kebanyakan, pakai, bca, m...  [B, I, O, O, B, I, I, O, O, O, B, B, I, I, B, ...  [2297, 569, 976, 1037, 1609, 317, 1238, 258, 1...

[799 rows x 3 columns]
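
To see what each step of the explode / map / groupby chain does, here is a small self-contained sketch on toy data. The vocabulary and sentences are invented for illustration, and only the `token` column is exploded, which is enough for the lookup:

```python
import pandas as pd

# Hypothetical toy vocabulary: word -> integer ID.
vocab = pd.Series([0, 1, 2, 3], index=["bca", "kartu", "kredit", "ribet"])

# Hypothetical toy sentences in the same shape as the real DataFrame.
df = pd.DataFrame({
    "token": [["kartu", "kredit", "bca"], ["bca", "ribet"]],
    "tag": [["B", "I", "I"], ["B", "B"]],
})

# explode() yields one row per token and repeats the original row label.
exploded = df["token"].explode()

# map() looks each token up in the vocabulary Series by its index.
ids = exploded.map(vocab)

# groupby(level=0) gathers the IDs back into one list per original row.
df["encoded"] = ids.groupby(level=0).agg(list)

print(df["encoded"].tolist())
```

Two caveats, as far as I know: tokens missing from the vocabulary come back as NaN (which turns the IDs into floats), and `map` with a Series expects the vocabulary index to be unique, so duplicate words would need to be dropped first.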