I have a dataset of tokenized tuples. My pre-processing steps were: first, tokenize the words, then normalize slang words. The problem is that a normalized slang term can expand into a phrase containing whitespace, so I need a second round of tokenization, but I can't figure out how to do it. Here's an example of my data.
  firstTokenization                     normalized               secondTokenization
0     [yes, no, cs]    [yes, no, customer service]     [yes, no, customer, service]
1             [nlp]  [natural language processing]  [natural, language, processing]
2         [no, yes]                      [no, yes]                        [no, yes]
I am trying to figure out a way to generate the secondTokenization column. Here's the code I'm currently working on...
import pandas as pd
from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer()

def tokenization(text):
    return tokenizer.tokenize(text.split())

df['firstTokenization'] = df['content'].apply(lambda x: tokenization(x.lower()))

# Load the slang -> normalized form mapping (first column: slang, second column: replacement)
normalizad_word = pd.read_excel('normalisasi.xlsx')
normalizad_word_dict = {}
for index, row in normalizad_word.iterrows():
    if row[0] not in normalizad_word_dict:
        normalizad_word_dict[row[0]] = row[1]

# Replace each token by its normalized form if it appears in the dictionary
def normalized_term(document):
    return [normalizad_word_dict[term] if term in normalizad_word_dict else term for term in document]

df['normalized'] = df['firstTokenization'].apply(normalized_term)
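To illustrate the problem (with made-up dictionary contents, since I can't share normalisasi.xlsx): after normalization, some list entries are multi-word strings, which is why I need to re-tokenize.

# Hypothetical dictionary contents, just for illustration
normalizad_word_dict = {'cs': 'customer service', 'nlp': 'natural language processing'}
print(normalized_term(['yes', 'no', 'cs']))
# ['yes', 'no', 'customer service']  <- 'customer service' is a single token containing a space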
CodePudding user response:
This works if your normalized column does not contain nested lists.
Setup:
import pandas as pd
df = pd.DataFrame({'firstTokenization': [['yes', 'no', 'cs'],
                                         ['nlp'],
                                         ['no', 'yes']],
                   'normalized': [['yes', 'no', 'customer service'],
                                  ['natural language processing'],
                                  ['no', 'yes']],
                   })
print(df)
Output:
  firstTokenization                     normalized
0     [yes, no, cs]    [yes, no, customer service]
1             [nlp]  [natural language processing]
2         [no, yes]                      [no, yes]
The first apply splits each token on spaces, producing nested lists; the second flattens those nested lists with the usual nested list comprehension.
df['secondTokenization'] = (
    df['normalized']
    .apply(lambda x: [token.split(' ') for token in x])               # split each token on spaces -> nested lists
    .apply(lambda y: [token for sublist in y for token in sublist])   # flatten the nested lists
)
print(df)
Output:
  firstTokenization                     normalized  \
0     [yes, no, cs]    [yes, no, customer service]
1             [nlp]  [natural language processing]
2         [no, yes]                      [no, yes]

                 secondTokenization
0      [yes, no, customer, service]
1   [natural, language, processing]
2                         [no, yes]
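If you'd rather do it in a single pass, here is an equivalent sketch with a small helper function (same result; retokenize is just a name I picked):

def retokenize(tokens):
    # Split every token on whitespace and flatten everything into one list of words.
    return [piece for token in tokens for piece in token.split()]

df['secondTokenization'] = df['normalized'].apply(retokenize)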