I have a dataset of tokenized tuples. My pre-processing steps were: first, tokenize the words, then normalize slang words. The problem is that a normalized slang term can expand into a phrase containing whitespace, so I need a second round of tokenization, but I can't figure out how to do it. Here's an example of my data.
  firstTokenization                     normalized               secondTokenization
0     [yes, no, cs]    [yes, no, customer service]     [yes, no, customer, service]
1             [nlp]  [natural language processing]  [natural, language, processing]
2         [no, yes]                      [no, yes]                        [no, yes]
I am trying to figure out a way to generate the secondTokenization column. Here's the code I'm currently working on...
import pandas as pd
from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer()

def tokenization(text):
    return tokenizer.tokenize(text.split())

df['firstTokenization'] = df['content'].apply(lambda x: tokenization(x.lower()))

# Load the slang -> normalized form mapping (first column: slang, second column: replacement)
normalizad_word = pd.read_excel('normalisasi.xlsx')
normalizad_word_dict = {}
for index, row in normalizad_word.iterrows():
    if row[0] not in normalizad_word_dict:
        normalizad_word_dict[row[0]] = row[1]

# Replace each token by its normalized form if it appears in the dictionary
def normalized_term(document):
    return [normalizad_word_dict[term] if term in normalizad_word_dict else term for term in document]

df['normalized'] = df['firstTokenization'].apply(normalized_term)
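To illustrate the problem (with made-up dictionary contents, since I can't share normalisasi.xlsx): after normalization, some list entries are multi-word strings, which is why I need to re-tokenize.

# Hypothetical dictionary contents, just for illustration
normalizad_word_dict = {'cs': 'customer service', 'nlp': 'natural language processing'}
print(normalized_term(['yes', 'no', 'cs']))
# ['yes', 'no', 'customer service']  <- 'customer service' is a single token containing a space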
CodePudding user response:
This works if your normalized column does not contain nested lists.
Setup:
import pandas as pd
df = pd.DataFrame({'firstTokenization': [['yes', 'no', 'cs'],
                                         ['nlp'],
                                         ['no', 'yes']],
                   'normalized': [['yes', 'no', 'customer service'],
                                  ['natural language processing'],
                                  ['no', 'yes']],
                   })
print(df)
Output:
  firstTokenization                     normalized
0     [yes, no, cs]    [yes, no, customer service]
1             [nlp]  [natural language processing]
2         [no, yes]                      [no, yes]
The first apply splits each token on spaces, producing nested lists; the second flattens those nested lists with the usual nested list comprehension.
df['secondTokenization'] = (
    df['normalized']
    .apply(lambda x: [token.split(' ') for token in x])               # split each token on spaces -> nested lists
    .apply(lambda y: [token for sublist in y for token in sublist])   # flatten the nested lists
)
print(df)
Output:
  firstTokenization                     normalized  \
0     [yes, no, cs]    [yes, no, customer service]
1             [nlp]  [natural language processing]
2         [no, yes]                      [no, yes]

                 secondTokenization
0      [yes, no, customer, service]
1   [natural, language, processing]
2                         [no, yes]
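If you'd rather do it in a single pass, here is an equivalent sketch with a small helper function (same result; retokenize is just a name I picked):

def retokenize(tokens):
    # Split every token on whitespace and flatten everything into one list of words.
    return [piece for token in tokens for piece in token.split()]

df['secondTokenization'] = df['normalized'].apply(retokenize)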