How to go through each row with pandas apply() and lambda to clean sentence tokens?


My goal is to create a cleaned column of the tokenized sentences within the existing dataframe. The dataset is a pandas dataframe that looks like this:

Index    tokenized_sents
First    [Donald, Trump, just, couldn, t, wish, all, Am]
Second   [On, Friday, ,, it, was, revealed, that]

This is what I have tried:

dataset['cleaned_sents'] = dataset.apply(lambda row: [w for w in row["tokenized_sents"] if len(w) > 2 and w.lower() not in stop_words], axis=1)

My current output is the dataframe without that extra column.

Current output:

    tokenized_sents  \
0  [Donald, Trump, just, couldn, t, wish, all, Am...  

Wanted output:

  tokenized_sents  \
0  [Donald, Trump, just, couldn, wish, all...   

Basically, I want to remove all the stopwords and short words.
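For context, here is a minimal, self-contained version of the attempt above. It assumes NLTK's English stopword list (fetched once with nltk.download('stopwords')); any set of lowercase strings would work the same way:

import pandas as pd
from nltk.corpus import stopwords  # assumes nltk.download('stopwords') has been run

stop_words = set(stopwords.words('english'))

dataset = pd.DataFrame({
    'tokenized_sents': [
        ['Donald', 'Trump', 'just', 'couldn', 't', 'wish', 'all', 'Am'],
        ['On', 'Friday', ',', 'it', 'was', 'revealed', 'that'],
    ]
})

# keep tokens longer than 2 chars whose lowercase form is not a stopword
dataset['cleaned_sents'] = dataset.apply(
    lambda row: [w for w in row['tokenized_sents']
                 if len(w) > 2 and w.lower() not in stop_words],
    axis=1,
)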

CodePudding user response:

Create a sentence index

dataset['gid'] = range(1, dataset.shape[0] + 1)

       tokenized_sents  gid
0  [This, is, a, test]    1
1    [and, this, too!]    2

Then explode the dataframe

clean_df = dataset.explode('tokenized_sents')

  tokenized_sents  gid
0            This    1
0              is    1
0               a    1
0            test    1
1             and    2
1            this    2
1            too!    2

Do all the cleaning on this flat dataframe, then use the gid column to group the tokens back into sentences. Because the filters run vectorized over a single string column, this is typically much faster than a row-wise apply.

clean_df = clean_df[clean_df.tokenized_sents.str.len() > 2]
...

To get it back,

clean_dataset = clean_df.groupby('gid').agg(list)
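Putting the whole pipeline together, a minimal sketch (the sample frame and the small stop_words set here are placeholders, not from the original post):

import pandas as pd

stop_words = {'this', 'is', 'a', 'and', 'too'}  # placeholder stopword set

dataset = pd.DataFrame({
    'tokenized_sents': [['This', 'is', 'a', 'test'], ['and', 'this', 'too!']]
})

# sentence index, so the exploded rows can be grouped back later
dataset['gid'] = range(1, dataset.shape[0] + 1)

# one token per row
clean_df = dataset.explode('tokenized_sents')

# vectorized cleaning: drop short tokens, then stopwords
clean_df = clean_df[clean_df.tokenized_sents.str.len() > 2]
clean_df = clean_df[~clean_df.tokenized_sents.str.lower().isin(stop_words)]

# collapse the surviving tokens back into per-sentence lists
clean_dataset = clean_df.groupby('gid').agg(list)

One caveat: a sentence whose tokens are all filtered out disappears from the grouped result entirely, since groupby only sees the surviving rows.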

CodePudding user response:

Fix your code

dataset['new'] = dataset['tokenized_sents'].map(
    lambda x: [t for t in x if len(t) > 2 and t.lower() not in stop_words]
)
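
As a quick sanity check (exact output depends on your stop_words set):

print(dataset[['tokenized_sents', 'new']])

Mapping over the single column also avoids the per-row overhead of apply(..., axis=1), since the lambda receives the raw token list rather than a full row Series.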