Remove stop words from sentences and pad sentences from a list of lists in the data frame

Is there an easy way to remove certain (stop) words from sentences in a list of lists in a dataframe column, and to right-pad each sentence to the length of the longest one?

Example:

import pandas as pd

stopwords = ['the', 'a', 'an']
df = pd.DataFrame(data={'sentence': [[['the', 'deer', 'was', 'a', 'tasty', 'meal'],
                                      ['the', 'girl', 'walks'],
                                      ['thanks', 'for', 'all', 'the', 'gifts']]]})
|    | sentence                                                                                                           |
|---:|:-------------------------------------------------------------------------------------------------------------------|
|  0 | [['the', 'deer', 'was', 'a', 'tasty', 'meal'], ['the', 'girl', 'walks'], ['thanks', 'for', 'all', 'the', 'gifts']] |

Expected result:

|    | sentence                            |
|---:|:------------------------------------|
|  0 | ['deer', 'was', 'tasty', 'meal']    |
|  1 | ['girl', 'walks', '<pad>', '<pad>'] |
|  2 | ['thanks', 'for', 'all', 'gifts']   |

CodePudding user response:

Try this:

import numpy as np

# Flatten the nested lists with two explodes, then filter out the stop words.
x = (df['sentence']
       .explode().reset_index(drop=True)
       .explode()
       .pipe(lambda s: s[~s.isin(stopwords)]))

# Length of the longest sentence after filtering.
MAX = x.groupby(level=0).agg(len).max()

# Reindex each sentence to MAX positions, filling the gaps with '<pad>'.
new_df = (x.groupby(level=0)
            .apply(lambda s: s.reset_index(drop=True)
                              .reindex(np.arange(MAX))
                              .fillna('<pad>'))
            .groupby(level=0).agg(list)
            .to_frame())

Output:

>>> new_df
                      sentence
0     [deer, was, tasty, meal]
1  [girl, walks, <pad>, <pad>]
2    [thanks, for, all, gifts]

It uses explode twice to flatten the sub-arrays, then filters out the stop words via pipe. Next, it computes the length of the longest group and reindexes each group to that length. The fill value here is <pad>, but you can change it to whatever you'd like, or drop the fillna call altogether.
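For a single input row like the example, the same filter-then-pad logic can be sketched with plain list comprehensions (a minimal sketch; unlike the groupby approach above, it handles one row of nested lists at a time):

```python
import pandas as pd

stopwords = ['the', 'a', 'an']
df = pd.DataFrame(data={'sentence': [[['the', 'deer', 'was', 'a', 'tasty', 'meal'],
                                      ['the', 'girl', 'walks'],
                                      ['thanks', 'for', 'all', 'the', 'gifts']]]})

# Filter stop words out of every inner sentence.
cleaned = [[w for w in sent if w not in stopwords]
           for sent in df['sentence'].iloc[0]]

# Right-pad each sentence to the length of the longest one.
max_len = max(len(sent) for sent in cleaned)
padded = [sent + ['<pad>'] * (max_len - len(sent)) for sent in cleaned]

new_df = pd.DataFrame({'sentence': padded})
```

This makes the two steps (filter, then pad to the max length) explicit, at the cost of leaving pandas for the inner loop.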

CodePudding user response:

Here is a way using reshaping:

df2 = (df.explode('sentence')
         .assign(group=lambda d: d.groupby(d.index).cumcount())
         .explode('sentence')
         .loc[lambda d: ~d['sentence'].isin(stopwords)]         # filter stop words
         .rename_axis('index')
         .assign(idx=lambda d: d.groupby(['index', 'group']).cumcount())
         .set_index(['group', 'idx'], append=True)
         .unstack('group')        # reshape wide so short sentences get NaNs
         .fillna('<pad>')         # pad the missing words
         .stack('group')          # back to long form
         .groupby(level=[0, 'group']).agg(list)
         .reset_index('group', drop=True)
      )

output:

                      sentence
0     [deer, was, tasty, meal]
0  [girl, walks, <pad>, <pad>]
0    [thanks, for, all, gifts]
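Note that the index repeats the original row label (all 0 here). If you want the sequential 0/1/2 index shown in the expected result, a `reset_index(drop=True)` on the output gives it (here a small frame stands in for the `df2` result above):

```python
import pandas as pd

# Stand-in for the reshaped output, whose index repeats the original row label.
df2 = pd.DataFrame({'sentence': [['deer', 'was', 'tasty', 'meal'],
                                 ['girl', 'walks', '<pad>', '<pad>'],
                                 ['thanks', 'for', 'all', 'gifts']]},
                   index=[0, 0, 0])

df2 = df2.reset_index(drop=True)  # index becomes 0, 1, 2
```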

NB. this solution also works with multiple input rows:

df = pd.concat([df]*3, ignore_index=True)

#                                             sentence
# 0  [[the, deer, was, a, tasty, meal], [the, girl,...
# 1  [[the, deer, was, a, tasty, meal], [the, girl,...
# 2  [[the, deer, was, a, tasty, meal], [the, girl,...

output:

                          sentence
index                             
0         [deer, was, tasty, meal]
0      [girl, walks, <pad>, <pad>]
0        [thanks, for, all, gifts]
1         [deer, was, tasty, meal]
1      [girl, walks, <pad>, <pad>]
1        [thanks, for, all, gifts]
2         [deer, was, tasty, meal]
2      [girl, walks, <pad>, <pad>]
2        [thanks, for, all, gifts]