Is there an easy way to remove certain (stop) words from the sentences in a list of lists in a dataframe column, and right-pad any sentence that is shorter than the maximum length?
Example:
import pandas as pd
stopwords = ['the', 'a', 'an']
df = pd.DataFrame(data={'sentence': [[['the', 'deer', 'was', 'a', 'tasty', 'meal'],
                                      ['the', 'girl', 'walks'],
                                      ['thanks', 'for', 'all', 'the', 'gifts']]]})
| | sentence |
|---:|:-------------------------------------------------------------------------------------------------------------------|
| 0 | [['the', 'deer', 'was', 'a', 'tasty', 'meal'], ['the', 'girl', 'walks'], ['thanks', 'for', 'all', 'the', 'gifts']] |
Expected result:
| | sentence |
|---:|:------------------------------------|
| 0 | ['deer', 'was', 'tasty', 'meal'] |
| 1 | ['girl', 'walks', '<pad>', '<pad>'] |
| 2 | ['thanks', 'for', 'all', 'gifts'] |
CodePudding user response:
Try this:
import numpy as np

x = df['sentence'].explode().reset_index(drop=True).explode().pipe(lambda s: s[~s.isin(stopwords)])
MAX = x.groupby(level=0).agg(len).max()
new_df = (x.groupby(level=0)
           .apply(lambda s: s.reset_index(drop=True).reindex(np.arange(MAX)).fillna('<pad>'))
           .groupby(level=0).agg(list)
           .to_frame())
Output:
>>> new_df
sentence
0 [deer, was, tasty, meal]
1 [girl, walks, <pad>, <pad>]
2 [thanks, for, all, gifts]
It uses `explode` twice to flatten all the sub-arrays, and then filters out the stop words via `pipe`. Then we get the length of the longest group and `reindex` each group to be that long. Note the fill value is `<pad>`, but you can change it to whatever you'd like, or even get rid of the `fillna` call altogether.
CodePudding user response:
Here is a way using reshaping:
df2 = (df.explode('sentence')
         .assign(group=lambda d: d.groupby(d.index).cumcount())
         .explode('sentence')
         .loc[lambda d: ~d['sentence'].isin(stopwords)]  # filter words
         .rename_axis('index')
         .assign(idx=lambda d: d.groupby(['index', 'group']).cumcount())
         .set_index(['group', 'idx'], append=True)
         .unstack('group')  # unstack/stack
         .fillna('<pad>')   # to pad
         .stack('group')    # missing words
         .groupby(level=[0, 'group']).agg(list)
         .reset_index('group', drop=True)
       )
output:
sentence
0 [deer, was, tasty, meal]
0 [girl, walks, <pad>, <pad>]
0 [thanks, for, all, gifts]
NB: this solution also works with multiple input rows:
df = pd.concat([df]*3, ignore_index=True)
# sentence
# 0 [[the, deer, was, a, tasty, meal], [the, girl,...
# 1 [[the, deer, was, a, tasty, meal], [the, girl,...
# 2 [[the, deer, was, a, tasty, meal], [the, girl,...
output:
sentence
index
0 [deer, was, tasty, meal]
0 [girl, walks, <pad>, <pad>]
0 [thanks, for, all, gifts]
1 [deer, was, tasty, meal]
1 [girl, walks, <pad>, <pad>]
1 [thanks, for, all, gifts]
2 [deer, was, tasty, meal]
2 [girl, walks, <pad>, <pad>]
2 [thanks, for, all, gifts]