Select tuples based on number of items

I am doing NLP. I've done tokenization and my data has become tuples. Now I want to select the rows that contain more than 4 items (words). Here is a sample of my dataset.

ID                                content
 0         [yes, no, check, sample, word]
 1                           [never, you]
 2 [non, program, more, link, draft, ask]
 3                                 [able]
 4       [to, ask, you, other, man, will]

I want to make a new data set that contains rows 0, 2, and 4 (the ones with more than 4 items). Here is the expected result.

ID                                content
 0         [yes, no, check, sample, word]
 2 [non, program, more, link, draft, ask]
 4       [to, ask, you, other, man, will]

This is the code that I'm working on...

df_new = df.loc[df.content.map(len).ne(>4)]

CodePudding user response：

You can use pandas.Series.gt (greater than) on the length of each row's content.

>>> import pandas as pd
>>> 
>>> df = pd.DataFrame({'ID': [0, 1], 'content': [['yes', 'no', 'check', 'sample', 'word'], ['able']]})
>>> df
   ID                         content
0   0  [yes, no, check, sample, word]
1   1                          [able]
>>> df[df.content.map(len).gt(4)]
   ID                         content
0   0  [yes, no, check, sample, word]
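The same filter can also be written with a plain comparison operator. The following is a minimal sketch, assuming the content column holds Python lists as in the question's sample; it uses Series.str.len(), which returns the length of list elements as well as strings, and the names lengths and mask are only illustrative.

```python
import pandas as pd

# Full sample from the question (assumed to be lists of tokens).
df = pd.DataFrame({
    'ID': [0, 1, 2, 3, 4],
    'content': [
        ['yes', 'no', 'check', 'sample', 'word'],
        ['never', 'you'],
        ['non', 'program', 'more', 'link', 'draft', 'ask'],
        ['able'],
        ['to', 'ask', 'you', 'other', 'man', 'will'],
    ],
})

# Number of items in each row's list.
lengths = df['content'].str.len()

# Boolean mask: True where a row has strictly more than 4 items.
# This is equivalent to lengths.gt(4).
mask = lengths > 4

# Boolean indexing keeps only the rows where the mask is True.
df_new = df[mask]
print(df_new)
```

The original index labels (0, 2, 4) are kept; calling df_new.reset_index(drop=True) would renumber them if that is preferred.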

CodePudding user response：

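You can use ge (greater than or equal to) instead of ne. Since you want strictly more than 4 items, compare against 5 (ge(4) would also keep rows with exactly 4 items):

import pandas as pd

df = pd.DataFrame({
    'content': [
        ['yes', 'no', 'check', 'sample', 'word'],
        ['never', 'you'],
        ['non', 'program', 'more', 'link', 'draft', 'ask'],
        ['able'],
        ['to', 'ask', 'you', 'other', 'man', 'will']],
})

# Keep rows whose list has at least 5 items, i.e. more than 4.
df_new = df.loc[df.content.map(len).ge(5)]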

print(df_new)
"""
                                  content
0          [yes, no, check, sample, word]
2  [non, program, more, link, draft, ask]
4        [to, ask, you, other, man, will]
"""

For more information, see: https://pandas.pydata.org/docs/reference/api/pandas.Series.ge.html
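The choice between ge and gt only shows up when a row has exactly 4 items, which the question's sample does not include. Here is a small sketch of the difference; the 4-item row is a hypothetical addition for illustration and is not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    'content': [
        ['yes', 'no', 'check', 'sample', 'word'],  # 5 items
        ['one', 'two', 'three', 'four'],           # 4 items (hypothetical row)
        ['able'],                                  # 1 item
    ],
})

# Number of items in each row's list.
lengths = df['content'].map(len)

# ge(4): "4 or more" keeps the 4-item row as well.
at_least_4 = df.loc[lengths.ge(4)]

# gt(4): "more than 4" drops the 4-item row.
more_than_4 = df.loc[lengths.gt(4)]

print(at_least_4)
print(more_than_4)
```

For integer lengths, gt(4) and ge(5) select the same rows, so either works for "more than 4 items".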