I am doing NLP. I've done tokenization and my data has become tuples. Now I want to select the rows that contain more than 4 items (words). Here is a sample of my dataset.
ID content
0 [yes, no, check, sample, word]
1 [never, you]
2 [non, program, more, link, draft, ask]
3 [able]
4 [to, ask, you, other, man, will]
I want to make a new dataset that contains rows 0, 2, and 4 (the ones with more than 4 items). Here's a sample of it.
ID content
0 [yes, no, check, sample, word]
2 [non, program, more, link, draft, ask]
4 [to, ask, you, other, man, will]
This is the code that I'm working on...
df_new = df.loc[df.content.map(len).ne(>4)]
CodePudding user response:
You can use pandas.Series.gt.
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'ID': [0, 1], 'content': [['yes', 'no', 'check', 'sample', 'word'], ['able']]})
>>> df
ID content
0 0 [yes, no, check, sample, word]
1 1 [able]
>>> df[df.content.map(len).gt(4)]
ID content
0 0 [yes, no, check, sample, word]
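To see why the choice of comparison matters, here is a minimal sketch that compares gt(4), ge(4) and ge(5) on the question's sample data. The extra 4-item row (ID 5) and the variable names are assumptions added for illustration; they are not part of the original dataset.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical 4-item row (ID 5)
# added only to show the boundary case.
df = pd.DataFrame({
    'ID': [0, 1, 2, 3, 4, 5],
    'content': [
        ['yes', 'no', 'check', 'sample', 'word'],
        ['never', 'you'],
        ['non', 'program', 'more', 'link', 'draft', 'ask'],
        ['able'],
        ['to', 'ask', 'you', 'other', 'man', 'will'],
        ['exactly', 'four', 'items', 'here'],
    ],
})

lengths = df['content'].map(len)  # number of items in each list

more_than_4 = df[lengths.gt(4)]   # strictly more than 4 items
at_least_4 = df[lengths.ge(4)]    # 4 or more items, so the 4-item row is kept
at_least_5 = df[lengths.ge(5)]    # same rows as gt(4), since lengths are integers

print(more_than_4['ID'].tolist())
print(at_least_4['ID'].tolist())
```

The filtered frames keep the original index labels, which matches the desired output in the question (rows 0, 2 and 4).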
CodePudding user response:
You can use ge (greater than or equal to), not ne, as follows:
import pandas as pd
df = pd.DataFrame({
    'content': [
        ['yes', 'no', 'check', 'sample', 'word'],
        ['never', 'you'],
        ['non', 'program', 'more', 'link', 'draft', 'ask'],
        ['able'],
        ['to', 'ask', 'you', 'other', 'man', 'will'],
    ],
})
# "More than 4 items" means 5 or more, so compare with ge(5);
# ge(4) would also keep rows that have exactly 4 items.
df_new = df.loc[df.content.map(len).ge(5)]
print(df_new)
"""
content
0 [yes, no, check, sample, word]
2 [non, program, more, link, draft, ask]
4 [to, ask, you, other, man, will]
"""
For more information, see: https://pandas.pydata.org/docs/reference/api/pandas.Series.ge.html
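As an alternative, the same filter can be written with Series.str.len(), which returns the length of each element and also works when the elements are lists. This is only a sketch on a shortened version of the sample data; the name mask is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'content': [
        ['yes', 'no', 'check', 'sample', 'word'],
        ['never', 'you'],
        ['able'],
    ],
})

# Series.str.len() gives the number of items in each list.
mask = df['content'].str.len() > 4

# Boolean indexing keeps only the rows where the mask is True.
df_new = df[mask]
print(df_new)
```

Whether to use map(len) or str.len() is mostly a matter of taste here; both build a Series of lengths that is then compared against the threshold.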