I have the following data frame, df:
id text
1 'a little table'
2 'blue lights'
3 'food and drink'
4 'build an atom'
5 'fast animals'
and a list of stop words, that is:
sw = ['a', 'an', 'and']
I want to delete the lines that contain at least one of the stop words (as words themselves, not as substrings). That is, the result I would like is:
id text
2 'blue lights'
5 'fast animals'
I was trying with:
df[~df['text'].str.contains('|'.join(sw), regex=True, na=False)]
but it doesn't work, because this way the pattern matches substrings, and a is a substring of every text except 'blue lights'. How should I change my line of code?
CodePudding user response:
li = ['a', 'an', 'and']
for i in li:
    for k in df.index:
        # drop the row if the stop word is one of its whitespace-separated words
        if i in df.text[k].split():
            df.drop(k, inplace=True)
CodePudding user response:
If you want to use str.contains, you could try as follows:
import pandas as pd
data = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
        'text': {0: "'a little table'", 1: "'blue lights'",
                 2: "'food and drink'", 3: "'build an atom'",
                 4: "'fast animals'"}}
df = pd.DataFrame(data)
sw = ['a', 'an', 'and']
res = df[~df['text'].str.contains(fr'\b(?:{"|".join(sw)})\b',
                                  regex=True, na=False)]
print(res)
id text
1 2 'blue lights'
4 5 'fast animals'
In the regex pattern, \b asserts position at a word boundary, while ?: at the start of the pattern between (...) creates a non-capturing group. Strictly speaking, you could do without ?:, but it suppresses a UserWarning: "This pattern ... has match groups etc.".
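To see the word-boundary behaviour in isolation, here is a minimal sketch that uses only Python's built-in re module instead of pandas. The name build_pattern is a hypothetical helper, and the re.escape call is an addition that is not part of the answer above; it only matters if a stop word contains regex metacharacters.

```python
import re

sw = ['a', 'an', 'and']

def build_pattern(words):
    # Hypothetical helper: escape each word so characters such as '.' or '+'
    # are matched literally, then wrap the alternation in \b ... \b.
    alternation = "|".join(re.escape(w) for w in words)
    return re.compile(rf"\b(?:{alternation})\b")

pattern = build_pattern(sw)

texts = ['a little table', 'blue lights', 'food and drink',
         'build an atom', 'fast animals']

# Keep only the texts in which no stop word occurs as a whole word.
kept = [t for t in texts if not pattern.search(t)]
print(kept)  # ['blue lights', 'fast animals']
```

The resulting pattern string (pattern.pattern) should be usable as the first argument of str.contains in the same way as the f-string above.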
CodePudding user response:
You can also use apply() with a custom function:
def string_present(List, string):
    # True if any stop word, followed by a space, occurs in the string
    return any(ele + ' ' in string for ele in List)
df['status'] = df['text'].apply(lambda row: string_present(sw, row))
df[df['status'] == False].drop(columns=['status'])
The output is:
id text
1 2 blue lights
4 5 fast animals
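Note that looking for a stop word followed by a space is still a substring test: 'a ' also occurs inside a text such as 'pizza party', and a stop word at the very end of a text is missed. Below is a minimal sketch of a stricter check, written in plain Python so it runs without pandas. The name has_stop_word and the two extra sample texts are hypothetical and do not come from the question.

```python
sw = ['a', 'an', 'and']

def string_present(words, text):
    # Substring check from the answer above, reproduced for comparison.
    return any(word + ' ' in text for word in words)

def has_stop_word(words, text):
    # Compare against whole whitespace-separated tokens, not substrings.
    tokens = text.split()
    return any(word in tokens for word in words)

# 'pizza ' ends in 'a ', so the substring check reports a false positive.
print(string_present(sw, 'pizza party'))  # True
print(has_stop_word(sw, 'pizza party'))   # False

# A stop word at the end of the text has no trailing space.
print(string_present(sw, 'give me a'))    # False
print(has_stop_word(sw, 'give me a'))     # True
```

Under these assumptions, has_stop_word could be passed to apply() in place of string_present, with the rest of the filtering unchanged.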
CodePudding user response:
Here is one way to do it:
# '|'.join(sw) : creates a string with a |, to form an OR condition
# \\b : adds the word boundary around the capture group
# create a pattern surrounded by the word boundary and then
# filter out what is found using loc
df.loc[~df['text'].str.contains('\\b(' + '|'.join(sw) + ')\\b')]
OR
df[df['text'].str.extract('\\b(' + '|'.join(sw) + ')\\b')[0].isna()]
id text
1 2 'blue lights'
4 5 'fast animals'
CodePudding user response:
Another possible solution, which works as follows:
Split each string by space, producing a list of words.
Check whether each of those lists of words is disjoint with sw.
Use the result for boolean indexing.
df[df['text'].str.split(' ').map(lambda x: set(x).isdisjoint(sw))]
Output:
id text
1 2 blue lights
4 5 fast animals
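The three steps can be traced without pandas. This is only an illustrative sketch in plain Python, assuming the texts contain no surrounding quote characters; the names texts, mask and kept are hypothetical.

```python
sw = ['a', 'an', 'and']
texts = ['a little table', 'blue lights', 'food and drink',
         'build an atom', 'fast animals']

# Step 1: split each text on spaces, giving one list of words per text.
word_lists = [t.split(' ') for t in texts]

# Step 2: a text is kept when its set of words shares nothing with sw.
mask = [set(words).isdisjoint(sw) for words in word_lists]
print(mask)  # [False, True, False, False, True]

# Step 3: use the booleans to select the texts (boolean indexing in pandas).
kept = [t for t, keep in zip(texts, mask) if keep]
print(kept)  # ['blue lights', 'fast animals']
```

The comparison is case-sensitive, so a text starting with 'A' would not be treated as containing the stop word 'a' unless it is lowercased first.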