Drop rows which contain specific words (not as a substring)


I have the following data frame, df:

id     text
1      'a little table'
2      'blue lights'
3      'food and drink'
4      'build an atom'
5      'fast animals' 

and a list of stop words, that is:

sw = ['a', 'an', 'and']

I want to delete the lines that contain at least one of the stop words (as words themselves, not as substrings). That is, the result I would like is:

id     text
2      'blue lights'
5      'fast animals' 

I was trying with:

df[~df['text'].str.contains('|'.join(sw), regex=True, na=False)]

but it doesn't work, because this way str.contains matches substrings, and a is a substring of every text except 'blue lights'. How should I change my line of code?
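
To illustrate the substring behaviour, here is a minimal sketch using only the standard re module (str.contains with regex=True searches in the same way). The texts list is a stand-in for the text column, and substring_hits is a name made up for this example:

```python
import re

# Stand-in for the 'text' column of df
texts = ['a little table', 'blue lights', 'food and drink',
         'build an atom', 'fast animals']
sw = ['a', 'an', 'and']

# Same pattern as in the attempt above: 'a|an|and', with no word boundaries
pattern = '|'.join(sw)

# re.search finds the pattern anywhere in the string, including inside words
substring_hits = {t: bool(re.search(pattern, t)) for t in texts}
for text, hit in substring_hits.items():
    print(f'{text!r}: {hit}')
```

Only 'blue lights' comes out as False, since it is the only text without the letter a.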

CodePudding user response:

li = ['a', 'an', 'and']
for i in li:
    for k in df.index:
        # split() yields whole words, so 'a' does not match inside 'table'
        if i in df.loc[k, 'text'].split():
            df = df.drop(k)

CodePudding user response:

If you want to use str.contains, you could try as follows:

import pandas as pd

data = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 
        'text': {0: "'a little table'", 1: "'blue lights'", 
                 2: "'food and drink'", 3: "'build an atom'", 
                 4: "'fast animals'"}}
df = pd.DataFrame(data)

sw = ['a', 'an', 'and']
res = df[~df['text'].str.contains(fr'\b(?:{"|".join(sw)})\b', 
                                  regex=True, na=False)]


   id            text
1   2   'blue lights'
4   5  'fast animals'

In the regex pattern, \b asserts the position at a word boundary, while ?: at the start of the group (...) makes it non-capturing. Strictly speaking, you could do without ?:, but it suppresses a UserWarning: "This pattern ... has match groups etc.".
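
If the stop words could ever contain regex metacharacters (a dot, a plus sign and so on), one possible precaution is to escape them before joining. This is only a sketch: it assumes the text column holds plain strings without the literal quotes used above, and the names pattern and mask are chosen for illustration:

```python
import re
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'text': ['a little table', 'blue lights', 'food and drink',
                            'build an atom', 'fast animals']})
sw = ['a', 'an', 'and']

# re.escape guards against stop words that contain regex metacharacters
# (not needed for this particular list; shown as a precaution)
pattern = r'\b(?:' + '|'.join(map(re.escape, sw)) + r')\b'

# True where the text contains at least one stop word as a whole word
mask = df['text'].str.contains(pattern, regex=True, na=False)
res = df[~mask]
print(res)
```

For the list above, the escaped pattern is identical to the unescaped one, so the result is the same two rows.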

CodePudding user response:

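You can also use apply() with a custom function:

def string_present(List, string):
    # pad both sides with spaces so that only whole words match
    return any(f' {ele} ' in f' {string} ' for ele in List)

df['status'] = df['text'].apply(lambda row: string_present(sw, row))
# keep the rows without a stop word
df.loc[~df['status'], ['id', 'text']]

The output is: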

   id          text
1   2   blue lights
4   5  fast animals

CodePudding user response:

Here is one way to do it:

# '|'.join(sw) : joins the stop words with |, forming an OR condition
# \\b : adds a word boundary on each side of the group
# build the pattern, then use loc to keep the rows where it is not found
df.loc[~df['text'].str.contains('\\b(' + '|'.join(sw) + ')\\b')]

# alternatively, extract the first matching stop word and keep the rows with none
df[df['text'].str.extract('\\b(' + '|'.join(sw) + ')\\b')[0].isna()]
    id  text
1   2   'blue lights'
4   5   'fast animals'

CodePudding user response:

Another possible solution, which works as follows:

  1. Split each string by space, producing a list of words

  2. Check whether each of those lists of words is disjoint with sw.

  3. Use the result for boolean indexing.

df[df['text'].str.split(' ').map(lambda x: set(x).isdisjoint(sw))]


   id          text
1   2   blue lights
4   5  fast animals
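
The three steps can also be written out one at a time, which makes the intermediate values easier to inspect. This is a sketch, assuming the text column holds plain strings without literal quotes; the names words and keep are made up for this example:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'text': ['a little table', 'blue lights', 'food and drink',
                            'build an atom', 'fast animals']})
sw = ['a', 'an', 'and']

# Step 1: split each string on spaces, producing a list of words per row
words = df['text'].str.split(' ')

# Step 2: True where a row's words share no element with sw
keep = words.map(lambda x: set(x).isdisjoint(sw))

# Step 3: boolean indexing keeps only the rows flagged True
res = df[keep]
print(res)
```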