NLP - drop row with specific feature in a dataframe

I have a dataframe (df) with a column "x" (type: string). I need to drop the rows with fewer than 10 characters, unless the text contains any of the words in the list below.

I need something like this:

list = ['caro', 'custo', 'valor']
if df['x'] contain any word from the list:
   return df
else:
    return df[df['x'].apply(lambda x: len(str(x)) >10)]

CodePudding user response：

Try this:

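# keep rows with at least 10 characters, or containing any of the listed words
df[df.x.apply(lambda x: len(x) >= 10 or any(word in words for word in x.split()))]

Also, please do not use list as a variable name: it shadows Python's built-in list type. The line above assumes the list has been renamed to words.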

CodePudding user response：

Maybe first create a function that checks whether len() >= 10 or whether the text contains any word from the list, and then use apply() to build a mask for filtering the rows, without using if/else.

your_words = ['caro', 'custo', 'valor']

def check(text):
    return (len(text) >= 10) or any(word in text for word in your_words)

mask = df['x'].apply(check)

selected_df = df[ mask ]
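
Note that `word in text` is a substring test, so "caro" would also match inside "carona". If only whole words should count, the following is a minimal sketch of one way to do it, assuming words are separated by whitespace and case does not matter. The name check_tokens and the sample data are hypothetical, not from the question.

```python
import pandas as pd

your_words = {'caro', 'custo', 'valor'}  # a set makes membership tests cheap

def check_tokens(text):
    # Convert to str first so non-string values (e.g. NaN) do not raise.
    text = str(text)
    # Split on whitespace and lowercase so "Caro" also counts as a match.
    tokens = set(text.lower().split())
    # Keep the row if it is long enough or shares a token with your_words.
    return len(text) >= 10 or not your_words.isdisjoint(tokens)

# Hypothetical sample data, not from the question.
df = pd.DataFrame({'x': ['ABC', 'Caro', 'carona', 'very long text', 'a valor']})

selected_df = df[df['x'].apply(check_tokens)]
print(selected_df)
```

Punctuation attached to a word (for example "caro,") would not match with this simple split, so strip punctuation first if your text needs it.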

You can also join the list into a single string, caro|custo|valor, and use it as a regex in `.str.contains(regex)`, combined with a second mask for the length:

regex = '|'.join(your_words)
#print(regex)

mask1 = df['x'].str.contains(regex)
mask2 = df['x'].str.len() >= 10

selected_df = df[ (mask1 | mask2) ]
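
If the words may contain regex metacharacters, or the column may contain missing values, a slightly more defensive variant could look like the sketch below. It assumes case-insensitive matching is wanted. The extra word 'r$' and the sample data are hypothetical and only there to show the effect of escaping.

```python
import re
import pandas as pd

your_words = ['caro', 'custo', 'valor']
# Hypothetical extra word containing a regex metacharacter ($).
words = your_words + ['r$']

# Hypothetical sample data, including a missing value.
df = pd.DataFrame({'x': ['ABC', 'CARO', 'very long text', 'a valor', 'R$ 5', None]})

# re.escape makes every word match literally, so '$' is not read as "end of string".
regex = '|'.join(re.escape(word) for word in words)

# na=False counts missing values as "no match"; case=False ignores letter case.
mask1 = df['x'].str.contains(regex, case=False, na=False)
# Missing values have no length; treat them as length 0.
mask2 = df['x'].str.len().fillna(0) >= 10

selected_df = df[mask1 | mask2]
print(selected_df)
```

Without re.escape, the pattern r$ would mean "an r at the end of the string" and would not match "R$ 5".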

Minimal working example

import pandas as pd

data = {
    'x': ['ABC','caro','very long text', 'a valor'], 
}

df = pd.DataFrame(data)

your_words = ['caro', 'custo', 'valor']

# --- version 1 ---

def check(text):
    return (len(text) >= 10) or any(word in text for word in your_words)

mask = df['x'].apply(check)

selected_df = df[ mask ]
print(selected_df)

# --- version 2 ---

regex = '|'.join(your_words)
print('regex:', regex)

mask1 = df['x'].str.contains(regex)
mask2 = df['x'].str.len() >= 10
selected_df = df[ (mask1 | mask2) ]

print(selected_df)