I have a dataframe (df) with a column "x" (type: string). I need to drop the rows with fewer than 10 characters, unless the text contains any of the words in the list below.
I need something like this:
list = ['caro', 'custo', 'valor']
if df['x'] contain any word from the list:
    return df
else:
    return df[df['x'].apply(lambda x: len(str(x)) >10)]
CodePudding user response:
Try this:
df[df.x.apply(lambda x: len(x) >= 10 or any(word in words for word in x.split()))]
Here words is your list of keywords. Please do not use the word list as a variable name, because it shadows the built-in list type.
CodePudding user response:
You could first create a function that checks len() >= 10
and looks for the words from the list, and then use apply()
to filter the rows, without using if/else:
your_words = ['caro', 'custo', 'valor']
def check(text):
    return (len(text) >= 10) or any(word in text for word in your_words)
mask = df['x'].apply(check)
selected_df = df[ mask ]
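Note that `word in text` is a substring test, so a row such as 'carona' would be kept because it contains 'caro'. If only whole words should count, one possible variant is sketched below. The function name check_whole_words and the sample rows are made up for illustration, and splitting on whitespace is an assumption (a word followed by punctuation, like 'caro,', would not match).

```python
import pandas as pd

your_words = ['caro', 'custo', 'valor']

def check_whole_words(text):
    # Hypothetical variant of check(): keep the row if it is long enough,
    # or if one of its whitespace-separated tokens is in your_words.
    return len(text) >= 10 or any(token in your_words for token in text.split())

# Illustrative data, not from the question
df = pd.DataFrame({'x': ['carona', 'caro', 'o custo', 'ABC']})

# Substring test, as in check() above: 'carona' is kept
substring_mask = df['x'].apply(
    lambda text: len(text) >= 10 or any(word in text for word in your_words)
)

# Whole-word test: 'carona' is dropped
whole_word_mask = df['x'].apply(check_whole_words)

print(df[substring_mask])
print(df[whole_word_mask])
```

Which of the two behaviours is right depends on what "has any of the words" means for your data.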
You can also convert the list to the string caro|custo|valor
and use it as a regex in `.str.contains(regex)`:
regex = '|'.join(your_words)
#print(regex)
mask1 = df['x'].str.contains(regex)
mask2 = df['x'].str.len() >= 10
selected_df = df[ (mask1 | mask2) ]
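The plain '|'.join() works for these three words, but it may need some hardening in other cases. The sketch below assumes three things that are not in the question: the words might contain regex metacharacters (hence re.escape), only whole words should match (hence \b), and the column may contain missing values (hence na=False). The sample rows are illustrative.

```python
import re
import pandas as pd

your_words = ['caro', 'custo', 'valor']

# Escape each word and wrap the alternation in word boundaries.
# The group is non-capturing, so pandas does not warn about match groups.
pattern = r'\b(?:' + '|'.join(re.escape(word) for word in your_words) + r')\b'

# Illustrative data with a missing value
df = pd.DataFrame({'x': ['ABC', 'carona', 'a valor', None, 'very long text']})

# na=False treats missing values as "no match" instead of propagating them
mask1 = df['x'].str.contains(pattern, regex=True, na=False)
# The length of a missing value is missing, and the comparison then gives False
mask2 = df['x'].str.len() >= 10

selected_df = df[mask1 | mask2]
print(selected_df)
```

If substring matches are what you want, drop the \b parts and keep only the escaping and na=False.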
Minimal working example
import pandas as pd
data = {
    'x': ['ABC','caro','very long text', 'a valor'],
}
df = pd.DataFrame(data)
your_words = ['caro', 'custo', 'valor']
# --- version 1 ---
def check(text):
    return (len(text) >= 10) or any(word in text for word in your_words)
mask = df['x'].apply(check)
selected_df = df[ mask ]
print(selected_df)
# --- version 2 ---
regex = '|'.join(your_words)
print('regex:', regex)
mask1 = df['x'].str.contains(regex)
mask2 = df['x'].str.len() >= 10
selected_df = df[ (mask1 | mask2) ]
print(selected_df)
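As a quick sanity check, the two versions should select the same rows on this example data. A small sketch, where the names mask_apply, mask_regex, same_rows and kept are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'x': ['ABC', 'caro', 'very long text', 'a valor']})
your_words = ['caro', 'custo', 'valor']

# Version 1: apply() with a Python-level check
mask_apply = df['x'].apply(
    lambda text: len(text) >= 10 or any(word in text for word in your_words)
)

# Version 2: vectorized string methods
mask_regex = df['x'].str.contains('|'.join(your_words)) | (df['x'].str.len() >= 10)

# Compare the masks and list the rows that survive the filter
same_rows = mask_apply.tolist() == mask_regex.tolist()
kept = df[mask_regex]['x'].tolist()
print(same_rows, kept)
```

Only 'ABC' is dropped here: it is shorter than 10 characters and contains none of the words.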