Home > Software engineering >  How to filter values in columns of df that contains all substring in passed list in Pandas DataFrame
How to filter values in columns of df that contains all substring in passed list in Pandas DataFrame

Time:12-09

Any suggestion how to stay only values that contains all substring that in list for any column?:

import pandas as pd

df = pd.DataFrame(
    [
    [1, 'foollish', 'molish'], 
    [2, 'barnylishon', 'chacha'], 
    [3, 'bazon', 'gazon'],
    ], 
    columns=['id', 'value_1', 'value_2'])

print (df)
    
search_list = ['a','on']

print ("Desire result for value_1 column:")

df_desire_result = pd.DataFrame(
    [
    [1, 'barnylishon', 'chacha'], 
    [2, 'bazon', 'gazon'], 
    ], 
    columns=['id', 'value_1', 'value_2'])

print (df_desire_result)

CodePudding user response:

From this statement that contains all substring that in list for any column? : I figure that if any column in a row has all the substrings in the search_list then retain that row and drop remaining rows.

Then IIUC:

cols = df.columns.drop('id').tolist()
m = df[cols].apply(lambda x: all([any(x.str.contains(s)) for s in search_list]), axis=1)
out = df[m]

print(out):

   id      value_1 value_2
1   2  barnylishon  chacha
2   3        bazon   gazon

CodePudding user response:

You can use:

# craft regex pattern
import re
pattern = '|'.join(map(re.escape, search_list))
# 'a|on'

out = df.loc[(df
   # extract words from all cells
   .filter(like='value')
   .stack()
   .str.extractall(fr'({pattern})')[0]
   # ensure that each word is present at least once per row
   .groupby(level=0).nunique()
   .eq(len(search_list))
   .reindex(df.index, fill_value=False)
 )]

print(out)

Output:

   id      value_1 value_2
1   2  barnylishon  chacha
2   3        bazon   gazon
  • Related