Remove rows in dataframe that do not contain ALL items in a list

Time:06-06

I have a list of letters:

letters = ['E', 'H', 'T', 'D']

I have a dataframe with the following rows:

    letter_1 letter_2 letter_3 letter_4 letter_5   word
0        D        E        B        U        T    DEBUT
1        D        E        B        U        G    DEBUG
2        B        E        G        E        T    BEGET
3        D        E        P        T        H    DEPTH
4        D        U        V        E        T    DUVET

I am trying to filter out all rows that do not contain ALL of the items in the letters list.
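For anyone who wants to reproduce the frame above, here is a minimal sketch that rebuilds it from the five words. The `words` list and the way the `letter_N` columns are generated are assumptions for illustration; the question does not show how the original frame was created.

```python
import pandas as pd

letters = ['E', 'H', 'T', 'D']
words = ['DEBUT', 'DEBUG', 'BEGET', 'DEPTH', 'DUVET']  # assumed input

# One column per character, named letter_1 .. letter_5, plus the full word.
df = pd.DataFrame([list(w) for w in words],
                  columns=[f'letter_{i}' for i in range(1, 6)])
df['word'] = words
print(df)
```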

CodePudding user response:

You can use set operations:

df[df.filter(like='letter').agg(set, axis=1) >= set(letters)]

or, using the "word" column:

df[df['word'].agg(set) >= set(letters)]

output:

  letter_1 letter_2 letter_3 letter_4 letter_5   word
3        D        E        P        T        H  DEPTH
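The `>=` operator on Python sets tests "is a superset of", which is what makes the one-liners above work. As a rough sketch, the same check can be spelled out per row with `map` and an explicit lambda, so it does not depend on how `agg` dispatches the callable. The names `required`, `mask` and `result` are only illustrative.

```python
import pandas as pd

letters = ['E', 'H', 'T', 'D']
df = pd.DataFrame({'word': ['DEBUT', 'DEBUG', 'BEGET', 'DEPTH', 'DUVET']})

required = set(letters)
# set(w) is the set of distinct characters in the word;
# >= is True only when every required letter is present.
mask = df['word'].map(lambda w: set(w) >= required)
result = df[mask]
print(result)
```

Because sets drop duplicates, this checks presence only; it does not count how many times a letter occurs.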

CodePudding user response:

Another approach using NumPy broadcasting (this performs all comparisons and ensures there is at least one match for each letter):

import numpy as np

# compare every cell with every letter: shape (n_letters, n_rows, n_cols)
m = (df.filter(like='letter').to_numpy() == np.array(letters)[:, None, None]
     ).any(2).all(0)
df[m]

output:

  letter_1 letter_2 letter_3 letter_4 letter_5   word
3        D        E        P        T        H  DEPTH
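To make the broadcasting easier to follow, the sketch below splits the expression into named steps and notes the array shape at each one. The intermediate names (`cells`, `wanted`, `eq`, `per_letter`) are made up for illustration, and the frame is rebuilt from an assumed `words` list.

```python
import numpy as np
import pandas as pd

letters = ['E', 'H', 'T', 'D']
words = ['DEBUT', 'DEBUG', 'BEGET', 'DEPTH', 'DUVET']  # assumed input
df = pd.DataFrame([list(w) for w in words],
                  columns=[f'letter_{i}' for i in range(1, 6)])
df['word'] = words

cells = df.filter(like='letter').to_numpy()   # (5, 5): rows x letter columns
wanted = np.array(letters)[:, None, None]     # (4, 1, 1): one slice per letter
eq = cells == wanted                          # (4, 5, 5): letter x row x column
per_letter = eq.any(axis=2)                   # (4, 5): is the letter in the row?
m = per_letter.all(axis=0)                    # (5,): are all letters in the row?
print(df[m])
```

The intermediate array has `len(letters) * n_rows * n_cols` elements, so memory use grows with the length of the list.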

CodePudding user response:

Another option is to use numpy.in1d (deprecated since NumPy 2.0; numpy.isin is a drop-in replacement here):

df[df.word.apply(lambda x: np.in1d(letters, list(x)).all())]

output:

  letter_1 letter_2 letter_3 letter_4 letter_5   word
3        D        E        P        T        H  DEPTH
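A self-contained variant of the same idea, written with `np.isin`, might look like the sketch below. It assumes only the `word` column is needed; `mask` and `result` are illustrative names.

```python
import numpy as np
import pandas as pd

letters = ['E', 'H', 'T', 'D']
df = pd.DataFrame({'word': ['DEBUT', 'DEBUG', 'BEGET', 'DEPTH', 'DUVET']})

# For each word, np.isin returns one boolean per required letter;
# .all() is True only if every required letter occurs in the word.
mask = df['word'].apply(lambda w: bool(np.isin(letters, list(w)).all()))
result = df[mask]
print(result)
```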

CodePudding user response:

Another method:

df[df['word'].apply(lambda x: all(s in x for s in letters))]

output:

  letter_1 letter_2 letter_3 letter_4 letter_5   word
3        D        E        P        T        H  DEPTH