Pandas: how to filter out rows containing a string pattern within a list in a column?

Time:05-14

I have a data frame that looks similar to the following:

import pandas as pd

df = pd.DataFrame({
    'employee_id' : [123, 456, 789],
    'country_code' : ['US', 'CAN', 'MEX'],
    'comments' : (['good performer', 'due for raise', 'should be promoted'],
                 ['bad performer', 'should be fired', 'speak to HR'],
                 ['recently hired', 'needs training', 'shows promise'])
})

df

    employee_id   country_code   comments
0   123           US             [good performer, due for raise, should be promoted]
1   456           CAN            [bad performer, should be fired, speak to HR]
2   789           MEX            [recently hired, needs training, shows promise]

I would like to drop every row whose comments list contains the substring 'performer'. To do so, I'm using:

df = df[~df['comments'].str.contains('performer')]

But this raises an error:

TypeError: ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Thanks in advance for any assistance you can give!

CodePudding user response:

If I understand correctly, you need to join each list in the comments column into a single string before filtering:

df = pd.DataFrame({
    'employee_id' : [123, 456, 789],
    'country_code' : ['US', 'CAN', 'MEX'],
    'comments' : (['good performer', 'due for raise', 'should be promoted'],
                 ['bad performer', 'should be fired', 'speak to HR'],
                 ['recently hired', 'needs training', 'shows promise'])
})
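# Join each list into a single string (this overwrites the list column)
df['comments'] = df['comments'].apply(lambda x: ' '.join(x))
# Keep only the rows whose joined comments do not contain 'performer'
df = df[~df['comments'].str.contains('performer')]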
df
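
A short check of why the original line fails may help here. This is a sketch based on how pandas behaves in the versions I'm aware of, not something stated in the question: on a column of lists, .str.contains cannot search the elements and returns a missing value for each row, so there is no boolean mask for ~ to invert. The shortened comment lists below are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'employee_id': [123, 456, 789],
    'comments': [['good performer', 'due for raise'],
                 ['bad performer', 'speak to HR'],
                 ['recently hired', 'shows promise']],
})

# Each element is a list, not a string, so .str.contains yields a
# missing value per row instead of True/False.
mask = df['comments'].str.contains('performer')
print(mask.isna().all())

# After joining each list into one string, the result is a usable boolean mask.
joined_mask = df['comments'].str.join(' ').str.contains('performer')
print(joined_mask.tolist())
```

Series.str.join does the same job as the apply with ' '.join above, and it leaves the original list column untouched.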

CodePudding user response:

Because the Series holds lists rather than strings, the vectorized .str methods do not work on it directly. You can use a list comprehension instead:

keep = [all('performer' not in comment for comment in comments)
        for comments in df['comments']]
df2 = df[keep]

Output:

   employee_id country_code                                         comments
2          789          MEX  [recently hired, needs training, shows promise]
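
If you would rather stay with pandas methods, one alternative (my own sketch, not part of the answer above) is to explode the lists into one row per comment, test each comment, and collapse the result back onto the original index. It assumes a unique index and pandas 0.25 or later for Series.explode; the names exploded, hit and has_performer are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'employee_id': [123, 456, 789],
    'country_code': ['US', 'CAN', 'MEX'],
    'comments': [['good performer', 'due for raise', 'should be promoted'],
                 ['bad performer', 'should be fired', 'speak to HR'],
                 ['recently hired', 'needs training', 'shows promise']],
})

# One row per individual comment; the original index labels are repeated.
exploded = df['comments'].explode()

# True where a comment contains the literal substring.
# na=False treats missing values (e.g. from empty lists) as non-matches.
hit = exploded.str.contains('performer', regex=False, na=False)

# A row is flagged if any of its comments matched.
has_performer = hit.groupby(level=0).any()

# Keep only the rows with no matching comment; the lists stay intact.
df2 = df[~has_performer]
print(df2)
```

On a small frame the list comprehension is likely just as fast; the explode version is mainly useful when you want other .str options such as case=False or a regular expression.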

CodePudding user response:

You could concatenate the list into one string first using apply, and then test for the word you're interested in:

df = df[~df['comments'].apply(lambda x: ' '.join(x)).str.contains('performer')]
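
One caveat with joining on a space, offered as a side note rather than something the answer claims: a multi-word pattern can match across two neighbouring comments. The sketch below uses a hypothetical pattern, 'raise should', to show this, and joins on a newline instead, which avoids the problem as long as the pattern itself contains no newline.

```python
import pandas as pd

df = pd.DataFrame({
    'employee_id': [123, 456, 789],
    'country_code': ['US', 'CAN', 'MEX'],
    'comments': [['good performer', 'due for raise', 'should be promoted'],
                 ['bad performer', 'should be fired', 'speak to HR'],
                 ['recently hired', 'needs training', 'shows promise']],
})

# Hypothetical pattern: no single comment contains it, but two adjacent
# comments in the first row end and start with these words.
pattern = 'raise should'

# Joining on a space lets the pattern match across the comment boundary.
space_joined = df['comments'].apply(lambda x: ' '.join(x))
space_hits = space_joined.str.contains(pattern, regex=False).tolist()
print(space_hits)

# Joining on a newline keeps the comments separate for matching purposes.
newline_joined = df['comments'].apply(lambda x: '\n'.join(x))
newline_hits = newline_joined.str.contains(pattern, regex=False).tolist()
print(newline_hits)
```

For a single-word pattern such as 'performer' the choice of separator makes no difference.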