How to keep many specific strings in pandas dataframe

I have a dataframe with specific columns that looks like this:

colA    
['work', 'time', 'money', 'home', 'good', 'financial']    
['school', 'lazy', 'good', 'math', 'sad', 'important', 'dizzy', 'go']    
['frame', 'happy', 'feel', 'youth', 'change', 'home', 'past']    
['first', 'eat', 'good', 'hungry', 'empty', 'fool']    
['meet', 'risk', 'fire', 'angry', 'go']

colA holds strings, NOT lists. I also have a list like this:

word = ['good', 'sad', 'angry', 'feel', 'empty', 'dizzy', 'go', 'happy', 'fool', 'eat', 'past', 'lazy', 'youth', 'old', 'enjoy', 'free', 'time', 'hungry']

I want to keep only the words that are in the list, so the result should look like this:

colA    
['time', 'good']    
['lazy', 'good', 'sad', 'dizzy', 'go']    
['happy', 'feel', 'youth', 'past']     
['eat', 'good', 'hungry', 'empty', 'fool']    
['angry', 'go']

I've tried using str.contains, but I'm getting an error:

contains() takes from 2 to 6 positional arguments but 18 were given

I'm just a beginner, so sorry.

CodePudding user response：

Use ast.literal_eval with a list comprehension to keep only the matching values:

import ast

s = set(word)
df['new'] = df['colA'].apply(lambda x: [y for y in ast.literal_eval(x) if y in s])
print(df)
                                                colA  \
0  ['work', 'time', 'money', 'home', 'good', 'fin...   
1  ['school', 'lazy', 'good', 'math', 'sad', 'imp...   
2  ['frame', 'happy', 'feel', 'youth', 'change', ...   
3  ['first', 'eat', 'good', 'hungry', 'empty', 'f...   
4        ['meet', 'risk', 'fire', 'angry', 'go']       

                                new  
0                      [time, good]  
1      [lazy, good, sad, dizzy, go]  
2        [happy, feel, youth, past]  
3  [eat, good, hungry, empty, fool]  
4                       [angry, go]
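
For anyone who wants to reproduce this end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same filter. The DataFrame construction is an assumption about how the data is stored (one list-like string per cell); the filtering line is the one from the answer above.

```python
import ast

import pandas as pd

# Sample data from the question: each cell is a string that looks like a list.
df = pd.DataFrame({
    "colA": [
        "['work', 'time', 'money', 'home', 'good', 'financial']",
        "['school', 'lazy', 'good', 'math', 'sad', 'important', 'dizzy', 'go']",
        "['frame', 'happy', 'feel', 'youth', 'change', 'home', 'past']",
        "['first', 'eat', 'good', 'hungry', 'empty', 'fool']",
        "['meet', 'risk', 'fire', 'angry', 'go']",
    ]
})

word = ['good', 'sad', 'angry', 'feel', 'empty', 'dizzy', 'go', 'happy',
        'fool', 'eat', 'past', 'lazy', 'youth', 'old', 'enjoy', 'free',
        'time', 'hungry']

# A set gives constant-time membership tests on average.
s = set(word)

# Parse each string into a real list, then keep only the allowed words,
# preserving their original order.
df['new'] = df['colA'].apply(
    lambda x: [y for y in ast.literal_eval(x) if y in s]
)

print(df['new'].tolist())
```

Note that colA itself is left untouched (still strings), while the new column holds real Python lists.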

Performance comparison: with this data, apply is about as fast as a pure list comprehension:

df = pd.concat([df] * 10000, ignore_index=True)


In [26]: %timeit df['colB'] = [[w for w in ast.literal_eval(l) if w in s] for l in df['colA']]
845 ms ± 32.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [27]: %timeit df['new'] = df['colA'].apply(lambda x: [y for y in ast.literal_eval(x) if y in s])
826 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

CodePudding user response：

You can use ast.literal_eval in a list comprehension (faster than apply):

from ast import literal_eval

# using a set for efficiency (membership tests with "x in LIST" are slow)
S = set(word)

df['colA'] = [str([w for w in literal_eval(l) if w in S]) for l in df['colA']]

NB. The output here is a string; if you want a list, use: df['colA'] = [[w for w in literal_eval(l) if w in S] for l in df['colA']].

output:

                                         colA
0                            ['time', 'good']
1      ['lazy', 'good', 'sad', 'dizzy', 'go']
2          ['happy', 'feel', 'youth', 'past']
3  ['eat', 'good', 'hungry', 'empty', 'fool']
4                             ['angry', 'go']
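
If some cells might not be valid list literals (for example, a missing quote), literal_eval raises an exception and the whole comprehension fails. A minimal defensive sketch follows; keep_allowed is a hypothetical helper name, and returning an empty list for unparseable cells is an assumption about the behaviour you want, not part of the answer above.

```python
from ast import literal_eval


def keep_allowed(cell, allowed):
    """Return the words in `cell` (a list-like string) that are in `allowed`.

    Hypothetical helper. Cells that cannot be parsed as a Python list
    yield an empty list instead of raising.
    """
    try:
        parsed = literal_eval(cell)
    except (ValueError, SyntaxError):
        # literal_eval raises one of these for malformed input
        return []
    if not isinstance(parsed, list):
        return []
    return [w for w in parsed if w in allowed]


S = {'good', 'angry', 'go'}

print(keep_allowed("['meet', 'risk', 'fire', 'angry', 'go']", S))  # ['angry', 'go']
print(keep_allowed("['angry, 'go']", S))  # unbalanced quote: []
```

With a DataFrame this could be used as df['colA'] = [keep_allowed(l, S) for l in df['colA']].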

timing

The list comprehension is noticeably faster than apply in this test (674 ms vs. 1.04 s, tested on pandas 1.5):

df = pd.concat([df]*10000, ignore_index=True)

%%timeit
df['new'] = [[w for w in literal_eval(l) if w in S] for l in df['colA']]
674 ms ± 69.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df['new'] = df['colA'].apply(lambda x: [y for y in literal_eval(x) if y in S])
1.04 s ± 67.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
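
The two answers report different results for this comparison, so it is worth measuring on your own machine. Below is a rough sketch using the standard timeit module instead of the IPython %timeit magic. The two sample rows, the shortened word list, the 1,000 repetitions, and the function names are arbitrary choices for illustration; absolute numbers will vary with hardware and pandas version.

```python
import timeit
from ast import literal_eval

import pandas as pd

word = ['good', 'sad', 'go', 'lazy', 'dizzy']
S = set(word)

base = pd.DataFrame({"colA": [
    "['school', 'lazy', 'good', 'math', 'sad', 'important', 'dizzy', 'go']",
    "['meet', 'risk', 'fire', 'angry', 'go']",
]})
# Repeat the rows to get a measurable workload (1,000 copies is arbitrary).
df = pd.concat([base] * 1000, ignore_index=True)


def with_comprehension():
    return [[w for w in literal_eval(l) if w in S] for l in df['colA']]


def with_apply():
    return df['colA'].apply(lambda x: [y for y in literal_eval(x) if y in S])


# Both approaches should produce identical results.
assert with_comprehension() == with_apply().tolist()

for func in (with_comprehension, with_apply):
    # Take the best of 5 single runs to reduce noise.
    best = min(timeit.repeat(func, number=1, repeat=5))
    print(f"{func.__name__}: {best * 1000:.1f} ms")
```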