I have a dataframe with specific columns that looks like this:
colA
['work', 'time', 'money', 'home', 'good', 'financial']
['school', 'lazy', 'good', 'math', 'sad', 'important', 'dizzy', 'go']
['frame', 'happy', 'feel', 'youth', 'change', 'home', 'past']
['first', 'eat', 'good', 'hungry', 'empty', 'fool']
['meet', 'risk', 'fire', 'angry', 'go']
Each value in colA is a string, NOT a list. I also have a list like this:
word = ['good', 'sad', 'angry', 'feel', 'empty', 'dizzy', 'go', 'happy', 'fool', 'eat', 'past', 'lazy', 'youth', 'old', 'enjoy', 'free', 'time', 'hungry']
I want to keep only the words that appear in the list, so the result should look like this:
colA
['time', 'good']
['lazy', 'good', 'sad', 'dizzy', 'go']
['happy', 'feel', 'youth', 'past']
['eat', 'good', 'hungry', 'empty', 'fool']
['angry', 'go']
I've tried using str.contains, but I get an error:
contains() takes from 2 to 6 positional arguments but 18 were given
I'm just a beginner, so sorry.
CodePudding user response:
Use ast.literal_eval with a list comprehension to filter the matched values:
import ast
s = set(word)
df['new'] = df['colA'].apply(lambda x: [y for y in ast.literal_eval(x) if y in s])
print (df)
colA \
0 ['work', 'time', 'money', 'home', 'good', 'fin...
1 ['school', 'lazy', 'good', 'math', 'sad', 'imp...
2 ['frame', 'happy', 'feel', 'youth', 'change', ...
3 ['first', 'eat', 'good', 'hungry', 'empty', 'f...
4 ['meet', 'risk', 'fire', 'angry', 'go']
new
0 [time, good]
1 [lazy, good, sad, dizzy, go]
2 [happy, feel, youth, past]
3 [eat, good, hungry, empty, fool]
4 [angry, go]
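To see what each step does on a single cell, here is a minimal sketch that uses only the standard library. The names `cell`, `parsed`, `allowed` and `kept` are illustrative and do not come from the original post.

```python
import ast

# The allow-list from the question.
word = ['good', 'sad', 'angry', 'feel', 'empty', 'dizzy', 'go', 'happy', 'fool',
        'eat', 'past', 'lazy', 'youth', 'old', 'enjoy', 'free', 'time', 'hungry']

# One cell of colA: a string that merely looks like a list.
cell = "['work', 'time', 'money', 'home', 'good', 'financial']"

# literal_eval parses the string into a real Python list
# without executing arbitrary code.
parsed = ast.literal_eval(cell)

# A set makes each membership test O(1) on average.
allowed = set(word)

# Keep only the allowed words, preserving their original order.
kept = [w for w in parsed if w in allowed]

print(type(cell).__name__, type(parsed).__name__)  # str list
print(kept)  # ['time', 'good']
```

The apply call above runs this same parse-and-filter step once per row.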
Performance comparison: with this data, apply is about as fast as a pure list comprehension:
df = pd.concat([df] * 10000, ignore_index=True)
In [26]: %timeit df['colB'] = [[w for w in ast.literal_eval(l) if w in s] for l in df['colA']]
845 ms ± 32.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [27]: %timeit df['new'] = df['colA'].apply(lambda x: [y for y in ast.literal_eval(x) if y in s])
826 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
CodePudding user response:
You can use ast.literal_eval in a list comprehension (faster than apply):
from ast import literal_eval
# using a set for efficiency (a membership test such as `x in LIST` is slow)
S = set(word)
df['colA'] = [str([w for w in literal_eval(l) if w in S]) for l in df['colA']]
NB. the output here is a string; if you want a list, use: df['colA'] = [[w for w in literal_eval(l) if w in S] for l in df['colA']]
output:
colA
0 ['time', 'good']
1 ['lazy', 'good', 'sad', 'dizzy', 'go']
2 ['happy', 'feel', 'youth', 'past']
3 ['eat', 'good', 'hungry', 'empty', 'fool']
4 ['angry', 'go']
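For anyone who wants to reproduce this end to end, the sketch below rebuilds the sample dataframe from the question and produces both variants mentioned in the NB. The column names `as_list` and `as_str` are made up for illustration; the original answer overwrites colA instead.

```python
from ast import literal_eval

import pandas as pd

word = ['good', 'sad', 'angry', 'feel', 'empty', 'dizzy', 'go', 'happy', 'fool',
        'eat', 'past', 'lazy', 'youth', 'old', 'enjoy', 'free', 'time', 'hungry']

# Each cell is a string representation of a list, as in the question.
df = pd.DataFrame({'colA': [
    "['work', 'time', 'money', 'home', 'good', 'financial']",
    "['school', 'lazy', 'good', 'math', 'sad', 'important', 'dizzy', 'go']",
    "['frame', 'happy', 'feel', 'youth', 'change', 'home', 'past']",
    "['first', 'eat', 'good', 'hungry', 'empty', 'fool']",
    "['meet', 'risk', 'fire', 'angry', 'go']",
]})

S = set(word)

# Parse each cell and keep only the words found in S.
filtered = [[w for w in literal_eval(l) if w in S] for l in df['colA']]

df['as_list'] = filtered                    # real Python lists
df['as_str'] = [str(x) for x in filtered]   # strings, same dtype as the input

print(df[['as_list', 'as_str']])
```

Keeping real lists is usually more convenient for later processing (for example `df['as_list'].str.len()` or `explode`), while the string form matches the type of the original column.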
timing
The list comprehension is significantly faster than apply (tested on pandas 1.5):
df = pd.concat([df]*10000, ignore_index=True)
%%timeit
df['new'] = [[w for w in literal_eval(l) if w in S] for l in df['colA']]
674 ms ± 69.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['new'] = df['colA'].apply(lambda x: [y for y in literal_eval(x) if y in S])
1.04 s ± 67.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
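The two answers report different timings, so it may be worth measuring on your own machine. Below is a rough sketch of the same comparison using the standard timeit module instead of the IPython magics. The helper names `with_comprehension` and `with_apply` are hypothetical, and the data is replicated 1000 times rather than 10000 (an arbitrary choice to keep the run short).

```python
import timeit
from ast import literal_eval

import pandas as pd

word = ['good', 'sad', 'angry', 'feel', 'empty', 'dizzy', 'go', 'happy', 'fool',
        'eat', 'past', 'lazy', 'youth', 'old', 'enjoy', 'free', 'time', 'hungry']

rows = [
    "['work', 'time', 'money', 'home', 'good', 'financial']",
    "['school', 'lazy', 'good', 'math', 'sad', 'important', 'dizzy', 'go']",
    "['frame', 'happy', 'feel', 'youth', 'change', 'home', 'past']",
    "['first', 'eat', 'good', 'hungry', 'empty', 'fool']",
    "['meet', 'risk', 'fire', 'angry', 'go']",
]

# Assumption: 1000x replication (5000 rows) instead of the 10000x used above.
df = pd.concat([pd.DataFrame({'colA': rows})] * 1000, ignore_index=True)
S = set(word)


def with_comprehension():
    # Plain list comprehension over the column values.
    return [[w for w in literal_eval(l) if w in S] for l in df['colA']]


def with_apply():
    # Same filter, driven by Series.apply.
    return df['colA'].apply(lambda x: [y for y in literal_eval(x) if y in S])


# Both approaches should give identical results before timing them.
assert with_comprehension() == with_apply().tolist()

for fn in (with_comprehension, with_apply):
    # Best of 5 single runs, reported in milliseconds.
    best = min(timeit.repeat(fn, number=1, repeat=5))
    print(f"{fn.__name__}: {best * 1000:.1f} ms")
```

Absolute numbers will vary with the pandas version and hardware. In both versions most of the time is likely spent inside literal_eval, which may explain why the measured gap differs between the two answers.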