I generate a dataframe from a spaCy doc created from a text as follows:
import pandas as pd
import spacy

nlp = spacy.load('en_core_web_sm')  # or whichever model you use

test = 'We walked the walk and still walk it today. Walking brings us great joy.'
tokens = []
lemma = []
pos = []
df = pd.DataFrame()
doc = nlp(test)
for t in doc:
    tokens.append(t.text)
    lemma.append(t.lemma_)
    pos.append(t.pos_)
df['tokens'] = tokens
df['lemma'] = lemma
df['pos'] = pos
df
tokens lemma pos
0 We -PRON- PRON
1 walked walk VERB
2 the the DET
3 walk walk NOUN
4 and and CCONJ
5 still still ADV
6 walk walk VERB
7 it -PRON- PRON
8 today today NOUN
9 . . PUNCT
10 Walking walk VERB
11 brings bring VERB
12 us -PRON- PRON
13 great great ADJ
14 joy joy NOUN
15 . . PUNCT
And I group it by ('lemma', 'pos'):
groups_multiple = df.groupby(['lemma', 'pos'])
I want to find all lemmas that have both the POS tags 'VERB' and 'NOUN'. I tried to use .apply() and .filter(), but I'm not good at them.
For example, the lemma 'walk' satisfies the requirement because it has both 'VERB' and 'NOUN' in the 'pos' column.
How can I achieve this?
Addition:
Finally, I achieved it in a clumsy way: taking the intersection of the verb and noun lemma sets.
Here is my code:
lemma_v = set(gm[0][0] for gm in groups_multiple if gm[0][1] == 'VERB')
lemma_n = set(gm[0][0] for gm in groups_multiple if gm[0][1] == 'NOUN')
lemma_vn = list(lemma_v & lemma_n)
It's quite inefficient, but I don't know a better way. Does anyone have an idea to improve it?
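One way to avoid traversing the grouped data twice is to aggregate each lemma's POS tags into a set in a single pass and then filter on that set. A minimal sketch on toy data mirroring the columns above (the toy values are illustrative, not the full dataframe):

```python
import pandas as pd

# Toy stand-in for the spaCy-derived dataframe above
df = pd.DataFrame({
    'lemma': ['walk', 'walk', 'walk', 'today', 'bring', 'joy'],
    'pos':   ['VERB', 'NOUN', 'VERB', 'NOUN', 'VERB', 'NOUN'],
})

# Aggregate each lemma's POS tags into a set in one pass,
# then keep the lemmas whose set contains both tags
pos_sets = df.groupby('lemma')['pos'].agg(set)
lemma_vn = pos_sets[pos_sets.apply(lambda s: {'VERB', 'NOUN'} <= s)].index.tolist()
print(lemma_vn)  # ['walk']
```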
CodePudding user response:
Use groupby with transform to create a boolean mask and select the right rows:
# custom function to check whether a lemma's POS tags include both 'VERB' and 'NOUN'
is_verb_and_noun = lambda x: {'VERB', 'NOUN'} <= set(x)
out = df.loc[df.groupby('lemma')['pos'].transform(is_verb_and_noun), 'lemma']
print(out)
# Output:
1 walk
3 walk
6 walk
10 walk
Name: lemma, dtype: object
Final output:
>>> out.unique().tolist()
['walk']
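Since the question mentions .filter(), the same check also works with groupby.filter, which keeps every row of each group for which the predicate returns True. A minimal sketch on toy data with the same column names:

```python
import pandas as pd

# Toy stand-in for the spaCy-derived dataframe
df = pd.DataFrame({
    'lemma': ['walk', 'walk', 'walk', 'today', 'bring', 'joy'],
    'pos':   ['VERB', 'NOUN', 'VERB', 'NOUN', 'VERB', 'NOUN'],
})

# filter keeps whole groups that satisfy the predicate
out = df.groupby('lemma').filter(lambda g: {'VERB', 'NOUN'} <= set(g['pos']))
print(out['lemma'].unique().tolist())  # ['walk']
```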