Find all values that satisfy multiple keys in a Pandas groupby

Time:11-25

I build a dataframe from a spaCy doc created from a piece of text, as follows:

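import pandas as pd
import spacy

# Assumption: the model name is illustrative; any English spaCy pipeline should do.
nlp = spacy.load('en_core_web_sm')

test='We walked the walk and still walk it today. Walking brings us great joy.'
tokens=[]
lemma=[]
pos=[]

df=pd.DataFrame()

doc=nlp(test)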
for t in doc:
    tokens.append(t.text)
    lemma.append(t.lemma_)
    pos.append(t.pos_)
df['tokens']=tokens
df['lemma']=lemma
df['pos']=pos

df
     tokens   lemma    pos
0        We  -PRON-   PRON
1    walked    walk   VERB
2       the     the    DET
3      walk    walk   NOUN
4       and     and  CCONJ
5     still   still    ADV
6      walk    walk   VERB
7        it  -PRON-   PRON
8     today   today   NOUN
9         .       .  PUNCT
10  Walking    walk   VERB
11   brings   bring   VERB
12       us  -PRON-   PRON
13    great   great    ADJ
14      joy     joy   NOUN
15        .       .  PUNCT

Then I group it by ('lemma', 'pos'):

groups_multiple=df.groupby(['lemma','pos'])

I want to find all lemmas that have both pos 'VERB' and pos 'NOUN'. I tried to use .apply() and .filter(), but I'm not good with them.

For example, the lemma 'walk' satisfies the requirement because both 'VERB' and 'NOUN' appear in the 'pos' column for it.

How can I achieve this?


Addition:

In the end I achieved it in a crude way: by taking the intersection of the verb and noun sets.

Here is my code:

lemma_v=set(gm[0][0] for gm in groups_multiple if gm[0][1]=='VERB')
lemma_n=set(gm[0][0] for gm in groups_multiple if gm[0][1]=='NOUN')

lemma_vn=list(lemma_v & lemma_n)

It's very inefficient, but I don't know a better way. Does anybody have an idea for improving it?
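A minimal sketch of the same intersection taken straight from the dataframe columns, with no loop over the groups. The dataframe is rebuilt by hand from the table above so the snippet runs without spaCy; the names lemma_v, lemma_n and lemma_vn follow the code above.

```python
import pandas as pd

# 'lemma' and 'pos' columns copied from the table printed in the question
df = pd.DataFrame({
    'lemma': ['-PRON-', 'walk', 'the', 'walk', 'and', 'still', 'walk', '-PRON-',
              'today', '.', 'walk', 'bring', '-PRON-', 'great', 'joy', '.'],
    'pos': ['PRON', 'VERB', 'DET', 'NOUN', 'CCONJ', 'ADV', 'VERB', 'PRON',
            'NOUN', 'PUNCT', 'VERB', 'VERB', 'PRON', 'ADJ', 'NOUN', 'PUNCT'],
})

# Lemmas seen at least once as a verb, and at least once as a noun
lemma_v = set(df.loc[df['pos'] == 'VERB', 'lemma'])
lemma_n = set(df.loc[df['pos'] == 'NOUN', 'lemma'])

# Lemmas present in both sets; sorted only to make the order deterministic
lemma_vn = sorted(lemma_v & lemma_n)
print(lemma_vn)
```

The boolean selections are vectorized, so the only Python-level work left is building the two sets.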

CodePudding user response:

Use groupby(...).transform to create a boolean mask and select the matching rows:

# custom function: True if the lemma's 'pos' values include both 'VERB' and 'NOUN'
is_verb_and_noun = lambda x: {'VERB', 'NOUN'}.issubset(x)

out = df.loc[df.groupby('lemma')['pos'].transform(is_verb_and_noun), 'lemma']
print(out)

# Output:
1     walk
3     walk
6     walk
10    walk
Name: lemma, dtype: object

Final output:

>>> out.unique().tolist()
['walk']
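Since the question mentions .filter(), here is a sketch of that route as an alternative. groupby('lemma').filter keeps every row of a group for which the function returns True. The dataframe is rebuilt by hand from the table in the question so the snippet runs without spaCy; `required`, `kept` and `lemma_vn` are names chosen for illustration.

```python
import pandas as pd

# 'lemma' and 'pos' columns copied from the table printed in the question
df = pd.DataFrame({
    'lemma': ['-PRON-', 'walk', 'the', 'walk', 'and', 'still', 'walk', '-PRON-',
              'today', '.', 'walk', 'bring', '-PRON-', 'great', 'joy', '.'],
    'pos': ['PRON', 'VERB', 'DET', 'NOUN', 'CCONJ', 'ADV', 'VERB', 'PRON',
            'NOUN', 'PUNCT', 'VERB', 'VERB', 'PRON', 'ADJ', 'NOUN', 'PUNCT'],
})

required = {'VERB', 'NOUN'}

# Keep the rows of every lemma whose 'pos' values contain all required tags
kept = df.groupby('lemma').filter(lambda g: required.issubset(g['pos']))

lemma_vn = kept['lemma'].unique().tolist()
print(lemma_vn)
```

This subset test also keeps a lemma that carries other tags in addition to 'VERB' and 'NOUN'; compare the sets for equality instead if such lemmas should be excluded.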