Python- Group by column by selecting first value from another column, but not if value was already c

Time:11-28

I have a dataframe like:

Tag  Class  
A    P      
A    Q
B    P      
B    Q     
C    P      
C    Q    
C    R

I want to group by Tag and keep the first value from Class. However, if this value was used previously, look for the next value within the tag.

Expected output:

Tag  Class  
A    P      
B    Q        
C    R      

If there is no class left for the tag, then return null (or don't include Tag in output).

I have been trying to do this with drop_duplicates, but with no luck. How can I achieve this?

CodePudding user response:

We can define a custom function, call it dedupe, that keeps internal state in a set s to track the classes already used. For each group, it returns the first class that has not been used by an earlier group, or None if every class in the group is already taken.

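def dedupe():
    s = set()  # classes already assigned to an earlier tag
    def _dedupe(c):
        # keep only the classes of this group that are still unused
        c = c[~c.isin(s)]
        if len(c) > 0:
            s.add(c.iat[0])
            return c.iat[0]
        # no class left for this tag
        return None
    return _dedupe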


df.groupby('Tag', sort=False, as_index=False)['Class'].apply(dedupe())

  Tag Class
0   A     P
1   B     Q
2   C     R
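
For reference, here is a minimal self-contained sketch of the same idea written as an explicit loop over the groups, which may be easier to step through than the closure. The extra tag D (whose only class, P, is already taken) and the names used, rows and result are assumptions added here to illustrate the "no class left" case from the question; they are not part of the original data.

```python
import pandas as pd

# Sample data from the question, plus a hypothetical tag "D"
# whose only class is already used by tag "A".
df = pd.DataFrame({
    "Tag":   ["A", "A", "B", "B", "C", "C", "C", "D"],
    "Class": ["P", "Q", "P", "Q", "P", "Q", "R", "P"],
})

used = set()   # classes already assigned to an earlier tag
rows = []

# sort=False keeps the tags in their order of first appearance
for tag, classes in df.groupby("Tag", sort=False)["Class"]:
    remaining = classes[~classes.isin(used)]
    first = remaining.iat[0] if len(remaining) > 0 else None
    if first is not None:
        used.add(first)
    rows.append({"Tag": tag, "Class": first})

result = pd.DataFrame(rows)
print(result)
```

Tags that end up with no class get a null value; calling result.dropna(subset=["Class"]) afterwards would drop them instead, which matches the other option mentioned in the question.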