Home > database >  How to Group and get only the records with consecutive rank based on a condition in Python Pandas or
How to Group and get only the records with consecutive rank based on a condition in Python Pandas or

Time:12-26

Below is the data that I have, which has 3 columns:

  1. ID - Member ID
  2. Company : Company Name
  3. Year - Year of Joining the company
import pandas as pd
import numpy as np

data = {'ID':[1,1,1,2,2,3,3,3,3,3,4,4,4],
        'Company':['Google','Microsoft','LinkedIn','Youtube','Google','Google','Microsoft','Youtube','Google','Microsoft','Microsoft','Google','LinkedIn'],
        'Year':[2001,2004,2009,2001,2009,1999,2000,2003,2006,2010,2010,2012,2020]}

FullData = pd.DataFrame(data)


FullData - 
ID  Company   Year
1   Google    2001
1   Microsoft 2004
1   LinkedIn  2009
2   Youtube   2001
2   Google    2009
3   Google    1999
3   Microsoft 2000
3   Youtube   2003
3   Google    2006
3   Microsoft 2010
4   Microsoft 2010
4   Google    2012
4   LinkedIn  2020

Below I have grouped the data by ID and ranked it according to the Year


FullData['Rank'] = FullData.groupby('ID')['Year'].rank(method='first').astype(int)
FullData


ID  Company    Year    Rank
1   Google     2001     1
1   Microsoft  2004     2
1   LinkedIn   2009     3
2   Youtube    2001     1
2   Google     2009     2
3   Google     1999     1
3   Microsoft  2000     2
3   Youtube    2003     3
3   Google     2006     4
3   Microsoft  2010     5
4   Microsoft  2010     1
4   Google     2012     2
4   LinkedIn   2020     3

Now I need to get only the Member ID's who have joined Microsoft right after google. I need to get only the records partitioned or grouped by ID which has Company Google and Microsoft and the Rank of Google is followed by Microsoft consecutively. (Accepted Output --> Google - Rank 1 and Microsoft -Rank 2 or Google - Rank 4 and Microsoft -Rank 5 and so on..)

Below is the sample of desired output

ID  Company    Year    Rank
1   Google     2001     1
1   Microsoft  2004     2
3   Google     1999     1
3   Microsoft  2000     2
3   Google     2006     4
3   Microsoft  2010     5

OR Count of Unique ID's

Count of Unique ID's/Members who worked for Google prior to Microsoft = 2

Any help is appreciated. Thanks a million in advance

CodePudding user response:

Use boolean indexing:

def myfunc(df):
    m1 = (df['Company'].eq('Google') & df['Company'].shift(-1).eq('Microsoft'))
    m2 = (df['Rank'].eq(df['Rank'].shift(-1) - 1))
    return df[(m1 & m2) | (m1.shift() & m2.shift())]

out = FullData[FullData['Company'].isin(['Google', 'Microsoft'])] \
          .groupby('ID').apply(myfunc).droplevel(0)
print(out)

# Output:
   ID    Company  Year  Rank
0   1     Google  2001     1
1   1  Microsoft  2004     2
5   3     Google  1999     1
6   3  Microsoft  2000     2
8   3     Google  2006     4
9   3  Microsoft  2010     5

For the unique count, use out['ID'].nunique()

  • Related