How to filter dataframe based on single exact match in data group using Pandas-CodePudding

my df looks like this:

student_id,    coll_type,    completed    
1001           2-year        N          
1001           2-year        N           
1001           2-year        Y           
1001           4-year        Y            

1002           4-year        N              
1002           2-year        N                   
1002           2-year        N                   
1002           2-year        Y                   

1003           4-year        N                    
1003           4-year        N                    

1004           2-year        N                   
1004           2-year        Y

I am trying to accomplish 3 things:

student_id who only went to coll_type 2-year for the student_id group
student_id who only went to coll_type 4-year for the student_id group
student_id who only went to both coll_type 2-year and 4-year for the student_id group
finally, I want to have 3 different dataframe from the original dataframe

What did I do ?

I tried to do it this way to get only 2-year but it does not work 100% as expected.

 df[~(df['coll_type']=='4-year')]

the reason it does not work as expected is because it will also give me student_id who also went to 4-year but only returns their 2-year data.

how do I only get the data in which student_id only went to respective coll_type?

CodePudding user response：

You can create a category column based on coll_type as follows and then group or filter based on the category column:

def categorize(coll_type: pd.Series):
    """
    takes a Series as parameter, returns 0 if all values are 2-year
    returns 1 if all values are 4-year. Otherwise return 2
    """
    if (coll_type == '2-year').all():
        return 0
    elif (coll_type == '4-year').all():
        return 1
    else:
        return 2

df['category'] = df.coll_type.groupby(df.student_id).transform(categorize)

df
    student_id coll_type completed  category
0         1001    2-year         N         2
1         1001    2-year         N         2
2         1001    2-year         Y         2
3         1001    4-year         Y         2
4         1002    4-year         N         2
5         1002    2-year         N         2
6         1002    2-year         N         2
7         1002    2-year         Y         2
8         1003    4-year         N         1
9         1003    4-year         N         1
10        1004    2-year         N         0
11        1004    2-year         Y         0

To further get 2-year / 4-year student ids, just do a filter:

df[df.category == 0].student_id.unique()
# array([1004], dtype=int64)

df[df.category == 1].student_id.unique()
# array([1003], dtype=int64)