my df
looks like this:
student_id, coll_type, completed
1001 2-year N
1001 2-year N
1001 2-year Y
1001 4-year Y
1002 4-year N
1002 2-year N
1002 2-year N
1002 2-year Y
1003 4-year N
1003 4-year N
1004 2-year N
1004 2-year Y
I am trying to accomplish 3 things:
- student_id who only went to
coll_type
2-year
for the student_id group - student_id who only went to
coll_type
4-year
for the student_id group - student_id who only went to both
coll_type
2-year
and4-year
for the student_id group - finally, I want to have 3 different dataframe from the original dataframe
What did I do ?
I tried to do it this way to get only 2-year
but it does not work 100% as expected.
df[~(df['coll_type']=='4-year')]
the reason it does not work as expected is because it will also give me student_id
who also went to 4-year
but only returns their 2-year
data.
how do I only get the data in which student_id only went to respective coll_type
?
CodePudding user response:
You can create a category column based on coll_type
as follows and then group or filter based on the category
column:
def categorize(coll_type: pd.Series):
"""
takes a Series as parameter, returns 0 if all values are 2-year
returns 1 if all values are 4-year. Otherwise return 2
"""
if (coll_type == '2-year').all():
return 0
elif (coll_type == '4-year').all():
return 1
else:
return 2
df['category'] = df.coll_type.groupby(df.student_id).transform(categorize)
df
student_id coll_type completed category
0 1001 2-year N 2
1 1001 2-year N 2
2 1001 2-year Y 2
3 1001 4-year Y 2
4 1002 4-year N 2
5 1002 2-year N 2
6 1002 2-year N 2
7 1002 2-year Y 2
8 1003 4-year N 1
9 1003 4-year N 1
10 1004 2-year N 0
11 1004 2-year Y 0
To further get 2-year / 4-year student ids, just do a filter:
df[df.category == 0].student_id.unique()
# array([1004], dtype=int64)
df[df.category == 1].student_id.unique()
# array([1003], dtype=int64)