Compare same column rows for an ID and reduce the rows into desired output-CodePudding

I have a dataframe like this

Student ID	Subject	SID	Result
101	English	1	Pass
101	English	1	Pass
101	Mathematics	3	Pass
101	Mathematics	3	Pass
101	Mathematics	33	Fail
101	Mathematics	33	Fail
102	English	1	Pass
102	English	1	Pass
102	Mathematics	3	Fail
102	Mathematics	3	Fail

I want an output like this

Subject	SID	Pass	Fail	Failed_student_ID
English	1	2	0
Mathematics	3	1	1	102
Mathematics	33	0	1	101

I have a large dataset and want to take the result with respect to subject and subject ID on how many people passed and failed.

How do I write in python to get this kind of data frame?

CodePudding user response：

You can use pivot_table with aggfunc of len to get count of Pass, Fail and reset_index

df = df.drop_duplicates(ignore_index=True)
df = (df.drop('Student ID', axis=1).
    pivot_table(index=['Subject', 'SID'],
               columns='Result',
               aggfunc=len,
               fill_value=0).
    rename_axis('', axis=1).
    reset_index()
     )
df['Total Students'] = df['Pass']   df['Fail']

print(df):

       Subject  SID  Fail  Pass  Total Students
0      English    1     0     2               2
1  Mathematics    3     1     1               2
2  Mathematics   33     1     0               1

Failed Student ID:

If you need Failed Student ID, you need to do more - include 'Student ID' in the pivot_table with list aggfunc and do some rearranging of columns:

df = (df.
    pivot_table(index=['Subject', 'SID'],
               columns=['Result'],
               values=['Student ID'],
               aggfunc={'Result':len,'Student ID':list},
               fill_value=0)
     )
df.columns = df.columns.to_flat_index()
df = df.drop(('Student ID', 'Pass'),axis=1)
df.columns = ['Fail', 'Pass', 'Failed_Student_ID']
df = df.reset_index()

print(df):

       Subject  SID  Fail  Pass Failed_Student_ID
0      English    1     0     2                 0
1  Mathematics    3     1     1             [102]
2  Mathematics   33     1     0             [101]

CodePudding user response：

You can use the function get_dummies and group by the columns "Student ID" and "Subject", something like this:

marks = pd.DataFrame({
    'student':   ['100','100', '101'],
    'subject': ['math', 'math', 'math'],
    'result': ['pass', 'pass', 'failed']
})
marks[['failed','pass']] = pd.get_dummies(marks.result)
marks.groupby(['student','subject']).sum()

CodePudding user response：

You can drop all the duplicates in your dataframe, then using groupby to count the values of the Pass and Fail in every subject.

result = df.drop_duplicates().groupby(['Subject','SID'])['Result'].value_counts().unstack(fill_value=0).reset_index()

Result      Subject  SID  Fail  Pass
0           English    1     0     2
1       Mathematics    3     1     1
2       Mathematics   33     1     0

result['Total student'] = result[['Fail','Pass']].sum(axis=1)