Home > Net >  Compare same column rows for an ID and reduce the rows into desired output
Compare same column rows for an ID and reduce the rows into desired output

Time:09-20

I have a dataframe like this

Student ID Subject SID Result
101 English 1 Pass
101 English 1 Pass
101 Mathematics 3 Pass
101 Mathematics 3 Pass
101 Mathematics 33 Fail
101 Mathematics 33 Fail
102 English 1 Pass
102 English 1 Pass
102 Mathematics 3 Fail
102 Mathematics 3 Fail

I want an output like this

Subject SID Pass Fail Failed_student_ID
English 1 2 0
Mathematics 3 1 1 102
Mathematics 33 0 1 101

I have a large dataset and want to take the result with respect to subject and subject ID on how many people passed and failed.

How do I write in python to get this kind of data frame?

CodePudding user response:

You can use pivot_table with aggfunc of len to get count of Pass, Fail and reset_index

df = df.drop_duplicates(ignore_index=True)
df = (df.drop('Student ID', axis=1).
    pivot_table(index=['Subject', 'SID'],
               columns='Result',
               aggfunc=len,
               fill_value=0).
    rename_axis('', axis=1).
    reset_index()
     )
df['Total Students'] = df['Pass']   df['Fail']

print(df):

       Subject  SID  Fail  Pass  Total Students
0      English    1     0     2               2
1  Mathematics    3     1     1               2
2  Mathematics   33     1     0               1

Failed Student ID:

If you need Failed Student ID, you need to do more - include 'Student ID' in the pivot_table with list aggfunc and do some rearranging of columns:

df = (df.
    pivot_table(index=['Subject', 'SID'],
               columns=['Result'],
               values=['Student ID'],
               aggfunc={'Result':len,'Student ID':list},
               fill_value=0)
     )
df.columns = df.columns.to_flat_index()
df = df.drop(('Student ID', 'Pass'),axis=1)
df.columns = ['Fail', 'Pass', 'Failed_Student_ID']
df = df.reset_index()

print(df):

       Subject  SID  Fail  Pass Failed_Student_ID
0      English    1     0     2                 0
1  Mathematics    3     1     1             [102]
2  Mathematics   33     1     0             [101]

CodePudding user response:

You can use the function get_dummies and group by the columns "Student ID" and "Subject", something like this:

marks = pd.DataFrame({
    'student':   ['100','100', '101'],
    'subject': ['math', 'math', 'math'],
    'result': ['pass', 'pass', 'failed']
})
marks[['failed','pass']] = pd.get_dummies(marks.result)
marks.groupby(['student','subject']).sum()

CodePudding user response:

You can drop all the duplicates in your dataframe, then using groupby to count the values of the Pass and Fail in every subject.

result = df.drop_duplicates().groupby(['Subject','SID'])['Result'].value_counts().unstack(fill_value=0).reset_index()

Result      Subject  SID  Fail  Pass
0           English    1     0     2
1       Mathematics    3     1     1
2       Mathematics   33     1     0

result['Total student'] = result[['Fail','Pass']].sum(axis=1)
  • Related