I have a dataframe like this
Student ID | Subject | SID | Result |
---|---|---|---|
101 | English | 1 | Pass |
101 | English | 1 | Pass |
101 | Mathematics | 3 | Pass |
101 | Mathematics | 3 | Pass |
101 | Mathematics | 33 | Fail |
101 | Mathematics | 33 | Fail |
102 | English | 1 | Pass |
102 | English | 1 | Pass |
102 | Mathematics | 3 | Fail |
102 | Mathematics | 3 | Fail |
I want an output like this
Subject | SID | Pass | Fail | Failed_student_ID |
---|---|---|---|---|
English | 1 | 2 | 0 | |
Mathematics | 3 | 1 | 1 | 102 |
Mathematics | 33 | 0 | 1 | 101 |
I have a large dataset and want to take the result with respect to subject and subject ID on how many people passed and failed.
How do I write in python to get this kind of data frame?
CodePudding user response:
You can use pivot_table
with aggfunc of len
to get count of Pass
, Fail
and reset_index
df = df.drop_duplicates(ignore_index=True)
df = (df.drop('Student ID', axis=1).
pivot_table(index=['Subject', 'SID'],
columns='Result',
aggfunc=len,
fill_value=0).
rename_axis('', axis=1).
reset_index()
)
df['Total Students'] = df['Pass'] df['Fail']
print(df):
Subject SID Fail Pass Total Students
0 English 1 0 2 2
1 Mathematics 3 1 1 2
2 Mathematics 33 1 0 1
Failed Student ID:
If you need Failed Student ID, you need to do more - include 'Student ID' in the pivot_table with list
aggfunc and do some rearranging of columns:
df = (df.
pivot_table(index=['Subject', 'SID'],
columns=['Result'],
values=['Student ID'],
aggfunc={'Result':len,'Student ID':list},
fill_value=0)
)
df.columns = df.columns.to_flat_index()
df = df.drop(('Student ID', 'Pass'),axis=1)
df.columns = ['Fail', 'Pass', 'Failed_Student_ID']
df = df.reset_index()
print(df):
Subject SID Fail Pass Failed_Student_ID
0 English 1 0 2 0
1 Mathematics 3 1 1 [102]
2 Mathematics 33 1 0 [101]
CodePudding user response:
You can use the function get_dummies and group by the columns "Student ID" and "Subject", something like this:
marks = pd.DataFrame({
'student': ['100','100', '101'],
'subject': ['math', 'math', 'math'],
'result': ['pass', 'pass', 'failed']
})
marks[['failed','pass']] = pd.get_dummies(marks.result)
marks.groupby(['student','subject']).sum()
CodePudding user response:
You can drop all the duplicates in your dataframe, then using groupby to count the values of the Pass and Fail in every subject.
result = df.drop_duplicates().groupby(['Subject','SID'])['Result'].value_counts().unstack(fill_value=0).reset_index()
Result Subject SID Fail Pass
0 English 1 0 2
1 Mathematics 3 1 1
2 Mathematics 33 1 0
result['Total student'] = result[['Fail','Pass']].sum(axis=1)