I have a data frame with 3 columns of classes and 5 rows of students in each class. Some of these students are duplicates. I want to list the most common student names from all the classes and list them in descending order, the number of times they exist, and which the classes they exist in.
df = pd.DataFrame({
'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})
biology statistics ecology
0 ryan sarah austin
1 sarah ed ryan
2 tom jacob tom
3 ed ryan sam
4 jackson de sarah
I want the output look something like this:
ryan, 3 classes, (biology, statistics, ecology)
sarah, 3 classes, (biology, statistics, ecology)
tom, 2 classes, (biology, ecology)
ed, 2 classes, (biology, statistics)
jackson, 1 class, (biology)
jacob, 1 class, (statistics)
de, 1 class, (statistics)
austin, 1 class, (ecology)
...and so on
Any help would be appreciated, I'm a beginner so I have been at this for a several hours. Brain is getting killed. Thanks!
CodePudding user response:
We can melt
the DataFrame to get to long form, then groupby aggregate
with Named Aggregation to get both the number of classes, and the names of the classes, lastly we can sort_values
to get the highest frequency students first:
output_df = (
df.melt(var_name='class name', value_name='student name')
.groupby('student name', as_index=False)
.agg(class_count=('class name', 'count'),
classes=('class name', tuple))
.sort_values('class_count', ascending=False, ignore_index=True)
)
output_df
:
student name class_count classes
0 ryan 3 (biology, statistics, ecology)
1 sarah 3 (biology, statistics, ecology)
2 ed 2 (biology, statistics)
3 tom 2 (biology, ecology)
4 austin 1 (ecology,)
5 de 1 (statistics,)
6 jackson 1 (biology,)
7 jacob 1 (statistics,)
8 sam 1 (ecology,)
We can further conditionally add classes/class to class_count
and write to_csv
:
# Conditionally Add Classes/Class
output_df['class_count'] = output_df['class_count'].astype(str) np.where(
output_df['class_count'].eq(1),
' class',
' classes'
)
# Write to CSV
output_df.to_csv('output.csv', index=False, header=None)
output.csv
:
ryan,3 classes,"('biology', 'statistics', 'ecology')"
sarah,3 classes,"('biology', 'statistics', 'ecology')"
ed,2 classes,"('biology', 'statistics')"
tom,2 classes,"('biology', 'ecology')"
austin,1 class,"('ecology',)"
de,1 class,"('statistics',)"
jackson,1 class,"('biology',)"
jacob,1 class,"('statistics',)"
sam,1 class,"('ecology',)"
Setup and imports:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})
CodePudding user response:
df = pd.DataFrame({
'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})
results = {}
for h in df:
for k,v in df[h].value_counts().items():
print(k,v)
if k in results:
results[k]['value'] = v
results[k]['class'].append(h)
else:
results[k] = {'value':v,'class':[h]}
results = {h:results[h] for h in sorted(results, key=lambda x:results[x]['value'],reverse=True)}