I am creating a dummy dataset for testing a cloud storage and dashboard system for a university. I am currently trying to assign courses to each student ID for a given term, which corresponds to the course enrollment step in real life. Most students take a full load of 4 classes, and some take 3, 2, or 1 classes, with decreasing probability.
I have two pandas DataFrames, 'courses' and 'students_master'.
'courses' has 1100 rows and looks like this:
subject_id course_id SECTION_SUBJECT SECTION_SUBJECT_DESC \
0 HCH HCH-101 HPCH Community Health Promotion
1 HCH HCH-102 HPCH Community Health Promotion
2 HCH HCH-103 HPCH Community Health Promotion
3 HCH HCH-104 HPCH Community Health Promotion
4 HCH HCH-105 HPCH Community Health Promotion
'students_master' has 27054 rows and looks like this:
ID_year_id cohort ids level num_classes
0 22180 2013FA 1001269 4 4
1 49919 2013FA 1000206 4 4
2 48206 2013FA 1000524 4 2
3 40649 2013FA 1000233 4 3
4 29733 2013FA 1000533 4 2
At this point I am trying to create a new column, students_master['selections'], where I use the number (1-4) in the 'num_classes' column to randomly select that many course_ids from courses['course_id']. The resulting column values would be small lists like [HCH-101, TWI-302,...]
When I use this piece of code:
list(courses['course_id'].sample(4))
it works, and results in:
['EVS-406', 'BFN-201', 'ATS-105', 'BOL-103']
I have tried using .apply as well as basic for loops, with no luck. I think the most promising method is to 'vectorize', so I wrote this np.select statement:
selections=[]
conditions = [
(students_master['num_classes']==4),
(students_master['num_classes']==3),
(students_master['num_classes']==2),
(students_master['num_classes']==1)
]
choices = [
([list(courses['course_id'].sample(4))]),
([list(courses['course_id'].sample(3))]),
([list(courses['course_id'].sample(2))]),
([list(courses['course_id'].sample(1))])
]
selections.append(np.select(conditions, choices))
and it gets the error: "shape mismatch: objects cannot be broadcast to a single shape"
Any advice on how to solve this problem is greatly appreciated.
Answer:
You can use apply with np.random.choice and replace=False to ensure that courses are not repeated within each student:
selection = students_master['num_classes'].apply(lambda x: np.random.choice(courses['course_id'], x, replace=False))
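As for why the np.select attempt fails: each entry in choices is a single list of a different length, so NumPy cannot broadcast it against the 27054-row conditions. Even if it could, each .sample() call runs only once, so every student with the same num_classes would get the same courses. Below is a minimal, self-contained sketch of the apply approach that stores the result in the 'selections' column. The two small DataFrames are stand-ins for the real ones, and the seed and the use of np.random.default_rng are my own assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd

# Small stand-in frames (assumed for illustration); the real ones
# have 1100 and 27054 rows respectively.
courses = pd.DataFrame(
    {"course_id": ["HCH-101", "HCH-102", "HCH-103", "HCH-104", "HCH-105"]}
)
students_master = pd.DataFrame(
    {
        "ids": [1001269, 1000206, 1000524, 1000233, 1000533],
        "num_classes": [4, 4, 2, 3, 2],
    }
)

# Seeded generator so the dummy data is reproducible (seed is arbitrary).
rng = np.random.default_rng(0)
course_ids = courses["course_id"].to_numpy()

# One independent draw per student; replace=False prevents a student
# from being assigned the same course twice.
students_master["selections"] = students_master["num_classes"].apply(
    lambda n: rng.choice(course_ids, size=n, replace=False).tolist()
)

print(students_master)
```

Calling .tolist() turns each result into a plain Python list, which matches the small-list format described in the question. This still runs one Python-level call per row, so it is not truly vectorized.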