Select number of values from column based on condition in a different df column

I am working on creating a dummy dataset for testing a cloud storage and dashboard system for a university. I am currently trying to assign courses to each student ID for a given term; this would be the course enrollment step in real life. Most students take a full load of 4 classes, and some take 3, 2, or 1 class, with decreasing probability.

I have two pandas DataFrames, 'courses' and 'students_master'.

'courses' has 1100 rows and looks like this:

  subject_id course_id SECTION_SUBJECT        SECTION_SUBJECT_DESC  \
0        HCH   HCH-101            HPCH  Community Health Promotion   
1        HCH   HCH-102            HPCH  Community Health Promotion   
2        HCH   HCH-103            HPCH  Community Health Promotion   
3        HCH   HCH-104            HPCH  Community Health Promotion   
4        HCH   HCH-105            HPCH  Community Health Promotion

'students_master' has 27054 rows and looks like this:

 ID_year_id  cohort      ids  level num_classes
0       22180  2013FA  1001269      4           4
1       49919  2013FA  1000206      4           4
2       48206  2013FA  1000524      4           2
3       40649  2013FA  1000233      4           3
4       29733  2013FA  1000533      4           2

At this point I am trying to create a new column, students_master['selections'], where I use the number, 1-4, in the 'num_classes' column to randomly select a number of course_ids from courses['course_id']. The resulting column values would be small lists like [HCH-101, TWI-302,...]

When I use this piece of code:

list(courses['course_id'].sample(4))

it works, and results in:

['EVS-406', 'BFN-201', 'ATS-105', 'BOL-103']

I have tried using .apply as well as basic for loops with no luck. I think the most promising method is to 'vectorize', so I wrote this np.select statement:

selections=[]
conditions = [
        (students_master['num_classes']==4),
        (students_master['num_classes']==3),
        (students_master['num_classes']==2),
        (students_master['num_classes']==1)
]
choices = [
        ([list(courses['course_id'].sample(4))]),
        ([list(courses['course_id'].sample(3))]),
        ([list(courses['course_id'].sample(2))]),
        ([list(courses['course_id'].sample(1))])
]


selections.append(np.select(conditions, choices))

and it gets the error: "shape mismatch: objects cannot be broadcast to a single shape"

Any advice on how to solve this problem is greatly appreciated.

CodePudding user response：

You can use apply together with np.random.choice and replace=False, which ensures the courses are not repeated within each student:

# DataFrame names match the question: students_master and courses
students_master['selections'] = students_master['num_classes'].apply(lambda x: list(np.random.choice(courses['course_id'], x, replace=False)))
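
For anyone who wants to try this without the real data, here is a minimal, self-contained sketch of the same apply approach. The two small DataFrames are made-up stand-ins for 'courses' and 'students_master', and the seeded generator `rng` is only an assumption added to make the run reproducible; it is not part of the original question or answer.

```python
import numpy as np
import pandas as pd

# Small stand-ins for the real frames (values are for illustration only).
courses = pd.DataFrame(
    {"course_id": ["HCH-101", "HCH-102", "HCH-103", "HCH-104", "HCH-105"]}
)
students_master = pd.DataFrame(
    {
        "ids": [1001269, 1000206, 1000524, 1000233, 1000533],
        "num_classes": [4, 4, 2, 3, 2],
    }
)

# Seeded generator, used here only so the sketch is reproducible.
rng = np.random.default_rng(0)
course_ids = courses["course_id"].to_numpy()

# One independent draw per student: n distinct course ids, stored as a list.
students_master["selections"] = students_master["num_classes"].apply(
    lambda n: rng.choice(course_ids, size=n, replace=False).tolist()
)

print(students_master)
```

As far as I can tell, the np.select attempt fails because every entry in choices has to broadcast against the conditions, and lists of length 4, 3, 2 and 1 cannot be broadcast to one shape. Each .sample(...) call in choices is also evaluated only once, so even if the shapes had matched, every student with the same num_classes would have been given the same courses. Drawing inside apply avoids both problems, at the cost of one Python-level call per row.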