Home > database >  Grouping a dataframe by a specific column followed by splitting via a specific value
Grouping a dataframe by a specific column followed by splitting via a specific value

Time:03-13

Assume I have the following raw dataframe composes of some students information:

STUDENT_ID  SEMESTER    COURSE_NAME
50694           21          X1
50694           21          X2
50285           21          X3
51442           21          X4
51600           21          X5
50285           21          X6
50494           21          X7
50494           21          X8
50802           21          X9
50802           21          X10
50802           21          X11
50285           21          X12

I want to (1) group them by the STUDENT_IDso that each student's courses are group together, to look like this:

STUDENT_ID  SEMESTER    COURSE_NAME
50694           21          X1
50694           21          X2
50494           21          X7
50494           21          X8
50285           21          X3
50285           21          X6
50285           21          X12
50802           21          X9
50802           21          X10
50802           21          X11
51442           21          X4
51600           21          X5

And (2) split them by a specifc value in the STUDENT_ID column, say 50802 to get:

STUDENT_ID  SEMESTER    COURSE_NAME
50694           21          X1
50694           21          X2
50494           21          X7
50494           21          X8
50285           21          X3
50285           21          X6
50285           21          X12

and

STUDENT_ID  SEMESTER    COURSE_NAME
50802           21          X9
50802           21          X10
50802           21          X11
51442           21          X4
51600           21          X5

Any help is highly appreciated.

CodePudding user response:

Use df.sort_values to sort the dataframe, and then use np.split to split the dataframe at the specified index(es):

df = df.sort_values('STUDENT_ID').reset_index(drop=True)
splits = np.split(df, [df['STUDENT_ID'].eq(50802).idxmax()])

Output:

>>> splits
[   STUDENT_ID  SEMESTER COURSE_NAME
 0       50285        21          X3
 1       50285        21          X6
 2       50285        21         X12
 3       50494        21          X7
 4       50494        21          X8
 5       50694        21          X1
 6       50694        21          X2,
     STUDENT_ID  SEMESTER COURSE_NAME
 7        50802        21          X9
 8        50802        21         X10
 9        50802        21         X11
 10       51442        21          X4
 11       51600        21          X5]

>>> splits[0]
   STUDENT_ID  SEMESTER COURSE_NAME
0       50285        21          X3
1       50285        21          X6
2       50285        21         X12
3       50494        21          X7
4       50494        21          X8
5       50694        21          X1
6       50694        21          X2

>>> splits[1]
    STUDENT_ID  SEMESTER COURSE_NAME
7        50802        21          X9
8        50802        21         X10
9        50802        21         X11
10       51442        21          X4
11       51600        21          X5

CodePudding user response:

The first step is a simple sort_values. For the second one, the condition is unclear but you could use groupby:

dfs = [g for _,g in
(df
 .sort_values(by='STUDENT_ID')
 .groupby(df['STUDENT_ID'].lt(50802))
)]

Output:

[    STUDENT_ID  SEMESTER COURSE_NAME
 8        50802        21          X9
 9        50802        21         X10
 10       50802        21         X11
 3        51442        21          X4
 4        51600        21          X5,
     STUDENT_ID  SEMESTER COURSE_NAME
 2        50285        21          X3
 5        50285        21          X6
 11       50285        21         X12
 6        50494        21          X7
 7        50494        21          X8
 0        50694        21          X1
 1        50694        21          X2]
  • Related