How to query a dataframe optimally?-CodePudding

I have a csv with large number of rows.

Sample:

User_id , Marks
12    3
13    2
.     .

Marks can be in range (1,5) I want to count for every user_id The count of marks received in various ranges (1-2), (2,3), (3,4), (4,5).

I used df.query() for this but it is taking a lot of time as rows are large. Please suggest optimal way to achieve this.

CodePudding user response：

I think this should work for you (Edit, I restructured the code to make it more readable):

df = pd.DataFrame({'User_id':[12, 13, 12, 13, 12, 13] , 'Marks': [3.0, 2.2, 4.9, 1.0, 3.1, 2.9]})

mark_ranges = pd.cut(
    df['Marks'],
    bins=[0,2,3,4,5],
    labels=['1-2', '2-3', '3-4', '4-5'],    
).rename('Mark Range')

result = (
    df
    .join(mark_ranges)
    .groupby(['User_id', 'Mark Range'])
    .size()
)
print(result)

output:

User_id  Mark Range
12       1-2           0
         2-3           1
         3-4           1
         4-5           1
13       1-2           1
         2-3           2
         3-4           0
         4-5           0
dtype: int64

In this example 3 (e.g. 3.0) is put in the 2-3 group, whereas anything over 3 (e.g. 3.1) is put in 3-4. In other words the upper end of the range is inclusive.