I have a csv with large number of rows.
Sample:
User_id , Marks
12 3
13 2
. .
Marks can be in range (1,5) I want to count for every user_id The count of marks received in various ranges (1-2), (2,3), (3,4), (4,5).
I used df.query() for this but it is taking a lot of time as rows are large. Please suggest optimal way to achieve this.
CodePudding user response:
I think this should work for you (Edit, I restructured the code to make it more readable):
df = pd.DataFrame({'User_id':[12, 13, 12, 13, 12, 13] , 'Marks': [3.0, 2.2, 4.9, 1.0, 3.1, 2.9]})
mark_ranges = pd.cut(
df['Marks'],
bins=[0,2,3,4,5],
labels=['1-2', '2-3', '3-4', '4-5'],
).rename('Mark Range')
result = (
df
.join(mark_ranges)
.groupby(['User_id', 'Mark Range'])
.size()
)
print(result)
output:
User_id Mark Range
12 1-2 0
2-3 1
3-4 1
4-5 1
13 1-2 1
2-3 2
3-4 0
4-5 0
dtype: int64
In this example 3 (e.g. 3.0) is put in the 2-3 group, whereas anything over 3 (e.g. 3.1) is put in 3-4. In other words the upper end of the range is inclusive.