I am looking for an efficient function to find the marks which appear at least thrice one after another without interruption.
Input example:
import pandas as pd
marks = [83, 79, 83, 83, 83, 79, 79, 83]
student_id = [101, 102, 103, 104, 105, 106, 107, 108]
d = {'student_id': student_id, 'marks': marks}
df = pd.DataFrame(d)
Desired output:
83
If possible, I am looking for something more efficient than looping row by row with a for loop that keeps track of the previous 2 marks. That is, I'm looking for something better than the following:
def thrice_f(marks, number_of_appearances):
    cache = marks[0]  # value of the current run
    counter = 1       # length of the current run
    for mark in marks[1:]:
        if mark == cache:
            counter += 1
            if counter == number_of_appearances:
                return cache
        else:
            counter = 1
            cache = mark
CodePudding user response:
You could use diff, ne, and cumsum to identify groups of consecutive marks. Then index the marks that appear exactly 3 times consecutively (replace eq(3) with ge(3) to match "at least three"):
groups = df['marks'].diff().ne(0).cumsum()
out = df.loc[groups.isin(groups.value_counts().eq(3).pipe(lambda x: x[x].index)), 'marks'].unique()
Output:
[83]
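Since the question asks for "at least thrice", here is a minimal sketch of the same run-labelling idea with a configurable minimum length. The helper name marks_with_min_run and its parameter n are made up for illustration, and shift/ne is used instead of diff so the comparison does not depend on the marks being numeric.

```python
import pandas as pd

def marks_with_min_run(marks: pd.Series, n: int = 3) -> list:
    """Return the distinct values that occur in a run of at least n equal items."""
    # A new run starts wherever the value differs from the previous one.
    run_id = marks.ne(marks.shift()).cumsum()
    # Map every row to the length of the run it belongs to.
    run_len = run_id.map(run_id.value_counts())
    # Keep rows in long-enough runs, then deduplicate in order of appearance.
    return marks[run_len.ge(n)].unique().tolist()

df = pd.DataFrame({
    "student_id": [101, 102, 103, 104, 105, 106, 107, 108],
    "marks": [83, 79, 83, 83, 83, 79, 79, 83],
})
print(marks_with_min_run(df["marks"]))  # [83]
```

Calling it with n=2 on the same data would also pick up 79, since 79 appears twice in a row.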
CodePudding user response:
Yes, you can use itertools.groupby():
from itertools import groupby
result = [key for key, group in groupby(marks) if len(list(group)) >= 3]
print(result)
This gives a list of all the elements that appear at least three times in a row:
[83]
If you know that exactly one such group exists, you can use list unpacking to extract the single element:
[result] = [key for key, group in groupby(marks) if len(list(group)) >= 3]
result is then:
83
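If only the first qualifying mark is needed, the groupby approach can stop early, much like the loop in the question. The following is a rough sketch under that assumption; first_repeated is a hypothetical helper name, and islice is used so a long run is never fully materialized.

```python
from itertools import groupby, islice

def first_repeated(marks, n=3):
    """Return the first value that occurs at least n times in a row, or None."""
    for key, group in groupby(marks):
        # Pull at most n items from the run instead of building the whole list.
        if len(list(islice(group, n))) == n:
            return key
    return None

marks = [83, 79, 83, 83, 83, 79, 79, 83]
print(first_repeated(marks))  # 83
```

Unlike the unpacking form, this does not raise when no run or more than one run qualifies; it returns None or the first match.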
CodePudding user response:
Another solution, using only pandas:
x = (
    df.groupby((df.marks != df.marks.shift(1)).cumsum())
    .filter(lambda x: len(x) > 2)["marks"]
    .unique()
)
print(x)
Prints:
[83]
EDIT: The expression (df.marks != df.marks.shift(1)).cumsum() creates a series of integers that labels each run of consecutive equal marks:
0 1
1 2
2 3
3 3
4 3
5 4
6 4
7 5
Name: marks, dtype: int64
We group df by these labels, keep only the groups with size > 2, and print the unique marks.
CodePudding user response:
You can use run-length encoding to obtain the run lengths and extract the relevant marks. The code below uses the pdrle package for run-length encoding.
import pdrle
rle = pdrle.encode(df.marks)
rle.vals.loc[rle.runs.eq(3)]
# marks
# 2 83
# Name: vals, dtype: int64
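For readers who would rather not add a dependency, run-length encoding can be sketched in a few lines of NumPy. This is not how pdrle is implemented; run_length_encode is an illustrative helper, and the filter uses >= 3 (rather than exactly 3) to match the "at least thrice" wording of the question.

```python
import numpy as np

def run_length_encode(values):
    """Return (run_values, run_lengths) for a 1-D sequence."""
    arr = np.asarray(values)
    if arr.size == 0:
        return arr, np.array([], dtype=int)
    # Positions where a new run starts: index 0 plus every change point.
    starts = np.flatnonzero(np.r_[True, arr[1:] != arr[:-1]])
    # Run lengths are the gaps between consecutive starts (and the end).
    lengths = np.diff(np.r_[starts, arr.size])
    return arr[starts], lengths

vals, runs = run_length_encode([83, 79, 83, 83, 83, 79, 79, 83])
print(vals[runs >= 3])  # [83]
```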
CodePudding user response:
Iterate over the list three at a time, and if all three items are equal, save one of them.
>>> marks = [83, 79, 83, 83, 83, 79, 79, 83]
>>> for (a,b,c) in zip(marks,marks[1:],marks[2:]):
...     if a==b==c: print(a)
...
83
>>>
[a for (a,b,c) in zip(marks,marks[1:],marks[2:]) if a==b==c]
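One thing to watch with the three-at-a-time approach: a run longer than three (say four equal marks) produces the same value more than once. A possible generalization, sketched here with a hypothetical helper repeated_runs, takes the window size as a parameter and reports each value only once.

```python
def repeated_runs(marks, n=3):
    """Return each value that fills some window of n consecutive items, once."""
    seen = []
    # zip over n shifted copies of the list yields every window of length n.
    for window in zip(*(marks[i:] for i in range(n))):
        if len(set(window)) == 1 and window[0] not in seen:
            seen.append(window[0])
    return seen

print(repeated_runs([83, 79, 83, 83, 83, 79, 79, 83]))  # [83]
print(repeated_runs([5, 5, 5, 5]))                      # [5]
```

The slices copy the list n times, so for very large inputs the counter-based loop or groupby may be the lighter choice.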