Efficient function to detect a list of thrice equal elements

I am looking for an efficient function to find the marks which appear at least thrice one after another without interruption.

Input example:

import pandas as pd
marks = [83, 79, 83, 83, 83, 79, 79, 83]
student_id = [101, 102, 103, 104, 105, 106, 107, 108]
d = {'student_id': student_id, 'marks': marks}
df = pd.DataFrame(d)

Desired output:

If it's possible, I am looking for something more efficient than looping row by row with a for loop that keeps track of the previous 2 marks. That is, I'm looking for something better than the following:

def thrice_f(marks, number_of_appearances):
    cache = marks[0]  # previous mark
    counter = 1       # length of the current run
    for mark in marks[1:]:
        if mark == cache:
            counter += 1
            if counter == number_of_appearances:
                return cache
        else:
            counter = 1  # run broken, start counting again
        cache = mark

CodePudding user response：

You could use diff, ne and cumsum to label groups of consecutive equal marks. Then select the marks whose group has at least 3 rows:

groups = df['marks'].diff().ne(0).cumsum()
out = df.loc[groups.isin(groups.value_counts().ge(3).pipe(lambda x: x[x].index)), 'marks'].unique()

Output:

[83]

CodePudding user response：

Yes, you can use itertools.groupby():

from itertools import groupby
result = [key for key, group in groupby(marks) if len(list(group)) >= 3]
print(result)

This gives a list of every element that appears at least three times in a row:

[83]
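
If early exit matters, as it does in the loop from the question, the same idea can be written as a generator. This is only a sketch: the name runs_of_at_least and the default n=3 are illustrative, not part of the answer above. It uses islice so that no more than n items of a run are counted, and next() stops at the first match.

```python
from itertools import groupby, islice

def runs_of_at_least(marks, n=3):
    """Yield the key of each run of n or more equal consecutive values."""
    for key, group in groupby(marks):
        # Count at most n items; we only need to know whether n exist.
        if sum(1 for _ in islice(group, n)) == n:
            yield key

marks = [83, 79, 83, 83, 83, 79, 79, 83]
print(list(runs_of_at_least(marks)))        # every qualifying run
print(next(runs_of_at_least(marks), None))  # first match only, or None
```

A mark with several long runs is yielded once per run, just as in the list comprehension.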

If you know only one such group exists, you can use list unpacking to extract the single element:

n = 3
[result] = [key for key, group in groupby(marks) if len(list(group)) >= n]
print(result)

This outputs:

83

CodePudding user response：

Another solution, using only pandas:

x = (
    df.groupby((df.marks != df.marks.shift(1)).cumsum())
    .filter(lambda x: len(x) > 2)["marks"]
    .unique()
)
print(x)

Prints:

[83]

EDIT: The expression (df.marks != df.marks.shift(1)).cumsum() creates a series of integers that labels each run of consecutive equal marks:

0    1
1    2
2    3
3    3
4    3
5    4
6    4
7    5
Name: marks, dtype: int64

We group the DataFrame by these labels, keep only the groups with more than 2 rows, and print the unique marks.
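
Since the question is about efficiency, here is a variant of the same labelling idea that avoids calling a Python lambda once per group. It is a sketch rather than a benchmarked result: it assumes transform("size") is acceptable in place of filter, and the names run_id and run_len are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "student_id": [101, 102, 103, 104, 105, 106, 107, 108],
    "marks": [83, 79, 83, 83, 83, 79, 79, 83],
})

# Label each run of equal consecutive marks: 1, 2, 3, 3, 3, 4, 4, 5
run_id = df["marks"].ne(df["marks"].shift()).cumsum()

# Length of the run each row belongs to, aligned with df's index
run_len = df.groupby(run_id)["marks"].transform("size")

# Keep marks from runs of length 3 or more
result = df.loc[run_len >= 3, "marks"].unique().tolist()
print(result)
```

Because run_len is aligned with the original index, the same boolean mask can also be used to pull out the matching student_id values.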

CodePudding user response：

You can use run-length encoding to obtain the length of each run and then extract the relevant marks. The code below uses the pdrle package for the encoding.

import pdrle


rle = pdrle.encode(df.marks)
rle.vals.loc[rle.runs.ge(3)]  # runs of length 3 or more
# marks
# 2    83
# Name: vals, dtype: int64
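
If installing an extra package is not an option, run-length encoding can be sketched with NumPy alone. This is not how pdrle is implemented; run_length_encode is a hypothetical helper written for illustration, and it assumes a 1-D input.

```python
import numpy as np

def run_length_encode(values):
    """Return (run_values, run_lengths) for a 1-D sequence."""
    arr = np.asarray(values)
    if arr.size == 0:
        return arr, np.array([], dtype=int)
    # A run starts at index 0 and wherever a value differs from its predecessor.
    starts = np.flatnonzero(np.r_[True, arr[1:] != arr[:-1]])
    # Run lengths are the gaps between successive starts (and the end).
    lengths = np.diff(np.r_[starts, arr.size])
    return arr[starts], lengths

marks = [83, 79, 83, 83, 83, 79, 79, 83]
vals, runs = run_length_encode(marks)
long_runs = np.unique(vals[runs >= 3]).tolist()
print(long_runs)
```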

CodePudding user response：

Iterate over the list three at a time, and if all three items are equal, save one of them. Note that a run longer than three is reported once per matching window.

>>> marks = [83, 79, 83, 83, 83, 79, 79, 83]
>>> for (a,b,c) in zip(marks,marks[1:],marks[2:]):
...     if a==b==c: print(a)
... 
83
>>>

The same as a list comprehension:

[a for (a,b,c) in zip(marks,marks[1:],marks[2:]) if a==b==c]
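
The zip approach reports a run of four equal marks twice, because two overlapping windows match. A possible generalisation to windows of any size n that also removes the duplicates is sketched below. The function name marks_in_windows is illustrative, and the input is assumed to be a list or other re-iterable sequence.

```python
from itertools import islice

def marks_in_windows(marks, n=3):
    """Marks that fill at least one window of n consecutive positions."""
    # n staggered views of the list: marks[0:], marks[1:], ..., marks[n-1:]
    views = [islice(marks, i, None) for i in range(n)]
    seen = {}
    for window in zip(*views):
        if all(x == window[0] for x in window):
            seen.setdefault(window[0], None)  # dict keeps insertion order
    return list(seen)

print(marks_in_windows([83, 79, 83, 83, 83, 83, 79, 79, 79]))
```

With this input the plain zip version would give [83, 83, 79], while the sketch returns each mark once, in order of first appearance.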