Efficient function to detect a list of thrice equal elements

Time:04-14

I am looking for an efficient function to find the marks that appear at least three times in a row, without interruption.

Input example:

import pandas as pd
marks = [83, 79, 83, 83, 83, 79, 79, 83]
student_id = [101, 102, 103, 104, 105, 106, 107, 108]
d = {'student_id': student_id, 'marks': marks}
df = pd.DataFrame(d)

Desired output:

83

If it's possible, I am looking for something more efficient than looping row by row with a for loop that keeps track of the previous 2 marks. That is, I'm looking for something better than the following:

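def thrice_f(marks, number_of_appearances):
    cache = marks[0]  # mark seen on the previous iteration
    counter = 1       # length of the current run
    for mark in marks[1:]:
        if mark == cache:
            counter += 1
            if counter == number_of_appearances:
                return cache
        else:
            counter = 1
        cache = mark
    return None  # no mark repeats that many times in a row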

CodePudding user response:

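You could use diff, ne and cumsum to label runs of consecutive equal marks, then select the marks whose run is exactly 3 rows long (replace eq(3) with ge(3) if longer runs should count too, as "at least thrice" in the question suggests):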

groups = df['marks'].diff().ne(0).cumsum()
out = df.loc[groups.isin(groups.value_counts().eq(3).pipe(lambda x: x[x].index)), 'marks'].unique()

Output:

[83]
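
To make the intermediate steps visible, here is a minimal sketch of the same idea split into named pieces. The names run_id, run_len and min_run are illustrative and not part of the answer above, and the filter uses ge so that runs longer than three also qualify:

```python
import pandas as pd

marks = [83, 79, 83, 83, 83, 79, 79, 83]
df = pd.DataFrame({"marks": marks})

# A new run starts wherever the value differs from the previous row.
# diff() is NaN on the first row, and NaN != 0, so row 0 also starts a run.
run_id = df["marks"].diff().ne(0).cumsum()

# Length of the run each row belongs to, aligned with the original index.
run_len = run_id.map(run_id.value_counts())

# Keep marks whose run is at least `min_run` rows long (hypothetical parameter).
min_run = 3
at_least = df.loc[run_len.ge(min_run), "marks"].unique()
print(at_least.tolist())  # [83]
```

Note that diff() only works for numeric marks; for strings or other types, comparing against shift() is the usual substitute.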

CodePudding user response:

Yes, you can use itertools.groupby():

from itertools import groupby
result = [key for key, group in groupby(marks) if len(list(group)) >= 3]
print(result)

This will give a list of all the elements that appear at least three times in a row:

[83]

If you know exactly one such group exists, you can use list unpacking to extract the single element (it raises a ValueError otherwise):

[result] = [key for key, group in groupby(marks) if len(list(group)) >= 3]

result is then:

83
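
One caveat with the comprehension above is that a mark with two separate qualifying runs is reported twice. A possible variant, sketched here with a hypothetical helper name runs_at_least, counts each run lazily and keeps only the first occurrence of each mark:

```python
from itertools import groupby


def runs_at_least(values, n):
    """Return each value that occurs at least n times in a row, in order of first appearance."""
    found = []
    for key, group in groupby(values):
        # Count the run lazily instead of building a list for it.
        if sum(1 for _ in group) >= n and key not in found:
            found.append(key)
    return found


marks = [83, 79, 83, 83, 83, 79, 79, 83]
print(runs_at_least(marks, 3))  # [83]
print(runs_at_least(marks, 2))  # [83, 79]
```

The `key not in found` check is linear in the number of results, which is fine for a handful of distinct marks; a set would be the natural swap if many values qualify.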

CodePudding user response:

Another solution, using only pandas:

x = (
    df.groupby((df.marks != df.marks.shift(1)).cumsum())
    .filter(lambda x: len(x) > 2)["marks"]
    .unique()
)
print(x)

Prints:

[83]

EDIT: The expression (df.marks != df.marks.shift(1)).cumsum() creates a Series of integers that labels each run of consecutive equal marks:

0    1
1    2
2    3
3    3
4    3
5    4
6    4
7    5
Name: marks, dtype: int64

We group the df by these labels, keep only the groups with more than 2 rows, and print the unique marks.
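
For readers who want to see where those labels come from, the following plain-Python sketch reproduces the same labelling without pandas. It is only an illustration of the shift/compare/cumsum idea, and the names starts, labels and sizes are made up for this example:

```python
from collections import Counter
from itertools import accumulate

marks = [83, 79, 83, 83, 83, 79, 79, 83]

# True where a value differs from its predecessor; the first item always starts a run.
starts = [True] + [cur != prev for prev, cur in zip(marks, marks[1:])]

# A running sum of the "run starts here" flags gives one integer label per run.
labels = list(accumulate(int(s) for s in starts))
print(labels)  # [1, 2, 3, 3, 3, 4, 4, 5]

# Group size per label, then keep marks whose group has more than 2 members.
sizes = Counter(labels)
result = sorted({m for m, lab in zip(marks, labels) if sizes[lab] > 2})
print(result)  # [83]
```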

CodePudding user response:

You can use run-length encoding to obtain the length of each run and extract the relevant marks. The code below uses the pdrle package for the encoding.

import pdrle


rle = pdrle.encode(df.marks)
rle.vals.loc[rle.runs.eq(3)]
# marks
# 2    83
# Name: vals, dtype: int64
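
If installing a package is not an option, run-length encoding itself is short to write by hand. The sketch below is a plain-Python stand-in, not pdrle's implementation, and rle_encode is a hypothetical helper name:

```python
def rle_encode(values):
    """Run-length encode a sequence into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1      # same value as the current run: extend it
        else:
            runs.append([v, 1])   # different value: start a new run
    return runs


marks = [83, 79, 83, 83, 83, 79, 79, 83]
runs = rle_encode(marks)
print(runs)  # [[83, 1], [79, 1], [83, 3], [79, 2], [83, 1]]
print([v for v, length in runs if length >= 3])  # [83]
```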

CodePudding user response:

Iterate over the list three at a time, and if all three items are equal, save one of them.

>>> marks = [83, 79, 83, 83, 83, 79, 79, 83]
>>> for (a,b,c) in zip(marks,marks[1:],marks[2:]):
...     if a==b==c: print(a)
... 
83
>>>

[a for (a,b,c) in zip(marks,marks[1:],marks[2:]) if a==b==c]
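
With this approach a run of four equal marks matches two overlapping windows, so the same mark is emitted twice. A possible generalisation to windows of any length n, with duplicates removed, might look like the sketch below (window_hits is a hypothetical name, and the values are assumed to be hashable and held in a list):

```python
def window_hits(values, n=3):
    """Values that fill some window of n equal consecutive items, without duplicates."""
    # Zipping n staggered slices yields every window of length n.
    windows = zip(*(values[i:] for i in range(n)))
    hits = (w[0] for w in windows if len(set(w)) == 1)
    # dict.fromkeys drops repeats while preserving first-seen order.
    return list(dict.fromkeys(hits))


print(window_hits([83, 79, 83, 83, 83, 83, 79, 79, 83]))  # [83]
print(window_hits([1, 1, 2, 2, 2, 1, 1, 1, 1], n=3))      # [2, 1]
```

The slices copy the list n times, so for long inputs and large n the groupby or run-length approaches are likely the cheaper choice.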