Similar time submissions for tests

I have a dataframe for each of my online students shaped like this:

df=pd.DataFrame(columns=["Class","Student","Challenge","Time"])

I create a temporary df for each of the challenges like this:

for ch in challenges:
    df = main_df.loc[main_df['Challenge'] == ch]

I want to group the records that are within 5 minutes of each other. The idea is to create a df that shows groups of students working together and answering the questions at roughly the same time. I have other ways of catching cheaters, but I need to be able to show that two or more students have been answering the questions at roughly the same time.

I was thinking of using resampling or a grouper method, but I'm a bit confused about how to implement it. Could someone point me in the right direction?

Sample df:

df = pd.DataFrame(
    data={
        "Class": ["A"] * 4 + ["B"] * 2,
        "Student": ['Scooby','Daphne','Shaggy','Fred','Velma','Scrappy'],
        "Challenge": ["Challenge3"] *6,
        "Time": ['07/10/2022 08:22:44','07/10/2022 08:27:22','07/10/2022 08:27:44','07/10/2022 08:29:55','07/10/2022 08:33:14','07/10/2022 08:48:44'],
    }   
)

EDIT:

For the output I was thinking of having another df with an extra column called 'Grouping' that holds an incremental number for each group that was discovered. The final df would start out empty and then be appended to or concatenated with the grouping df.

new_df = pd.DataFrame(columns=['Class','Student','Challenge','Time','Grouping'])

new_df = pd.concat([new_df,df])

  Class  Student   Challenge                 Time  Grouping
1     A   Daphne  Challenge3  07/10/2022 08:27:22         1
2     A   Shaggy  Challenge3  07/10/2022 08:27:44         1
3     A     Fred  Challenge3  07/10/2022 08:29:55         1

The purpose of this is so that I may have multiple samples to verify if the same two or more students are sharing answers.

EDIT 2:

I was thinking that I could do a lambda-based operation. What if I created another column called "Challenge_Delta" that calculates the difference in time from one answer to the next? Then I need to figure out how to group those deltas.

df['Time'] = pd.to_datetime(df['Time'])  # the sample times are strings, so convert before subtracting
df['Challenge_Delta'] = df['Time'].diff()
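A minimal sketch of that delta idea, assuming submissions are sorted by time and chained into one group whenever the gap to the previous submission is at most 5 minutes. The sort step and the threshold handling are assumptions, not something the question specifies. Chaining can also put two students in the same group even when the first and last submissions are more than 5 minutes apart.

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["Scooby", "Daphne", "Shaggy", "Fred", "Velma", "Scrappy"],
    "Time": pd.to_datetime([
        "07/10/2022 08:22:44", "07/10/2022 08:27:22", "07/10/2022 08:27:44",
        "07/10/2022 08:29:55", "07/10/2022 08:33:14", "07/10/2022 08:48:44",
    ]),
})

# Deltas only make sense between neighbours, so sort by time first.
df = df.sort_values("Time").reset_index(drop=True)
df["Challenge_Delta"] = df["Time"].diff()

# A new group starts whenever the gap to the previous row exceeds 5 minutes.
# The first delta is NaT, which compares as False, so row 0 opens group 1.
starts_new_group = df["Challenge_Delta"] > pd.Timedelta("5min")
df["Grouping"] = starts_new_group.cumsum() + 1

print(df)
```

On the sample data this chains Scooby through Velma into one group, because each neighbouring gap is under 5 minutes, and leaves Scrappy alone in a second group.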

CodePudding user response:

This solution implements a rolling time list. The logic is this: keep adding entries as long as all the entries in the list are within the time window. When some of the entries fall outside the time window relative to the newest entry (the one about to be added), that indicates you have a distinct (possible) cheater group. Write out that group as a list, remove the out-of-window entries, and add the newest entry. At the end there may be a set of entries left on the list; dump_last() allows that set to be pulled. Once you have these lists, there are all sorts of cheater-detecting analytics you could perform.

Make sure the Time column is datetime

df['Time'] = df['Time'].astype('datetime64[ns]')

The Rolling Time List class definition

class RollingTimeList:
    def __init__(self):
        self.cur = pd.DataFrame(columns=['Student','Time'])
        self.window = pd.Timedelta('5min')

    def __add(self, row):
        # index.max() is NaN for an empty frame; NaN != NaN, so restart at 0
        idx = self.cur.index.max()
        new_idx = idx + 1 if idx == idx else 0
        self.cur.loc[new_idx] = row[['Student','Time']]

    def handle_row(self, row):
        rc = None

        if len(self.cur) > 0:
            window_mask = (row['Time'] - self.cur['Time']).abs() <= self.window

            if not window_mask.all():
                if len(self.cur) > 1:
                    rc = self.cur['Student'].to_list()
                self.cur = self.cur.loc[window_mask]

        self.__add(row)

        return rc

    def dump_last(self):
        rc = None

        if len(self.cur) > 1:
            rc = self.cur['Student'].to_list()

        self.cur = self.cur[0:0]
        return rc

Instantiate and apply the class

rolling_list = RollingTimeList()

s = df.apply(rolling_list.handle_row, axis=1)

idx = s.index.max()
s.loc[idx + 1 if idx == idx else 0] = rolling_list.dump_last()

print(s.dropna())

Result

3    [Scooby, Daphne, Shaggy]
4      [Daphne, Shaggy, Fred]
5               [Fred, Velma]

Merge that back into the original data frame

df['window_groups'] = s
df['window_groups'] = df['window_groups'].shift(-1).fillna('')

print(df)

Result

  Class  Student   Challenge                Time             window_groups
0     A   Scooby  Challenge3 2022-07-10 08:22:44
1     A   Daphne  Challenge3 2022-07-10 08:27:22
2     A   Shaggy  Challenge3 2022-07-10 08:27:44  [Scooby, Daphne, Shaggy]
3     A     Fred  Challenge3 2022-07-10 08:29:55    [Daphne, Shaggy, Fred]
4     B    Velma  Challenge3 2022-07-10 08:33:14             [Fred, Velma]
5     B  Scrappy  Challenge3 2022-07-10 08:48:44
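If the long format with a numbered Grouping column from the question is preferred, the lists can be exploded into one row per student per group. This is only a sketch: `s` is rebuilt here from the printed result so the block runs on its own, `new_df` mirrors the name used in the question, and the merge assumes each student appears once per challenge.

```python
import pandas as pd

df = pd.DataFrame({
    "Class": ["A"] * 4 + ["B"] * 2,
    "Student": ["Scooby", "Daphne", "Shaggy", "Fred", "Velma", "Scrappy"],
    "Challenge": ["Challenge3"] * 6,
    "Time": pd.to_datetime([
        "07/10/2022 08:22:44", "07/10/2022 08:27:22", "07/10/2022 08:27:44",
        "07/10/2022 08:29:55", "07/10/2022 08:33:14", "07/10/2022 08:48:44",
    ]),
})

# Stand-in for s.dropna(), copied from the printed result above.
s = pd.Series({
    3: ["Scooby", "Daphne", "Shaggy"],
    4: ["Daphne", "Shaggy", "Fred"],
    5: ["Fred", "Velma"],
})

# One row per (group, student); the group number is the position of the list.
long = s.dropna().reset_index(drop=True).explode()
groups = pd.DataFrame({
    "Student": long.to_numpy(),
    "Grouping": long.index.to_numpy() + 1,
})

# Pull Class, Challenge and Time back in from the original frame.
new_df = groups.merge(df, on="Student", how="left")
new_df = new_df[["Class", "Student", "Challenge", "Time", "Grouping"]]

print(new_df)
```

Because the windows overlap, a student can appear under more than one Grouping number.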

Visualization idea: cross reference table

dfd = pd.get_dummies(s.dropna().apply(pd.Series).stack()).groupby(level=0).sum()
xrf = dfd.T.dot(dfd)
print(xrf)

Result

        Daphne  Fred  Scooby  Shaggy  Velma
Daphne       2     1       1       2      0
Fred         1     2       0       1      1
Scooby       1     0       1       1      0
Shaggy       2     1       1       2      0
Velma        0     1       0       0      1

Or as a Series of unique student combinations and counts

combos = xrf.stack()
unique_combo_mask = combos.index.get_level_values(0)<combos.index.get_level_values(1)
print(combos[unique_combo_mask])

Result

Daphne  Fred      1
        Scooby    1
        Shaggy    2
        Velma     0
Fred    Scooby    0
        Shaggy    1
        Velma     1
Scooby  Shaggy    1
        Velma     0
Shaggy  Velma     0

Also from here, you could decide that, within some time constraint (or other constraint), the count really should not be more than one, because the lists overlap. In this example, Daphne and Shaggy probably should not be counted twice: they didn't really come within 5 minutes of each other on two separate occasions. In that case, anything greater than one could be set to one before being accumulated into a larger pool.
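A rough sketch of that capping step, assuming one list of window groups per challenge. The helper name `pair_counts` and the second challenge's groups are hypothetical, made up only to show the accumulation; the Challenge3 groups are the ones printed earlier.

```python
import pandas as pd

def pair_counts(groups):
    """Co-occurrence matrix for one challenge, capped at one per pair."""
    # One dummy row per (group, student), summed back to one row per group.
    dfd = pd.get_dummies(pd.Series(groups).explode(), dtype=int).groupby(level=0).sum()
    xrf = dfd.T.dot(dfd)
    # Overlapping windows can count the same pair twice; cap at one.
    return xrf.clip(upper=1)

per_challenge = {
    "Challenge3": [
        ["Scooby", "Daphne", "Shaggy"],
        ["Daphne", "Shaggy", "Fred"],
        ["Fred", "Velma"],
    ],
    # Hypothetical groups for a second challenge, for illustration only.
    "Challenge4": [["Daphne", "Shaggy"], ["Velma", "Scrappy"]],
}

frames = [pair_counts(groups) for groups in per_challenge.values()]

# Align every matrix on the full student list before adding them up.
students = sorted(set().union(*(frame.index for frame in frames)))
total = sum(
    frame.reindex(index=students, columns=students, fill_value=0)
    for frame in frames
)

print(total)
```

With the cap in place, each cell of `total` reads as the number of challenges in which the pair landed in a shared window, rather than the number of overlapping lists.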