find max duplicate values in a list

Time:08-28

I have a file containing multiple lines, each consisting of a student code followed by that student's answers, e.g.:

N00000047,B,,D,C,C,B,D,D,C,C,D,,A,B,D,C,,D,A,C,,D,B,D,C
N00000048,B,A,D,D,C,B,,D,C,C,D,B,A,B,A,D,B,D,A,C,A,A,B,D,D
N00000049,A,,D,D,C,B,D,,C,C,D,B,,B,A,C,C,D,A,C,A,A,B,D,D
N00000050,,C,,D,,D,D,A,C,A,A,B,A,B,A,D,B,D,A,C,D,A,B,D,D
N00000051,B,A,B,,C,B,D,A,C,C,D,D,A,B,A,C,B,C,A,,A,A,B,D,B
N00000052,B,A,D,D,,B,D,A,D,,D,B,A,B,A,C,B,C,A,C,A,A,B,D,D
N00000053,B,A,D,D,C,B,D,A,C,C,D,B,B,B,C,C,B,D,A,C,A,C,A,D,D

Now I have to find which questions were skipped by the most students and report, for each one, the question number, how many students skipped it, and what fraction of the students skipped it.

I split each line, looped over the answers, and added the number of every skipped question to a list, but then got stuck finding the values with the most duplicates in that list (there can be more than one result). This is the expected output: Question that most people answer incorrectly: 10 - 4 - 0.20, 14 - 4 - 0.20, 16 - 4 - 0.20, 19 - 4 - 0.20, 22 - 4 - 0.20. The format is a - b - c, where a is the question number, b is how many students skipped it, and c is that count as a fraction of the total number of students in the class.

Edited: I put all the skipped questions in a list and tried to find which question has the most duplicates, like this:

def find_max_count(list):
    item_with_max_count = []
    max_count = 0
    for item in list:
        item_count = list.count(item)
        if  item_count > max_count:
            max_count = list.count
    for item1 in list:
        if list.count(item1) == max_count:
            item_with_max_count.append(item1)
    return item_with_max_count

However, this raises an error: TypeError: '>' not supported between instances of 'int' and 'builtin_function_or_method'
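The TypeError comes from the line `max_count = list.count`, which stores the method itself rather than the number just computed, so the next comparison is between an int and a method. A minimal corrected sketch follows; the parameter is renamed to `items` here (my choice, so the built-in `list` is not shadowed), and each tied value is reported once instead of once per occurrence:

```python
def find_max_count(items):
    """Return every distinct value that occurs most often in items."""
    item_with_max_count = []
    max_count = 0
    for item in items:
        item_count = items.count(item)
        if item_count > max_count:
            max_count = item_count  # store the number, not the method
    for item in items:
        # Skip values already recorded so each tied value is listed once.
        if items.count(item) == max_count and item not in item_with_max_count:
            item_with_max_count.append(item)
    return item_with_max_count


# Skipped question numbers collected from the seven sample lines above.
print(find_max_count([2, 12, 17, 21, 7, 2, 8, 13, 1, 3, 5, 4, 20, 5, 10]))  # [2, 5]
```

Calling `count` inside a loop rescans the list for every element, which is fine for a class-sized file but grows quadratically with the length of the list.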

CodePudding user response:

Start by accumulating a dictionary of all responses to each question and a list of all skipped answers:

from collections import defaultdict

# `data` is assumed to hold the file's contents as a single string
responses = defaultdict(list) # all responses to a given question
skipped = []                  # all skipped answers
for record in data.splitlines():
    student_id, *answers = record.split(',')
    for question_number, answer in enumerate(answers, start=1):
        responses[question_number].append(answer)
        if answer == '':
            skipped.append(question_number)

Next perform the analysis:

from statistics import multimode

print('Most skipped questions:', multimode(skipped))
print('Answers for questions with two or more skips')
for question, answers in responses.items():
    if answers.count('') >= 2:
        print(f'Question {question}: {answers}')

This outputs:

Most skipped questions: [2, 5]
Answers for questions with two or more skips
Question 2: ['', 'A', '', 'C', 'A', 'A', 'A']
Question 5: ['C', 'C', 'C', '', 'C', '', 'C']

I'm not certain this is exactly what you wanted (a target output wasn't shown), but this should get you started with the key techniques for the analysis. In particular, the multimode function is very helpful for identifying the most frequent occurrences, including ties for first place. defaultdict is also useful for transposing the data from answers by student to answers by question.
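To get closer to the a - b - c format described in the question, the skip counts can be tallied and divided by the number of students. The following is only a sketch under assumptions: `most_skipped` is a name made up for illustration, `collections.Counter` is swapped in for `multimode` because the counts themselves are needed, and `data` holds just the seven sample lines rather than the full file, so the numbers differ from the asker's expected output.

```python
from collections import Counter

# Sample records copied from the question; a real run would read the whole file.
data = """\
N00000047,B,,D,C,C,B,D,D,C,C,D,,A,B,D,C,,D,A,C,,D,B,D,C
N00000048,B,A,D,D,C,B,,D,C,C,D,B,A,B,A,D,B,D,A,C,A,A,B,D,D
N00000049,A,,D,D,C,B,D,,C,C,D,B,,B,A,C,C,D,A,C,A,A,B,D,D
N00000050,,C,,D,,D,D,A,C,A,A,B,A,B,A,D,B,D,A,C,D,A,B,D,D
N00000051,B,A,B,,C,B,D,A,C,C,D,D,A,B,A,C,B,C,A,,A,A,B,D,B
N00000052,B,A,D,D,,B,D,A,D,,D,B,A,B,A,C,B,C,A,C,A,A,B,D,D
N00000053,B,A,D,D,C,B,D,A,C,C,D,B,B,B,C,C,B,D,A,C,A,C,A,D,D
"""


def most_skipped(data):
    """Return (question, skip_count, share_of_students) for the most-skipped questions."""
    records = [line.split(',') for line in data.splitlines() if line.strip()]
    # Count how many students left each question blank.
    skip_counts = Counter(
        question_number
        for _student_id, *answers in records
        for question_number, answer in enumerate(answers, start=1)
        if answer == ''
    )
    if not skip_counts:
        return []
    top = max(skip_counts.values())
    # Keep every question tied for the highest count, in question order.
    return [(q, n, n / len(records)) for q, n in sorted(skip_counts.items()) if n == top]


print(', '.join(f'{q} - {n} - {share:.2f}' for q, n, share in most_skipped(data)))
# 2 - 2 - 0.29, 5 - 2 - 0.29
```

With the seven sample students, questions 2 and 5 are each skipped twice, so the share is 2/7, which rounds to 0.29.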

CodePudding user response:

Do you want to sort the values?

A='N00000050,C,D,D,D,A,C,A,A,B,A,B,A,D,B,D,A,C,D,A,B,D,D'

B = A.split(',')  # output: ['N00000050', 'C', 'D', 'D', 'D', 'A', 'C', 'A', 'A', 'B', 'A', 'B', 'A', 'D', 'B', 'D', 'A', 'C', 'D', 'A', 'B', 'D', 'D']

SortValue = sorted(B)  # output: ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'N00000050']

SortValueDec = sorted(B, reverse=True)  # output: ['N00000050', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'C', 'C', 'C', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'A', 'A']
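Sorting on its own only places equal values next to each other; it does not say which value repeats most. As a rough sketch of one way to go from the sorted list to counts, `itertools.groupby` can measure each run of equal values. The names `counts`, `max_count` and `most_common` are mine, and the student ID is split off first so it is not counted as an answer:

```python
from itertools import groupby

A = 'N00000050,C,D,D,D,A,C,A,A,B,A,B,A,D,B,D,A,C,D,A,B,D,D'
student_id, *answers = A.split(',')

# groupby only merges adjacent equal items, so the list must be sorted first.
counts = {value: len(list(group)) for value, group in groupby(sorted(answers))}
max_count = max(counts.values())
most_common = [value for value, count in counts.items() if count == max_count]

print(counts)       # {'A': 7, 'B': 4, 'C': 3, 'D': 8}
print(most_common)  # ['D']
```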