I have a file with n columns. The first n-1 columns hold the values of the n-1 attributes, and the n-th column holds the class for each row of the dataset. I want to read the dataset and print a single line of output: n-1 comma-separated * characters, followed in the n-th column by the class with the maximum frequency. For example, suppose I have a file dataset1.data which contains:
12,13,14,44,0
11,11,10,34,0
22,54,98,11,2
34,90,78,90,1
44,34,34,33,1
22,54,98,11,0
34,90,78,90,2
44,34,34,33,1
22,54,98,11,2
34,90,78,90,2
44,34,34,33,2
For the above case the output will be: *,*,*,*,2 because class 2 has the highest frequency.
In case of a tie for the highest frequency, the minimum class value should be used. For example:
12,13,14,44,0
11,11,10,34,0
22,54,98,11,2
34,90,78,90,1
44,34,34,33,1
22,54,98,11,0
34,90,78,90,2
44,34,34,33,1
22,54,98,11,2
In this case the output will be: *,*,*,*,0 because all the classes have the same frequency and 0 is the smallest class value.
How can I do this? Can anyone help, please?
CodePudding user response:
You could use collections.Counter:
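from collections import Counter

cls_counts = Counter()
with open('dataset1.data') as f:
    for line in f:
        row = list(map(int, line.strip().split(',')))
        attrs, cls = row[:-1], row[-1]
        # Increment (rather than overwrite) the count for this row's class
        cls_counts[cls] += 1

# Highest frequency, then every class that reaches it
max_cls_val = max(cls_counts.values())
max_cls_keys = [cls for cls, count in cls_counts.items() if count == max_cls_val]
# One '*' per attribute column; ties go to the smallest class value
print(f"{'*,' * len(attrs)}{min(max_cls_keys)}")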
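The max/filter/min steps above can also be collapsed into a single min call with a composite key. This is only a sketch of an alternative, assuming the counts are already in a Counter; the helper name most_frequent_class is hypothetical and not part of collections:

```python
from collections import Counter

def most_frequent_class(cls_counts):
    # Hypothetical helper. The key orders classes by count descending
    # (hence the negation), then by class value ascending, so min()
    # returns the most frequent class and breaks ties on the smallest value.
    return min(cls_counts, key=lambda cls: (-cls_counts[cls], cls))

# Class columns taken from the two example datasets
print(most_frequent_class(Counter([0, 0, 2, 1, 1, 0, 2, 1, 2, 2, 2])))  # 2
print(most_frequent_class(Counter([0, 0, 2, 1, 1, 0, 2, 1, 2])))        # 0
```

Whether this reads better than the explicit three-step version is a matter of taste; the result should be the same.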
Example Usage 1, Unique class with max count:
dataset1.data:
12,13,14,44,0
11,11,10,34,0
22,54,98,11,2
34,90,78,90,1
44,34,34,33,1
22,54,98,11,0
34,90,78,90,2
44,34,34,33,1
22,54,98,11,2
34,90,78,90,2
44,34,34,33,2
Output:
*,*,*,*,2
Example Usage 2, Multiple classes with max count:
dataset1.data:
12,13,14,44,0
11,11,10,34,0
22,54,98,11,2
34,90,78,90,1
44,34,34,33,1
22,54,98,11,0
34,90,78,90,2
44,34,34,33,1
22,54,98,11,2
Output:
*,*,*,*,0
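If you want to check both cases without creating files, the same logic can be wrapped in a function that accepts any iterable of lines. This is a rough sketch under a few assumptions: the function name summarize_classes is made up for illustration, blank lines are skipped, and an empty input raises an error rather than printing anything.

```python
import io
from collections import Counter

def summarize_classes(lines):
    # Hypothetical wrapper around the Counter approach shown earlier.
    cls_counts = Counter()
    n_attrs = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: ignore blank lines
        row = line.split(',')
        n_attrs = len(row) - 1          # every column except the class
        cls_counts[int(row[-1])] += 1   # last column is the class
    if not cls_counts:
        raise ValueError("no data rows found")
    max_count = max(cls_counts.values())
    # Among the classes with the highest count, pick the smallest value
    winner = min(c for c, n in cls_counts.items() if n == max_count)
    return ','.join(['*'] * n_attrs + [str(winner)])

# An open file works the same way as this in-memory stand-in
sample = io.StringIO("12,13,14,44,0\n11,11,10,34,0\n22,54,98,11,2\n")
print(summarize_classes(sample))  # *,*,*,*,0
```

With a real file you would pass the file object directly, for example inside a with open('dataset1.data') as f: block.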