Need to find the top 10 most used surnames in a file. Made a dictionary but need help sorting it

I made a surname dict containing surnames like this:

--The file contains 200 000 words, and this is a sample of the surname_dict--

['KRISTIANSEN', 'OLDERVIK', 'GJERSTAD', 'VESTLY SKIVIK', 'NYMANN', 'ØSTBY', 'LINNERUD', 'REMLO', 'SKARSHAUG', 'ELI', 'ADOLFSEN']

I am not allowed to use the Counter library or numpy, just native Python. My idea was to use a for loop to sort through the dictionary, but I hit some walls. Please help with some advice.

Thanks

surname_dict = []
count = 0
for index in data_list:
    if index["lastname"] not in surname_dict:
        count = count + 1
        surname_dict.append(index["lastname"])

for k, v in sorted(surname_dict.items(), key=lambda item: item[1]):
    if count < 10:  # Print only the top 10 surnames
        print(k)
        count += 1
    else:
        break

CodePudding user response：

As mentioned in a comment, your dict is actually a list.

Try using the Counter class from the collections module. In the example below, I have edited your list so that it contains a few duplicates.

from collections import Counter

surnames = ['KRISTIANSEN', 'OLDERVIK', 'GJERSTAD', 'VESTLY SKIVIK', 'NYMANN', 'ØSTBY', 'LINNERUD', 'REMLO', 'SKARSHAUG', 'ELI', 'ADOLFSEN', 'OLDERVIK', 'ØSTBY', 'ØSTBY']

counter = Counter(surnames)

for name in counter.most_common(3):
    print(name)

The result becomes:

('ØSTBY', 3)
('OLDERVIK', 2)
('KRISTIANSEN', 1)

Change the integer argument to most_common to 10 for your use case.
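Since the question rules out Counter, here is a minimal sketch of how the same ranking could be done with a plain dictionary and sorted. The function name most_common is only chosen here to mirror the Counter method; it is not part of the original post.

```python
def most_common(items, n):
    """Return the n most frequent items as (item, count) pairs."""
    counts = {}
    for item in items:
        # dict.get supplies 0 the first time an item is seen
        counts[item] = counts.get(item, 0) + 1
    # Sort by count, highest first; the sort is stable, so ties keep first-seen order
    ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


surnames = ['KRISTIANSEN', 'OLDERVIK', 'GJERSTAD', 'VESTLY SKIVIK', 'NYMANN', 'ØSTBY',
            'LINNERUD', 'REMLO', 'SKARSHAUG', 'ELI', 'ADOLFSEN', 'OLDERVIK', 'ØSTBY', 'ØSTBY']

for name in most_common(surnames, 3):
    print(name)
```

With this sample list it should print the same three tuples as the Counter version.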

CodePudding user response：

The best approach is to think in terms of the top ten categories, where a category is the set of names used the same number of times: for example, the names used 9 times form one category and the names used 200 times form another. This matters because ties are possible: 100 different surnames could all share the same count, and then all of them belong in the top ten. Here is a script that implements this approach:

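def counter(file: list):
    # Map each count to the list of names that occur that many times.
    M = {}
    for j in set(file):
        i = 0
        for k in file:
            if j == k:
                i += 1
        # Names with the same count share one category, so ties are kept.
        M.setdefault(i, []).append(j)
    # Sort the distinct counts from highest to lowest.
    D = sorted(M.keys(), reverse=True)
    # Keep at most the ten highest categories.
    F = {}
    for i in D[:10]:
        F[i] = M[i]
    return F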

Note: my script calculates the top ten categories.

CodePudding user response：

You could place all the values in a dictionary where the key is the surname and the value is the number of times it appears in the dataset, then filter that dictionary and append every name with a count of at least 10 to your final list.

Edit: your surname_dict was initialized as a list, not a dictionary.

surname_dict = {}
top_ten = []
for index in data_list:
    if index['lastname'] not in surname_dict.keys():
        surname_dict[index['lastname']] = 1
    else:
        surname_dict[index['lastname']] += 1

for k, v in sorted(surname_dict.items()):
    if v >= 10:
        top_ten.append(k)
print(top_ten)
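Note that filtering on v >= 10 keeps every name that occurs at least ten times, which is not necessarily the ten most frequent names. Below is a rough sketch of picking exactly the top n with a plain loop, in the spirit of the for-loop idea from the question. It assumes data_list is a list of dicts with a "lastname" key, as in the question; the function name top_surnames and the sample records are made up for illustration.

```python
def top_surnames(records, n=10):
    """Pick the n most frequent 'lastname' values by repeated max selection."""
    counts = {}
    for record in records:
        name = record["lastname"]
        counts[name] = counts.get(name, 0) + 1

    top = []
    remaining = dict(counts)  # copy, so the tally itself is left untouched
    for _ in range(min(n, len(remaining))):
        # max with key=remaining.get returns the name with the highest count
        best = max(remaining, key=remaining.get)
        top.append((best, remaining.pop(best)))
    return top


# Hypothetical sample records standing in for data_list
data_list = [
    {"lastname": "ØSTBY"}, {"lastname": "REMLO"}, {"lastname": "ØSTBY"},
    {"lastname": "ELI"}, {"lastname": "REMLO"}, {"lastname": "ØSTBY"},
]
print(top_surnames(data_list, 2))
```

Each pass scans the remaining names once, so this does n scans in total. That is fine for n = 10; for a full ranking, sorting once is likely simpler.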

CodePudding user response：

Just use a standard dictionary. I've added some duplicates to your data, and am using a threshold value to grab any names with at least 2 occurrences. Use threshold = 10 for your actual code.

names = ['KRISTIANSEN', 'OLDERVIK', 'GJERSTAD', 'VESTLY SKIVIK', 'NYMANN', 'ØSTBY','ØSTBY','ØSTBY','REMLO', 'LINNERUD', 'REMLO', 'SKARSHAUG', 'ELI', 'ADOLFSEN']

# you need 10 in your code, but I've only added a few dups to your sample data
threshold = 2

di = {}
for name in names:
    #grab name count, initialize to zero first time
    count = di.get(name, 0)
    di[name] = count + 1


#basic filtering, no sorting
unsorted = {name:count for name, count in di.items() if count >= threshold}
print(f"{unsorted=}")


#sorting by frequency: filter out the ones you don't want
bigenough = [(count, name) for name, count in di.items() if count >= threshold]

tops = sorted(bigenough, reverse=True)

print(f"{tops=}")

#or as another dict

tops_dict = {name:count for count, name in tops}
print(f"{tops_dict=}")

Output:

unsorted={'ØSTBY': 3, 'REMLO': 2}
tops=[(3, 'ØSTBY'), (2, 'REMLO')]
tops_dict={'ØSTBY': 3, 'REMLO': 2}
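One detail worth knowing: sorting (count, name) tuples with reverse=True also reverses the order of names that share a count. If ties should come out in ascending name order instead, a possible sketch is to sort on a key that negates the count. The counts dictionary below is made-up sample data, not taken from the question's file.

```python
counts = {'ØSTBY': 3, 'REMLO': 2, 'OLDERVIK': 2, 'ELI': 1}

# Negating the count puts high counts first; the name then breaks ties in ascending order
ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
top_two = ranked[:2]

print(ranked)
print(top_two)
```

Names are compared by Unicode code point here, which is not the same as Norwegian alphabetical order for letters such as Ø.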