Casting Counter() to dict or list

I have a problem, described in more detail in "Best way to compare multiple key values [lists] and return multiples?".

Short summary:

Now I have several lists I want to compare, filtering out values that are present in more than one list.

I want to get:

All values that are present in more than one list.
How often these values are present (so if a value is present 2 times in every list, I want to output that 2, not the total occurrences across all lists).
And, in the end, a count of the values that are in more than one list, but not in every list.

The Setup:

In a loop, I add lists of data I want to compare to a "master" list:

[
['Limerick (IRE)', 'Fairyhouse (IRE)', 'Gowran Park (IRE)', 'Galway (IRE)', 'Roscommon (IRE)', 'Ballinrobe (IRE)', 'Roscommon (IRE)', 'Downpatrick (IRE)', 'Ballinrobe (IRE)', 'Curragh (IRE)', 'Naas (IRE)', 'Curragh (IRE)', 'Galway (IRE)', 'Cork (IRE)', 'Punchestown (IRE)', 'Galway (IRE)', 'Tipperary (IRE)', 'Curragh (IRE)', 'Gowran Park (IRE)', 'Cork (IRE)', 'Galway (IRE)', 'Killarney (IRE)', 'Curragh (IRE)', 'Roscommon (IRE)', 'Limerick (IRE)', 'Newton Abbot', 'Bangor-on-Dee', 'Bangor-on-Dee'],

['Newton Abbot', 'Worcester', 'Ffos Las', 'Worcester', 'Newton Abbot', 'Hereford', 'Worcester', 'Chepstow', 'Newton Abbot', 'Bangor-on-Dee', 'Stratford', 'Ffos Las', 'Huntingdon', 'Newton Abbot', 'Bangor-on-Dee'],

['Aintree', 'Market Rasen', 'Market Rasen', 'Newcastle', 'Stratford', 'Hexham', 'Cartmel', 'Stratford', 'Cartmel', 'Cartmel','Bangor-on-Dee', 'Stratford', 'Ffos Las', 'Huntingdon', 'Newton Abbot', 'Bangor-on-Dee', 'Killarney (IRE)']
]

There can be only 2 lists, or 20, or more to compare.

Now I try to get the multiples with Counter() and extract the most common ones (pseudocode):

        from collections import Counter

        doubles = Counter()
        for w in testlist[1]:
            doubles[w] = testlist[2].count(w)
        result4 = doubles.most_common(2)
        result5 = [result4[0]]

But this does not work as intended, because it counts the occurrences of the multiples in one list. If the word/number appears once in list 1 but five times in the second list, I still want to get only 1 as output, not six or five. If it appears two times in List[1] and three times in List[2], I want to get 2 (the number/word!) as output, and so on.

The second problem: the number of lists varies. It can be 2 or it can be 20, so testlist[1] is just a placeholder. I would have to check all 20 lists (or however many there are) for occurrences of the word/number that is in every list.

I can't wrap my head around how to do this. Hopefully you can help me.

Edit

Added a third list for the example
Expected Output: Comparing these lists, I would like to get something like:

Bangor-on-Dee: 3 (because it is in all 3 lists), 2 (because it appears at least two times in each of them)
Killarney: 2 (because it is in only 2 of those lists), 1 (because it appears at least once in each of those lists)

CodePudding user response：

You can first check whether the element is in every list, then take the minimum of its counts across the lists:

arr1 = ['Newton Abbot', 'Worcester', 'Ffos Las', 'Worcester', 'Newton Abbot', 'Hereford', 'Worcester', 'Chepstow',
        'Newton Abbot', 'Bangor-on-Dee', 'Stratford', 'Ffos Las', 'Huntingdon', 'Newton Abbot', 'Bangor-on-Dee']
arr2 = ['Aintree', 'Market Rasen', 'Market Rasen', 'Newcastle', 'Stratford', 'Hexham', 'Cartmel', 'Stratford',
        'Cartmel', 'Cartmel', 'Bangor-on-Dee', 'Stratford', 'Ffos Las', 'Huntingdon', 'Newton Abbot', 'Bangor-on-Dee']

all_lists = [arr1, arr2]

occurences = {}

for x in all_lists[0]:
    # check if x is in all_lists
    x_in_all_lists = True
    counts = []
    for arr in all_lists:
        if x not in arr:
            x_in_all_lists = False
            break
        counts.append(arr.count(x))
    # get the least number of times x occurs in the lists of all_lists
    if x_in_all_lists:
        occurences[x] = min(counts)

for k, v in occurences.items():
    print(f'{k}\t{v}')

Output:

Newton Abbot    1
Ffos Las        1
Bangor-on-Dee   2
Stratford       1
Huntingdon      1
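
As a possible shortcut, assuming only the values present in every list are needed, the same minimum-count result can be sketched with Counter intersection. The & operator on two Counter objects keeps the keys they share and the smaller of the two counts. The shortened sample lists below are illustrative, not the full data from the question.

```python
from collections import Counter
from functools import reduce
from operator import and_

# Shortened, illustrative sample data; any number of lists works
arr1 = ['Newton Abbot', 'Worcester', 'Ffos Las', 'Newton Abbot', 'Bangor-on-Dee', 'Bangor-on-Dee']
arr2 = ['Stratford', 'Bangor-on-Dee', 'Ffos Las', 'Newton Abbot', 'Bangor-on-Dee']
all_lists = [arr1, arr2]

# Counter & Counter keeps only shared keys, each with the smaller count,
# so folding & over all lists yields the minimum count per common value
common = reduce(and_, (Counter(arr) for arr in all_lists))

for k, v in common.items():
    print(f'{k}\t{v}')
```

dict(common) gives a plain dict and common.most_common() gives a list of (value, count) tuples, which covers the casting asked about in the title. Note that reduce raises a TypeError if all_lists is empty.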

CodePudding user response：

Prior to any optimization, I would break this task into three steps:

count the occurrences of each key in each row
merge the counts based on key
print the results based on what we find out about the counts after merging them

import collections
import json

data = [
    ['Limerick (IRE)', 'Fairyhouse (IRE)', 'Gowran Park (IRE)', 'Galway (IRE)', 'Roscommon (IRE)', 'Ballinrobe (IRE)', 'Roscommon (IRE)', 'Downpatrick (IRE)', 'Ballinrobe (IRE)', 'Curragh (IRE)', 'Naas (IRE)', 'Curragh (IRE)', 'Galway (IRE)', 'Cork (IRE)', 'Punchestown (IRE)', 'Galway (IRE)', 'Tipperary (IRE)', 'Curragh (IRE)', 'Gowran Park (IRE)', 'Cork (IRE)', 'Galway (IRE)', 'Killarney (IRE)', 'Curragh (IRE)', 'Roscommon (IRE)', 'Limerick (IRE)', 'Newton Abbot', 'Bangor-on-Dee', 'Bangor-on-Dee'],
    ['Newton Abbot', 'Worcester', 'Ffos Las', 'Worcester', 'Newton Abbot', 'Hereford', 'Worcester', 'Chepstow', 'Newton Abbot', 'Bangor-on-Dee', 'Stratford', 'Ffos Las', 'Huntingdon', 'Newton Abbot', 'Bangor-on-Dee'],
    ['Aintree', 'Market Rasen', 'Market Rasen', 'Newcastle', 'Stratford', 'Hexham', 'Cartmel', 'Stratford', 'Cartmel', 'Cartmel','Bangor-on-Dee', 'Stratford', 'Ffos Las', 'Huntingdon', 'Newton Abbot', 'Bangor-on-Dee', 'Killarney (IRE)']
]

## ---------------------
## Gather the per row counts
## ---------------------
data_counted = [
    dict(collections.Counter(row))
    for row
    in data
]
#print(json.dumps(data_counted, indent=4, sort_keys=True))
## ---------------------

## ---------------------
## merge the rows on name
## ---------------------
data_counted_combined = {}
for row in data_counted:
    for name, count in row.items():
        target = data_counted_combined.setdefault(name, []) ## make sure this key is initialized
        target.append(count)
#print(json.dumps(data_counted_combined, indent=4, sort_keys=True))
## ---------------------

## ---------------------
## Generate the final result (sorted for fun)
## ---------------------
for key, value in sorted(data_counted_combined.items(), key=lambda x: x[0]):
    print(f"\"{key}\" appears in { len(value) } list(s) a minimum of { min(value) } times.")
## ---------------------

This produces the following:

"Aintree" appears in 1 list(s) a minimum of 1 times.
"Ballinrobe (IRE)" appears in 1 list(s) a minimum of 2 times.
"Bangor-on-Dee" appears in 3 list(s) a minimum of 2 times.
"Cartmel" appears in 1 list(s) a minimum of 3 times.
"Chepstow" appears in 1 list(s) a minimum of 1 times.
"Cork (IRE)" appears in 1 list(s) a minimum of 2 times.
"Curragh (IRE)" appears in 1 list(s) a minimum of 4 times.
"Downpatrick (IRE)" appears in 1 list(s) a minimum of 1 times.
"Fairyhouse (IRE)" appears in 1 list(s) a minimum of 1 times.
"Ffos Las" appears in 2 list(s) a minimum of 1 times.
"Galway (IRE)" appears in 1 list(s) a minimum of 4 times.
"Gowran Park (IRE)" appears in 1 list(s) a minimum of 2 times.
"Hereford" appears in 1 list(s) a minimum of 1 times.
"Hexham" appears in 1 list(s) a minimum of 1 times.
"Huntingdon" appears in 2 list(s) a minimum of 1 times.
"Killarney (IRE)" appears in 2 list(s) a minimum of 1 times.
"Limerick (IRE)" appears in 1 list(s) a minimum of 2 times.
"Market Rasen" appears in 1 list(s) a minimum of 2 times.
"Naas (IRE)" appears in 1 list(s) a minimum of 1 times.
"Newcastle" appears in 1 list(s) a minimum of 1 times.
"Newton Abbot" appears in 3 list(s) a minimum of 1 times.
"Punchestown (IRE)" appears in 1 list(s) a minimum of 1 times.
"Roscommon (IRE)" appears in 1 list(s) a minimum of 3 times.
"Stratford" appears in 2 list(s) a minimum of 1 times.
"Tipperary (IRE)" appears in 1 list(s) a minimum of 1 times.
"Worcester" appears in 1 list(s) a minimum of 3 times.

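The question also asks for the values that are in more than one list but not in every list. A minimal sketch of that filter, built on the same per-name count lists, might look like this. The names counts_by_name, in_every_list and in_some_lists are made up for illustration, and the data is a shortened sample.

```python
from collections import Counter

# Shortened, illustrative sample data
data = [
    ['Killarney (IRE)', 'Newton Abbot', 'Bangor-on-Dee', 'Bangor-on-Dee'],
    ['Newton Abbot', 'Worcester', 'Bangor-on-Dee', 'Stratford', 'Bangor-on-Dee'],
    ['Stratford', 'Newton Abbot', 'Bangor-on-Dee', 'Bangor-on-Dee', 'Killarney (IRE)'],
]

# name -> per-list counts, one entry for each list the name appears in
counts_by_name = {}
for row in data:
    for name, count in Counter(row).items():
        counts_by_name.setdefault(name, []).append(count)

# Values present in every list, with their minimum per-list count
in_every_list = {
    name: min(counts)
    for name, counts in counts_by_name.items()
    if len(counts) == len(data)
}

# Values present in more than one list but not in all of them,
# as (number of lists, minimum per-list count)
in_some_lists = {
    name: (len(counts), min(counts))
    for name, counts in counts_by_name.items()
    if 1 < len(counts) < len(data)
}

print(in_every_list)
print(in_some_lists)
print(len(in_some_lists), 'value(s) in more than one list but not in every list')
```

Because the comparison uses len(data), the same filter applies whether there are 2 lists or 20.
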
CodePudding user response：

Here's a breakdown of the code:

Count the frequency of each word in each list (counters).
Collect all the words from all the lists into the set of words that can appear in any of them (all_words).
Define a dict maxFreqDict which maps each word in all_words to a tuple (a, b), where
    a = the number of lists this word appears in
    b = the minimum number of occurrences of this word (among the lists where it appears)
Initialize each a to 0 and each b to some large number, which is later shrunk down.
Run counters through a reduce function that, for each possible word, updates the tuple (a, b) if the word is present in the current list and otherwise does nothing.
Print the result (the final state of maxFreqDict).

Note: more_itertools is a third-party package that extends the standard-library itertools module. more_itertools.flatten(array2d) can be replaced with itertools.chain(*array2d) (or itertools.chain.from_iterable(array2d)) if you don't want to install an additional package.

from collections import Counter
from functools import reduce
from more_itertools import flatten
from pprint import pprint

lists = [
    ['Limerick (IRE)', 'Fairyhouse (IRE)', 'Gowran Park (IRE)', 'Galway (IRE)', 'Roscommon (IRE)',
     'Ballinrobe (IRE)', 'Roscommon (IRE)', 'Downpatrick (IRE)', 'Ballinrobe (IRE)', 'Curragh (IRE)',
     'Naas (IRE)', 'Curragh (IRE)', 'Galway (IRE)', 'Cork (IRE)', 'Punchestown (IRE)', 'Galway (IRE)',
     'Tipperary (IRE)', 'Curragh (IRE)', 'Gowran Park (IRE)', 'Cork (IRE)', 'Galway (IRE)', 'Killarney (IRE)',
     'Curragh (IRE)', 'Roscommon (IRE)', 'Limerick (IRE)', 'Newton Abbot', 'Bangor-on-Dee', 'Bangor-on-Dee'],
    ['Newton Abbot', 'Worcester', 'Ffos Las', 'Worcester', 'Newton Abbot', 'Hereford', 'Worcester', 'Chepstow',
        'Newton Abbot', 'Bangor-on-Dee', 'Stratford', 'Ffos Las', 'Huntingdon', 'Newton Abbot', 'Bangor-on-Dee'],
    ['Aintree', 'Market Rasen', 'Market Rasen', 'Newcastle', 'Stratford', 'Hexham', 'Cartmel', 'Stratford', 'Cartmel',
        'Cartmel', 'Bangor-on-Dee', 'Stratford', 'Ffos Las', 'Huntingdon', 'Newton Abbot', 'Bangor-on-Dee', 'Killarney (IRE)']
]


# count the frequency of each word in each list
counters = [Counter(l) for l in lists]
# find all the unique words across all lists
all_words = set(flatten(c.keys() for c in counters))

def reduceFunc(maxFreqDict: dict[str, tuple[int, int]], curCounter: Counter) -> dict[str, tuple[int, int]]:
    for word in maxFreqDict.keys():
        if word in curCounter:
            # listFreq: in how many lists this word appears
            # wordFreq: the min number of occurrences of this word (among the lists that it appears in)
            listFreq, wordFreq = maxFreqDict[word]
            minWordFreq = min(wordFreq, curCounter[word])
            maxFreqDict[word] = (listFreq + 1, minWordFreq)
        else:
            pass
    return maxFreqDict

# start with a large number as the frequency of each word
# so we can find the actual frequency in reduceFunc using min()
maxFreqDict = {word: (0, 1e9) for word in all_words}
res = reduce(reduceFunc, counters, maxFreqDict)

pprint(res)

This should produce the following output:

{'Aintree': (1, 1),
 'Ballinrobe (IRE)': (1, 2),
 'Bangor-on-Dee': (3, 2),
 'Cartmel': (1, 3),
 'Chepstow': (1, 1),
 'Cork (IRE)': (1, 2),
 'Curragh (IRE)': (1, 4),
 'Downpatrick (IRE)': (1, 1),
 'Fairyhouse (IRE)': (1, 1),
 'Ffos Las': (2, 1),
 'Galway (IRE)': (1, 4),
 'Gowran Park (IRE)': (1, 2),
 'Hereford': (1, 1),
 'Hexham': (1, 1),
 'Huntingdon': (2, 1),
 'Killarney (IRE)': (2, 1),
 'Limerick (IRE)': (1, 2),
 'Market Rasen': (1, 2),
 'Naas (IRE)': (1, 1),
 'Newcastle': (1, 1),
 'Newton Abbot': (3, 1),
 'Punchestown (IRE)': (1, 1),
 'Roscommon (IRE)': (1, 3),
 'Stratford': (2, 1),
 'Tipperary (IRE)': (1, 1),
 'Worcester': (1, 3)}
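
For comparison, here is a standard-library-only sketch that builds the same kind of (a, b) tuples with a dict comprehension instead of reduce and the large starting value. It is an alternative under the same assumptions, not a replacement for the approach above, and it uses a shortened sample of the data.

```python
from collections import Counter
from itertools import chain
from pprint import pprint

# Shortened, illustrative sample data
lists = [
    ['Killarney (IRE)', 'Newton Abbot', 'Bangor-on-Dee', 'Bangor-on-Dee'],
    ['Newton Abbot', 'Worcester', 'Bangor-on-Dee', 'Bangor-on-Dee'],
    ['Newton Abbot', 'Bangor-on-Dee', 'Bangor-on-Dee', 'Killarney (IRE)'],
]

counters = [Counter(l) for l in lists]
# iterating a Counter yields its keys, so chain.from_iterable flattens them
all_words = set(chain.from_iterable(counters))

# word -> (number of lists containing it, minimum count among those lists)
res = {
    word: (
        sum(1 for c in counters if word in c),
        min(c[word] for c in counters if word in c),
    )
    for word in all_words
}

pprint(res)
```

Every word in all_words comes from at least one counter, so min() never receives an empty sequence here.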