I have a list of lists containing strings and I want to count in how many of those lists each element appears:
list_of_lists = [["dog", "cow"], ["dragon", "ox", "cow"], ["fox", "cow", "dog"]]
So, cow appears in 3 lists, dog appears in 2, etc.
For such a small dataset, I would normally do:
from collections import Counter
from itertools import chain
count = Counter(chain.from_iterable(set(x) for x in list_of_lists))
and thus:
print(count["dog"])
2
However, I want to do this for a large dataset using PySpark and MapReduce, so that for each element I get the number of lists it appears in:
[("dog", 2),
("cow", 3),
("dragon", 1),
("ox", 1),
("fox", 1)]
etc.
I am trying things like:
list_of_lists = sc.parallelize(list_of_lists)
list_occurencies = list_of_lists.map(lambda x: x, count[x])
but without success.
CodePudding user response:
Use flatMap to flatten the nested arrays, then reduceByKey to count how many lists each word appears in:
list_of_lists = sc.parallelize(list_of_lists)
# set(x) dedupes within each sublist, so a word counts at most once per list
list_of_lists = list_of_lists.flatMap(lambda x: set(x)) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda a, b: a + b)
print(list_of_lists.collect())
# [('fox', 1), ('dragon', 1), ('ox', 1), ('dog', 2), ('cow', 3)]
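As an aside, when the result is small enough to collect on the driver, countByValue() does the map and reduce steps in one action. A minimal sketch (raw here is just the original Python list under a fresh name, since list_of_lists was rebound to an RDD above):

raw = [["dog", "cow"], ["dragon", "ox", "cow"], ["fox", "cow", "dog"]]
counts = sc.parallelize(raw).flatMap(set).countByValue()
print(dict(counts))
# e.g. {'dog': 2, 'cow': 3, 'dragon': 1, 'ox': 1, 'fox': 1} (key order may vary)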
CodePudding user response:
It seems the sublists don't contain duplicate entries, since you're using set there. In that case, I suggest flattening your list and counting the frequency of each item. I applied the following flattening function to the list of lists and used collections.Counter.
def flatten_list(lsts):
    out = []
    for l in lsts:
        out += l  # extend with each sublist's items
    return out
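For instance, on the small example from the question (this matches the per-list counts only because no sublist repeats an item):

from collections import Counter

list_of_lists = [["dog", "cow"], ["dragon", "ox", "cow"], ["fox", "cow", "dog"]]
count = Counter(flatten_list(list_of_lists))
print(count["dog"])
# 2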
I used the following to generate a list of lists:
import random
random.seed(420)
lsts = []
for i in range(10000):
    n = random.randint(10, 100)
    lsts.append(random.sample(range(n), n // 2))
and timed the counter:
%timeit y = Counter(chain.from_iterable(set(x) for x in lsts))
8.37 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit x = Counter(flatten_list(lsts))
5.37 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
It might not be the kind of solution you're looking for but it is an idea.
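And if you need this idea back in PySpark, the same flatten-and-count shape carries over. A sketch under the same no-duplicates assumption, using operator.add for the reduce:

from operator import add

rdd = sc.parallelize(lsts)
# an identity flatMap flattens one level, just like flatten_list above
counts = rdd.flatMap(lambda x: x).map(lambda w: (w, 1)).reduceByKey(add)
# counts.take(5) returns the first few (item, count) pairs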