Aggregate by value and count, distinct array

Let's say I have this list of tuples:

[
('r', 'p', ['A', 'B']),
('r', 'f', ['A']),
('r', 'e', ['A']),
('r', 'p', ['A']),
('r', 'f', ['B']),
('r', 'p', ['B']),
('r', 'e', ['B']),
('r', 'c', ['A'])
]

I need to return a list of tuples grouped by the second value in each tuple, together with a count for each group. The third value, which is an array, needs to be merged across the group and reduced to its distinct elements.

So for the example above, the result will be:

[
('r', 'p', ['A', 'B'], 4),
('r', 'f', ['A', 'B'], 2),
('r', 'e', ['A', 'B'], 2),
('r', 'c', ['A'], 1)
]

In the result, the first value is a constant, the second is unique (it is the group-by key), the third is the merged array with duplicates removed, and the fourth is the total number of array elements in the group, counted before duplicates are removed.

CodePudding user response：

You could do this in pandas:

import pandas as pd

df = pd.DataFrame([
('r', 'p', ['A', 'B']),
('r', 'f', ['A']),
('r', 'e', ['A']),
('r', 'p', ['A']),
('r', 'f', ['B']),
('r', 'p', ['B']),
('r', 'e', ['B']),
('r', 'c', ['A'])
], columns=['first','second','arr'])

pd.merge(df.explode('arr').groupby(['first','second']).agg(set).reset_index(),
         df[['first','second']].value_counts().reset_index(),
         on=['first','second']).values.tolist()

Output

[
    ['r', 'c', {'A'}, 1],
    ['r', 'e', {'B', 'A'}, 2],
    ['r', 'f', {'B', 'A'}, 2],
    ['r', 'p', {'B', 'A'}, 3]
]

To address your edit (counting array elements rather than rows), you could do this:

(
  df.explode('arr')
    .value_counts()
    .reset_index(name='count')  # explicit name: the default is 0 before pandas 2.0 and 'count' from 2.0 on
    .groupby(['first','second'])
    .agg({'arr': set, 'count': 'sum'})
    .reset_index()
    .values
    .tolist()
)

Output

[
   ['r', 'c', {'A'}, 1],
   ['r', 'e', {'B', 'A'}, 2],
   ['r', 'f', {'B', 'A'}, 2],
   ['r', 'p', {'B', 'A'}, 4]
]

CodePudding user response：

Here's my attempt using itertools.

from itertools import groupby

data = [
('r', 'p', ['A', 'B']),
('r', 'f', ['A']),
('r', 'e', ['A']),
('r', 'p', ['A']),
('r', 'f', ['B']),
('r', 'p', ['B']),
('r', 'e', ['B']),
('r', 'c', ['A'])
]

# groupby needs sorted data
data.sort(key=lambda x: (x[0], x[1]))
result = []
for key,group in groupby(data, key=lambda x: (x[0], x[1])):
    # Make the AB list. Ex: s = ['A', 'B', 'A', 'B']
    s = [item for x in group for item in x[2]]
    # Put it all together. Ex: ('r', 'p', ['A', 'B'], 4)
    result.append(tuple(list(key) + [list(set(s))] + [len(s)]))
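
One thing to be aware of with the loop above is that list(set(s)) does not guarantee the order of the letters, and data.sort() reorders the caller's list in place. A minimal sketch of the same groupby idea that avoids both, assuming the distinct values are sortable, might look like this. The function name aggregate_sorted is made up for illustration and is not part of the answer.

```python
from itertools import groupby
from operator import itemgetter

data = [
    ('r', 'p', ['A', 'B']),
    ('r', 'f', ['A']),
    ('r', 'e', ['A']),
    ('r', 'p', ['A']),
    ('r', 'f', ['B']),
    ('r', 'p', ['B']),
    ('r', 'e', ['B']),
    ('r', 'c', ['A']),
]

def aggregate_sorted(rows):
    # Hypothetical helper name, used only for this sketch.
    key = itemgetter(0, 1)  # group on (first, second)
    result = []
    # sorted() returns a new list, so the input is left untouched.
    for (first, second), group in groupby(sorted(rows, key=key), key=key):
        # Flatten every array in the group, e.g. ['A', 'B', 'A', 'B'].
        letters = [item for row in group for item in row[2]]
        # sorted(set(...)) gives the distinct values in a stable order.
        result.append((first, second, sorted(set(letters)), len(letters)))
    return result

print(aggregate_sorted(data))
```

Because groupby works on sorted input, the groups come back ordered by key (c, e, f, p) rather than in the order they first appear in the data.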

CodePudding user response：

I hope I've understood your question well:

data = [
    ("r", "p", ["A", "B"]),
    ("r", "f", ["A"]),
    ("r", "e", ["A"]),
    ("r", "p", ["A"]),
    ("r", "f", ["B"]),
    ("r", "p", ["B"]),
    ("r", "e", ["B"]),
    ("r", "c", ["A"]),
]

out = {}
for a, b, c in data:
    out.setdefault((a, b), []).append(c)

out = [
    (a, b, list(set(v for l in c for v in l)), sum(map(len, c)))
    for (a, b), c in out.items()
]

print(out)

Prints:

[
    ("r", "p", ["B", "A"], 4),
    ("r", "f", ["B", "A"], 2),
    ("r", "e", ["B", "A"], 2),
    ("r", "c", ["A"], 1),
]
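
If the order of the letters matters (the question shows ['A', 'B'], while a set may give ['B', 'A']), a single-pass variant that keeps a set and a running count per key is one option. This is only a sketch under the assumption that the array values are hashable and sortable. The function name aggregate is hypothetical.

```python
data = [
    ("r", "p", ["A", "B"]),
    ("r", "f", ["A"]),
    ("r", "e", ["A"]),
    ("r", "p", ["A"]),
    ("r", "f", ["B"]),
    ("r", "p", ["B"]),
    ("r", "e", ["B"]),
    ("r", "c", ["A"]),
]

def aggregate(rows):
    # Hypothetical helper name, used only for this sketch.
    # Maps (first, second) -> [set of distinct values, total element count].
    groups = {}
    for first, second, values in rows:
        entry = groups.setdefault((first, second), [set(), 0])
        entry[0].update(values)   # collect distinct values
        entry[1] += len(values)   # count every element, duplicates included
    # Dicts keep insertion order (Python 3.7+), so groups appear in the
    # order their key was first seen; sorted() fixes the letter order.
    return [
        (first, second, sorted(distinct), count)
        for (first, second), (distinct, count) in groups.items()
    ]

print(aggregate(data))
```

On the sample data this keeps the groups in first-seen order (p, f, e, c), which matches the ordering shown in the question.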