How to get statistics from an array Collection ? (min max, and average)-CodePudding

I have a text file that has a long 2D array as follows:

[[1, 2], [5,585], [2, 0], [1, 500], [2, 668], [3, 54], [4, 28], [3, 28], [4,163], [3,85], [5,906], [2,5000], [6,358], [4,69], [3,89], [7, 258],[5, 632], [7, 585] ..... [6, 47]]

The first element of each has numbers between 1 to 7. I want to read information of second element for all and find the maximum and minimum amount for each group between 1 to 7 separately. For example output like this:

Mix for first element with 1: 500  
Max for first element with 1: 2
average: 251

Min for row with 2:  0
Max for row with 2:  5000
average: 2500

and so on

What is the most efficient way of getting min, max, and average values by grouping based on the first element of the array?

file = open("myfile.txt", "r")
list_of_lists = file.read()

unique_values = set([list[1] for list in list_of_lists])
group_list = [[list[0] for list in list_of_lists if list[1] == value] for value in unique_values]

print(group_list)

CodePudding user response：

Use pandas

data = [[1, 2], [5,585], [2, 0], [1, 500], [2, 668],
        [3, 54], [4, 28], [3, 28], [4,163], [3,85],
        [5,906], [2,5000], [6,358], [4,69], [3,89],
        [7, 258],[5, 632], [7, 585]]

grp = []
col = []
for k, v in data:
    grp.append(k)
    col.append(v)

pd.Series(col).groupby(grp).agg(["min", "mean", "max"])
#    min         mean   max
# 1    2   251.000000   500
# 2    0  1889.333333  5000
# 3   28    64.000000    89
# 4   28    86.666667   163
# 5  585   707.666667   906
# 6  358   358.000000   358
# 7  258   421.500000   585

CodePudding user response：

We can use pandas for this:

import numpy as np
import pandas as pd

file_data = [[1, 2], [5,585], [2, 0], [1, 500], [2, 668], [3, 54], [4, 28], [3, 28], [4,163], [3,85], [5,906], [2,5000], [6,358], [4,69], [3,89], [7, 258],[5, 632], [7, 585], [6, 47]]

file_data = np.array(file_data)

df = pd.DataFrame(data = {'num': file_data[:, 0], 'data': file_data[:, 1]})

for i in np.sort(df['num'].unique()):
    print('Min for', i, ':', df.loc[df['num'] == i, 'data'].min())
    print('Max for', i, ':', df.loc[df['num'] == i, 'data'].max())
    temp_df = df.loc[df['num'] == i, 'data']
    print("Average for", i, ":", temp_df.sum()/len(temp_df.index))

This gives us:

Min for 1 : 2
Max for 1 : 500
Average for 1 : 251.0
Min for 2 : 0
Max for 2 : 5000
Average for 2 : 1889.3333333333333
Min for 3 : 28
Max for 3 : 89
Average for 3 : 64.0
Min for 4 : 28
Max for 4 : 163
Average for 4 : 86.66666666666667
Min for 5 : 585
Max for 5 : 906
Average for 5 : 707.6666666666666
Min for 6 : 47
Max for 6 : 358
Average for 6 : 202.5
Min for 7 : 258
Max for 7 : 585
Average for 7 : 421.5

CodePudding user response：

You can maintain a dictionary that maps group IDs (integers) to lists of size two (one entry for the group's minimum, one for the group's maximum). To extract these values, make one pass through the list. Notably, this approach does not require the use of any heavy dependencies, like numpy or pandas. This also does not require sorting, so it runs faster asymptotically O(n) for my approach versus O(n log n) for sorting:

data = {}

for group, entry in items:
    if group not in data:
        data[group] = [entry, entry]
    else:
        current_min, current_max = data[group]
        data[group] = [min(entry, current_min), max(entry, current_max)]

for key in data:
    print(f"Min for row with {key}: {data[key][0]}")
    print(f"Max for row with {key}: {data[key][1]}")

This outputs:

Min for row with 1: 2
Max for row with 1: 500
Min for row with 5: 585
Max for row with 5: 906
Min for row with 2: 0
Max for row with 2: 5000
Min for row with 3: 28
Max for row with 3: 89
Min for row with 4: 28
Max for row with 4: 163
Min for row with 6: 47
Max for row with 6: 358
Min for row with 7: 258
Max for row with 7: 585

CodePudding user response：

You can sort the list-of-lists by the first element of each item and then groupby the same element.

import itertools
l = [[1, 2], [5,585], [2, 0], [1, 500], [2, 668], [3, 54], [4, 28], [3, 28], [4,163], [3,85], [5,906], [2,5000], [6,358], [4,69], [3,89], [7, 258],[5, 632], [7, 585]]

l.sort(key=lambda item: item[0])
groups = { k: [item[1] for item in v] 
           for k, v in itertools.groupby(l, key=lambda item: item[0])}

Why do I need to sort first?

This gives groups=

{1: [2, 500],
 2: [0, 668, 5000],
 3: [54, 28, 85, 89],
 4: [28, 163, 69],
 5: [585, 906, 632],
 6: [358],
 7: [258, 585]}

An explanation of the groups = ... line:

First, I groupby() the sorted list using the first element of each item as the key. This groups all elements with the same key in a single iterable, so for the key of 1, we'd have an iterable containing the elements [1, 2] and [1, 500].
I iterate over this groupby() result, and create a dictionary using a dict comprehension
- For the keys of the dict, I use the key of the groupby()
- For the values of the dict, I have a list comprehension that iterates over each item in the group, and takes only the first element of that item (so the 1 key will now have a value that is a list containing 2 and 500).

Then, find the max and min of each group, using only the second element of each item:

max_vals = {k: max(v) for k, v in groups.items()}
# {1: 500, 2: 5000, 3: 89, 4: 163, 5: 906, 6: 358, 7: 585}

min_vals = {k: min(v) for k, v in groups.items()}
# {1: 2, 2: 0, 3: 28, 4: 28, 5: 585, 6: 358, 7: 258}

avg_vals = {k: sum(v) / len(v) for k, v in groups.items()}
# {1: 251.0,  2: 1889.3333333333333,  3: 64.0,  4: 86.66666666666667,  5: 707.6666666666666,  6: 358.0,  7: 421.5}

Or, to print them all the way you want:

for k, v in groups.items():
    print(f"Max for first element with {k}: {max(v)}")  
    print(f"Min for first element with {k}: {min(v)}")
    print(f"Average: {sum(v) / len(v)}")

Which gives:

Max for first element with 1: 500
Min for first element with 1: 2
Average: 251.0
Max for first element with 2: 5000
Min for first element with 2: 0
Average: 1889.3333333333333
Max for first element with 3: 89
Min for first element with 3: 28
Average: 64.0
... and so on