I have a text file that has a long 2D array as follows:
[[1, 2], [5,585], [2, 0], [1, 500], [2, 668], [3, 54], [4, 28], [3, 28], [4,163], [3,85], [5,906], [2,5000], [6,358], [4,69], [3,89], [7, 258],[5, 632], [7, 585] ..... [6, 47]]
The first element of each has numbers between 1 to 7. I want to read information of second element for all and find the maximum and minimum amount for each group between 1 to 7 separately. For example output like this:
Mix for first element with 1: 500
Max for first element with 1: 2
average: 251
Min for row with 2: 0
Max for row with 2: 5000
average: 2500
and so on
What is the most efficient way of getting min, max, and average values by grouping based on the first element of the array?
file = open("myfile.txt", "r")
list_of_lists = file.read()
unique_values = set([list[1] for list in list_of_lists])
group_list = [[list[0] for list in list_of_lists if list[1] == value] for value in unique_values]
print(group_list)
CodePudding user response:
Use pandas
data = [[1, 2], [5,585], [2, 0], [1, 500], [2, 668],
[3, 54], [4, 28], [3, 28], [4,163], [3,85],
[5,906], [2,5000], [6,358], [4,69], [3,89],
[7, 258],[5, 632], [7, 585]]
grp = []
col = []
for k, v in data:
grp.append(k)
col.append(v)
pd.Series(col).groupby(grp).agg(["min", "mean", "max"])
# min mean max
# 1 2 251.000000 500
# 2 0 1889.333333 5000
# 3 28 64.000000 89
# 4 28 86.666667 163
# 5 585 707.666667 906
# 6 358 358.000000 358
# 7 258 421.500000 585
CodePudding user response:
We can use pandas
for this:
import numpy as np
import pandas as pd
file_data = [[1, 2], [5,585], [2, 0], [1, 500], [2, 668], [3, 54], [4, 28], [3, 28], [4,163], [3,85], [5,906], [2,5000], [6,358], [4,69], [3,89], [7, 258],[5, 632], [7, 585], [6, 47]]
file_data = np.array(file_data)
df = pd.DataFrame(data = {'num': file_data[:, 0], 'data': file_data[:, 1]})
for i in np.sort(df['num'].unique()):
print('Min for', i, ':', df.loc[df['num'] == i, 'data'].min())
print('Max for', i, ':', df.loc[df['num'] == i, 'data'].max())
temp_df = df.loc[df['num'] == i, 'data']
print("Average for", i, ":", temp_df.sum()/len(temp_df.index))
This gives us:
Min for 1 : 2
Max for 1 : 500
Average for 1 : 251.0
Min for 2 : 0
Max for 2 : 5000
Average for 2 : 1889.3333333333333
Min for 3 : 28
Max for 3 : 89
Average for 3 : 64.0
Min for 4 : 28
Max for 4 : 163
Average for 4 : 86.66666666666667
Min for 5 : 585
Max for 5 : 906
Average for 5 : 707.6666666666666
Min for 6 : 47
Max for 6 : 358
Average for 6 : 202.5
Min for 7 : 258
Max for 7 : 585
Average for 7 : 421.5
CodePudding user response:
You can maintain a dictionary that maps group IDs (integers) to lists of size two (one entry for the group's minimum, one for the group's maximum). To extract these values, make one pass through the list. Notably, this approach does not require the use of any heavy dependencies, like numpy
or pandas
. This also does not require sorting, so it runs faster asymptotically O(n)
for my approach versus O(n log n)
for sorting:
data = {}
for group, entry in items:
if group not in data:
data[group] = [entry, entry]
else:
current_min, current_max = data[group]
data[group] = [min(entry, current_min), max(entry, current_max)]
for key in data:
print(f"Min for row with {key}: {data[key][0]}")
print(f"Max for row with {key}: {data[key][1]}")
This outputs:
Min for row with 1: 2
Max for row with 1: 500
Min for row with 5: 585
Max for row with 5: 906
Min for row with 2: 0
Max for row with 2: 5000
Min for row with 3: 28
Max for row with 3: 89
Min for row with 4: 28
Max for row with 4: 163
Min for row with 6: 47
Max for row with 6: 358
Min for row with 7: 258
Max for row with 7: 585
CodePudding user response:
You can sort the list-of-lists by the first element of each item and then groupby
the same element.
import itertools
l = [[1, 2], [5,585], [2, 0], [1, 500], [2, 668], [3, 54], [4, 28], [3, 28], [4,163], [3,85], [5,906], [2,5000], [6,358], [4,69], [3,89], [7, 258],[5, 632], [7, 585]]
l.sort(key=lambda item: item[0])
groups = { k: [item[1] for item in v]
for k, v in itertools.groupby(l, key=lambda item: item[0])}
This gives groups=
{1: [2, 500],
2: [0, 668, 5000],
3: [54, 28, 85, 89],
4: [28, 163, 69],
5: [585, 906, 632],
6: [358],
7: [258, 585]}
An explanation of the groups = ...
line:
- First, I
groupby()
the sorted list using the first element of each item as the key. This groups all elements with the same key in a single iterable, so for the key of1
, we'd have an iterable containing the elements[1, 2]
and[1, 500]
. - I iterate over this
groupby()
result, and create a dictionary using a dict comprehension- For the keys of the dict, I use the key of the
groupby()
- For the values of the dict, I have a list comprehension that iterates over each item in the group, and takes only the first element of that item (so the
1
key will now have a value that is a list containing2
and500
).
- For the keys of the dict, I use the key of the
Then, find the max
and min
of each group, using only the second element of each item:
max_vals = {k: max(v) for k, v in groups.items()}
# {1: 500, 2: 5000, 3: 89, 4: 163, 5: 906, 6: 358, 7: 585}
min_vals = {k: min(v) for k, v in groups.items()}
# {1: 2, 2: 0, 3: 28, 4: 28, 5: 585, 6: 358, 7: 258}
avg_vals = {k: sum(v) / len(v) for k, v in groups.items()}
# {1: 251.0, 2: 1889.3333333333333, 3: 64.0, 4: 86.66666666666667, 5: 707.6666666666666, 6: 358.0, 7: 421.5}
Or, to print them all the way you want:
for k, v in groups.items():
print(f"Max for first element with {k}: {max(v)}")
print(f"Min for first element with {k}: {min(v)}")
print(f"Average: {sum(v) / len(v)}")
Which gives:
Max for first element with 1: 500
Min for first element with 1: 2
Average: 251.0
Max for first element with 2: 5000
Min for first element with 2: 0
Average: 1889.3333333333333
Max for first element with 3: 89
Min for first element with 3: 28
Average: 64.0
... and so on