I'm having problem with python while constructing a loop (or a nested loop) for my calculation. I have a very large data set (about 600k entries) with repeating ID's and their values. I'd like to form a loop which starts from the first entry, looks if it is repeating and if it is how manty times it does and calculates average of values for the same ID.
For example in below table I'd like it to return Average a = 2, Average b = 4 and Average c = 1.
I tried a nested for loop below but completely lost my track.
ID | Value |
---|---|
a | 1 |
a | 3 |
b | 3 |
b | 4 |
b | 5 |
c | 1 |
for i in range(0,len(ID)):
if ID[i] == ID[i 1]: #if the next ID is equal to the current one
for j in range (i,len(ID)): #look for how many repating values for current ID
if ID[j] != ID[j 1]:
i=j
break
#since it's know how many repeating ID's exist, somehow average them and continue for the second (distinct) ID
CodePudding user response:
This is a snippet of code that does not use any external modules. You are able to use them if you wish, but thought to leave this as it is still an option.
def avg(lst):
return sum(lst) / len(lst)
filename = "mycsvfile.csv"
averages = dict()
all_values = dict()
with open(filename, "r") as f:
# read all lines and save
for line in f.readlines():
split = line.split(",") # split on comma
row_id = split[0]
try:
# "ID is in the first column Value is at fifth." -OP
value = float(split[4])
# check if there is a new ID or not
if row_id in all_values.keys():
all_values[row_id].append(value)
else:
all_values[row_id] = [value]
except ValueError:
pass # ignore header lines
# get averages into dictionary
for k, v in all_values.items():
averages[k] = avg(v)
# print the data to show it is correct
for k, v in averages.items():
print(f"{k}: {v}")
This snippet of code reads in all data from a CSV file and then gets the averages into a dictionary. Key is the ID, value is the average.
How this works: This code creates 2 dictionaries:
# all_values:
{ID: [list_all_values_here]}
# for example
{'b': [3, 4, 5]}
#####
# averages:
{ID: average}
# for example
{'b': 3}
This starts off by reading in all data from the CSV file and putting it into the respective all_values dictionary. After reading all data, we iterate over all key: value pairs in the all_values dictionary and get the average of the list containing all values.
We iterate over all key:value pairs here in these 2 lines and add the average to the averages dictionary.
for k, v in all_values.items():
averages[k] = avg(v)
We then print out all values at the end to show that it is correct.
CodePudding user response:
In case you just need an algorithm: Since the data is sorted, you can iterate the outer loop on the ids, while storing two variables - one will count the number of identical ids, and the other one will sum up the values. When you come to an iteration where a new id is found, that's your cue to calculate the average (the sum of values divided by the id count), print it, and then do the same over and over for the next ids.
CodePudding user response:
You could do something like this - iterate through the IDs and values, comparing the current ID to the previous ID, and when it is different, calculating and storing the average.
ID = ["a", "a", "b", "b", "b", "c"]
values = [1,3,3,4,5,1]
averages = {}
previous_id = None
count = None
total = None
for id_, value in zip(ID, values):
if id_ == previous_id:
count = 1
total = value
else:
if previous_id is not None:
averages[previous_id] = total / count
count = 1
total = value
previous_id = id_
averages[id_] = total / count # to ensure we don't miss the last ID
But it will be much better and faster to use pandas
:
import pandas as pd
df = pd.DataFrame({"id": ID, "values": values}) # or use read_excel to import from your spreadsheet
print(df.groupby("id").mean())