I am trying to write a program that iterates through data values from a CSV file and adds them to a dictionary, keeping a running total of how many times each value appears in my list of values. I am able to do this, but I need to match on a range (not the range() function). For example, if the current value is within + or - 0.50 of another value, the program should take the average of the two and add one to the running total.
data = {}
file = open(fname)
#Create value dictionary, add running count to repeated values
for line in file:
    rows = line.split(",")
    for i in range(4):
        price = rows[i]
        price = float(price)
        newnum = price
        data[price] = data.get(price, 0) + 1
#Get top 10 most common values
top_dogs = {}
for i in range(10):
    key = max(data, key=data.get)
    value = data.pop(key)
    top_dogs[key] = value
print(top_dogs)
CodePudding user response:
In general, dicts don't have a capability for matching ranges, so you either need to collapse the range to a single value or use another data structure such as a sorted list.
As an example of the first technique, the round() function will suffice for grouping values that fall within "+ or - 0.50" of the same whole number:
data = [10.1, 11.2, 10.5, 12.5, 10.2, 12.6, 11.4, 11.7, 11.8]
d = {}
for x in data:
    k = round(x)
    d[k] = d.get(k, 0) + 1
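To connect the rounding approach back to the "top 10" step in the question, the same buckets can be counted with collections.Counter, whose most_common() method returns the highest counts directly. This is only a minimal sketch: the names counts and top are illustrative, and it assumes whole-number buckets are acceptable.

```python
from collections import Counter

data = [10.1, 11.2, 10.5, 12.5, 10.2, 12.6, 11.4, 11.7, 11.8]

# Count how many values fall into each rounded bucket.
counts = Counter(round(x) for x in data)

# most_common(n) returns up to n (bucket, count) pairs, highest count first.
top = counts.most_common(10)
print(top)
```

Keep in mind that Python's round() rounds exact halves to the nearest even integer, so 10.5 lands in bucket 10 and 12.5 lands in bucket 12.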
For the second technique, you can maintain a sorted list with the bisect module, which is good at searching ranges and maintaining sorted order:
from statistics import mean
from bisect import bisect_left, bisect_right, insort
data = [10.1, 11.2, 10.5, 12.5, 10.2, 12.6, 11.4, 11.7, 11.8]
d = {}
sorted_list = []
for x in data:
    lo = bisect_left(sorted_list, x - 0.5)
    hi = bisect_right(sorted_list, x + 0.5)
    if lo == hi:
        new_x = x
        new_count = 1
    else:
        old_x = sorted_list.pop(lo)
        new_x = mean([old_x, x])
        new_count = d.pop(old_x) + 1
    d[new_x] = new_count
    insort(sorted_list, new_x)
Note 1: This code can be tweaked further so that if multiple values are in the lo:hi range, the one closest to x is updated. For example, if sorted_list contained [10.1, 10.8], both values are within 0.50 of 10.5, but 10.8 should be selected for update because it is closer to 10.5.
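One possible way to make that tweak is sketched below. The helper name add_value and its tol parameter are hypothetical, introduced only for illustration; the sketch picks the index in lo:hi whose value is nearest to x instead of always taking lo.

```python
from bisect import bisect_left, bisect_right, insort
from statistics import mean

def add_value(sorted_list, counts, x, tol=0.5):
    """Merge x into the nearest existing value within tol, or add it as new."""
    lo = bisect_left(sorted_list, x - tol)
    hi = bisect_right(sorted_list, x + tol)
    if lo == hi:
        # No existing value lies within tol of x.
        new_x, new_count = x, 1
    else:
        # Among the candidates in sorted_list[lo:hi], choose the one nearest to x.
        idx = min(range(lo, hi), key=lambda i: abs(sorted_list[i] - x))
        old_x = sorted_list.pop(idx)
        new_x = mean([old_x, x])
        new_count = counts.pop(old_x) + 1
    counts[new_x] = new_count
    insort(sorted_list, new_x)

# The example from Note 1: 10.8 is closer to 10.5 than 10.1 is.
sorted_list = [10.1, 10.8]
counts = {10.1: 1, 10.8: 1}
add_value(sorted_list, counts, 10.5)
print(sorted_list, counts)
```

After the call, 10.1 is left untouched and 10.8 is replaced by the average of 10.8 and 10.5 with a count of 2.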
Note 2: The request to average the inputs likely isn't the right thing to do because it weights the most recently seen input more than the earlier inputs. A better result can be had by keeping a list of all nearby inputs and then averaging them at the end.
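A rough sketch of that alternative follows. It assumes each group is anchored by its first-seen value and that later inputs join the nearest anchor within 0.5; this anchoring rule and the names anchors, groups, and result are assumptions made for illustration, not part of the original request.

```python
from bisect import bisect_left, bisect_right, insort
from statistics import mean

data = [10.1, 11.2, 10.5, 12.5, 10.2, 12.6, 11.4, 11.7, 11.8]

anchors = []   # sorted first-seen value of each group (assumed anchoring rule)
groups = {}    # anchor -> list of every input assigned to that group

for x in data:
    lo = bisect_left(anchors, x - 0.5)
    hi = bisect_right(anchors, x + 0.5)
    if lo == hi:
        # No anchor is close enough, so x starts a new group.
        insort(anchors, x)
        groups[x] = [x]
    else:
        # Join the nearest anchor; the anchor itself never moves.
        nearest = min(anchors[lo:hi], key=lambda a: abs(a - x))
        groups[nearest].append(x)

# Average each group once, at the end, so every input is weighted equally.
result = {mean(members): len(members) for members in groups.values()}
print(result)
```

Because the averaging happens only once per group, an input seen early counts exactly as much as one seen late.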
Note 3: Rather than the algorithm as requested, it may be better to sort all the inputs, then scan for blocks where all values lie in a specified interval:
from statistics import mean
data = [10.1, 11.2, 10.5, 12.5, 10.2, 12.6, 11.4, 11.7, 11.8]
data.sort()
d = {}
equivalents = []
for x in data:
    if not equivalents or x < equivalents[0] + 1.0:
        equivalents.append(x)
    else:
        # Block is complete: record its mean and size, then start a new block with x
        d[mean(equivalents)] = len(equivalents)
        equivalents.clear()
        equivalents.append(x)
if equivalents:
    d[mean(equivalents)] = len(equivalents)
    equivalents.clear()