I am a newbie into python and I am trying to optimize a snippet of my program from using two for loops to using the list comprehension in Python3. While doing this, I am unable to calculate the sum of more than one column. For Ex, if I have columns 1,2, and 3 of types float int, and string in a dictionary, I am able to calculate the sum only for column one and not for column 2.
The double for loop looks something like this
final_dict = []
for k, g in itertools.groupby(temp_dict, key=lambda x: (x['transaction_category_id'])):
txn_amount = 0
distinct_user_count = 0
for v in g:
# print(k)
txn_amount = float(v['transaction_amount'])
distinct_user_count = v['user_id_count']
# print(v)
final_dict.append({'transaction_category_id': k, 'aggregated_transaction_amount': txn_amount,
'distinct_user_count': distinct_user_count})
The code I want to optimise to should ideally look something like this :
final_result = [[k, sum(float(v['transaction_amount']) for v in g),sum(s['user_id_count'] for s in g)] for k, g in
itertools.groupby(temp_dict, key=lambda x: (x['transaction_category_id']))]
But the code does not add up values for the user_id_count
column and return sum as 0.
The sample data looks something like this :
user_id,transaction_amount,transaction_category_id
b2d30a62-36bd-41c6-8221-987d5c4cd707,63.05,3
b2d30a62-36bd-41c6-8221-987d5c4cd707,13.97,4
b2d30a62-36bd-41c6-8221-987d5c4cd707,97.15,4
b2d30a62-36bd-41c6-8221-987d5c4cd707,23.54,5
and the ideal output would look like :
['4', 111.12, 2],
['3', 63.05, 1],
['5', 23.54, 1],
but it prints out and the ideal output would look like :
['4', 111.12, 0],
['3', 63.05, 0],
['5', 23.54, 0],
I tried the below sample code out but the output is not what I expected :
final_result = [[k, sum(float(v['transaction_amount']) for v in g),sum(s['user_id_count'] for s in g)] for k, g in
itertools.groupby(temp_dict, key=lambda x: (x['transaction_category_id']))]
CodePudding user response:
You seem to have a very simple comma-delimited CSV file. To get output similar to that shown in the question you could do this:
from collections import defaultdict
FILENAME = '/Volumes/G-Drive/sample.csv'
def gendict():
return {'amount': 0.0, 'count': 0}
summary = defaultdict(gendict)
with open(FILENAME) as csv:
next(csv, None) # skip column headers
for line in map(str.rstrip, csv):
*_, _amount, _id = line.split(',')
summary[_id]['amount'] = float(_amount)
summary[_id]['count'] = 1
slist = [[k, *v.values()] for k, v in summary.items()]
print(slist)
Output:
[['3', 63.05, 1], ['4', 111.12, 2], ['5', 23.54, 1]]
CodePudding user response:
Alternatively, you can use csv.DictReader
with collections.defaultdict
to simulate the groupby.agg
of pandas:
from csv import DictReader
from collections import defaultdict
out = defaultdict(lambda: [0,0])
with open("tmp/Ashish Shetty.csv", "r") as csvfile:
reader = DictReader(csvfile)
for row in reader:
tra_id = str(row["transaction_category_id"])
out[tra_id][0] = float(row["transaction_amount"])
out[tra_id][1] = 1
out = [[k] v for k,v in out.items()]
Output :
print(out)
#[['3', 63.05, 1], ['4', 111.12, 2], ['5', 23.54, 1]]