A loop that makes multi-conditional summations

I have a data frame of the form:

df = [["john","2019","30.2"] , ["john","2019","40"] , ["john","2020","50.3"] , 
      ["amy","2019","60"] , ["amy","2019","20"] , ["amy","2020","40.1"]]

My desired result is a list in which the last element is summed over all rows whose first two elements are equal:

> [["john", "2019", "70.2"] ,  ["john","2020","50.3"] , ["amy","2019","80"] , ["amy","2020","40.1"]]

What I tried was a for loop that checks both conditions and, if they hold, sums up the last element. This is a kind of pseudo-code:

for i in df[i]:
   if df[i][0] == df[i + 1][0] and df[i][1] == df[i + 1][1]: #if both conditions are true
      sum1 = sum(float(df[i][2]))
      lst = []
      lst.append(df[i][0])
      lst.append(df[i][1])
      lst.append(str(sum1))

Edit: Would appreciate a solution that doesn't use packages.

CodePudding user response：

The following code doesn't use any packages. Since Python 3.7, dicts are guaranteed to preserve insertion order; the code relies on this so that the final result keeps the order in which the keys first appear. If for some reason your Python is older than 3.7, tell me and I'll modify the code to do the ordering explicitly instead of relying on this language feature.

df = [["john","2019","30.2"], ["john","2019","40"], ["john","2020","50.3"],
      ["amy","2019","60"], ["amy","2019","20"], ["amy","2020","40.1"]]

r = {}
for *a, b in df:
    a = tuple(a)        # key: every field except the last one
    if a not in r:
        r[a] = 0
    r[a] += float(b)    # add the last field to the running sum
r = [list(k) + [str(v)] for k, v in r.items()]

print(r)

Output:

[['john', '2019', '70.2'], ['john', '2020', '50.3'], ['amy', '2019', '80.0'], ['amy', '2020', '40.1']]
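
For interpreters older than Python 3.7, where the language does not guarantee dict ordering, one possible variant is to record the first appearance of each key in a separate list. This is only a sketch of that idea, not necessarily the modification the answer's author had in mind; the names `sum_by_key`, `totals` and `order` are made up for illustration.

```python
def sum_by_key(rows):
    """Sum the last field of each row, grouped by all preceding fields."""
    totals = {}   # key tuple -> running sum
    order = []    # key tuples in order of first appearance
    for *key, value in rows:
        key = tuple(key)
        if key not in totals:
            totals[key] = 0.0
            order.append(key)
        totals[key] += float(value)
    # Rebuild the rows from the recorded order, not from the dict's own ordering
    return [list(key) + [str(totals[key])] for key in order]


df = [["john", "2019", "30.2"], ["john", "2019", "40"], ["john", "2020", "50.3"],
      ["amy", "2019", "60"], ["amy", "2019", "20"], ["amy", "2020", "40.1"]]

result = sum_by_key(df)
print(result)
```

The extra list costs a little memory but makes the output order independent of how the dict is implemented.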

CodePudding user response：

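Since you named the variable df, I assume you are familiar with pandas.

This is easy to do in pandas. Convert your list into a DataFrame and cast the value column to a numeric type.

Then group by the columns whose combinations should be unique and sum the remaining column (col_a, col_b and col_c are placeholder column names):

df.groupby(['col_a', 'col_b'], as_index=False)['col_c'].sum()

Note that .last() in place of .sum() would only keep the last row of each group instead of adding the values up. groupby sorts the group keys by default; pass sort=False to keep the order of first appearance.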

CodePudding user response：

Here's a way to do it using defaultdict:

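from collections import defaultdict
# outer key: name, inner key: year; missing sums start at 0.0
sums = defaultdict(lambda: defaultdict(float))
for item in df:
    sums[item[0]][item[1]] += float(item[2])
# flatten the nested dict back into [name, year, total] rows
lst = [[key, inner_key, value] for key in sums for inner_key, value in sums[key].items()]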
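
The snippet above leaves the sums as floats. If the string form shown in the question is wanted, one option is to format them afterwards. The sketch below assumes a flat `(name, year)` tuple key instead of nested dictionaries, and the helper `as_text` is a hypothetical name; both choices are illustrative rather than part of the answer.

```python
from collections import defaultdict

df = [["john", "2019", "30.2"], ["john", "2019", "40"], ["john", "2020", "50.3"],
      ["amy", "2019", "60"], ["amy", "2019", "20"], ["amy", "2020", "40.1"]]

sums = defaultdict(float)   # missing keys start at 0.0
for name, year, value in df:
    sums[(name, year)] += float(value)


def as_text(x):
    """Render 80.0 as '80'; leave non-integral values such as 40.1 as str(x)."""
    return str(int(x)) if x.is_integer() else str(x)


lst = [[name, year, as_text(total)] for (name, year), total in sums.items()]
print(lst)
```

Because the values pass through float, sums of decimal fractions can still carry binary rounding error in general; the formatting only removes a trailing `.0`.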

CodePudding user response：

Dictionaries have the convenient setdefault method, which checks whether its first argument is a key of the dictionary. If it is, the method returns the corresponding value; otherwise it inserts the key with the given default value and returns that default.

In our case, because we want to sum numerical values, the default must of course be 0.

We use a temporary dictionary, indexed by the tuple (name, year), and when we are finished with the summing we unfold the dictionary data into a list of lists, following the direction you showed in the question's pseudo-code.

In [15]: data = [["john","2019","30.2"] , ["john","2019","40"] , ["john","2020","50.3"] ,
    ...:         ["amy","2019","60"] , ["amy","2019","20"] , ["amy","2020","40.1"]]
    ...: d_temp = {}
    ...: for n, y, v in data:
    ...:     d_temp[(n,y)] = d_temp.setdefault((n,y),0) + float(v)
    ...: lol = [list(k) + [v] for k, v in d_temp.items()]
    ...: lol
Out[15]: 
[['john', '2019', 70.2],
 ['john', '2020', 50.3],
 ['amy', '2019', 80.0],
 ['amy', '2020', 40.1]]
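
To see how setdefault behaves in isolation, here is a small sketch; the dictionary `d` and its contents are made up for illustration. It also contrasts setdefault with `dict.get`, which returns a default without storing it.

```python
d = {("john", "2019"): 30.2}

# Key present: the stored value is returned and the dict is unchanged
print(d.setdefault(("john", "2019"), 0))   # 30.2

# Key missing: the default is inserted into the dict and returned
print(d.setdefault(("amy", "2019"), 0))    # 0
print(("amy", "2019") in d)                # True

# dict.get returns the default without inserting anything
print(d.get(("amy", "2020"), 0))           # 0
print(("amy", "2020") in d)                # False
```

Since the loop in the answer assigns to the key straight away, `d_temp.get((n, y), 0)` would give the same result there.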

CodePudding user response：

One option, using tools within the standard library:

from itertools import groupby
from decimal import Decimal
from operator import itemgetter

# itertools' groupby requires the data to be sorted
key_func = itemgetter(0,1)
df = sorted(df, key = key_func)

# compute values within the groupby
[[*key, str(sum(Decimal(e) for *_, e in ent))] 
  for key, ent 
  in groupby(df, key = key_func)]

[['amy', '2019', '80'],
 ['amy', '2020', '40.1'],
 ['john', '2019', '70.2'],
 ['john', '2020', '50.3']]
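
The sorting step matters because itertools.groupby only merges consecutive items with equal keys. A minimal sketch with made-up, interleaved rows shows the difference; the `rows` list and the `totals` helper are illustrative and not part of the answer.

```python
from decimal import Decimal
from itertools import groupby
from operator import itemgetter

key_func = itemgetter(0, 1)

# Made-up rows in which the two ("john", "2019") entries are not adjacent
rows = [["john", "2019", "30.2"], ["amy", "2019", "60"], ["john", "2019", "40"]]


def totals(data):
    """Sum the last field of each consecutive (name, year) run."""
    return [[*key, str(sum(Decimal(v) for *_, v in group))]
            for key, group in groupby(data, key=key_func)]


unsorted_result = totals(rows)                        # john/2019 shows up twice
sorted_result = totals(sorted(rows, key=key_func))    # one row per key

print(unsorted_result)
print(sorted_result)
```

Sorting changes the output order to key order, which is why the result above lists amy before john. Decimal keeps the text form of the inputs, so 60 + 20 prints as 80 and not 80.0.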