Home > database >  Python sum of values in dataset
Python sum of values in dataset

Time:05-14

I have this dataframe (ID is a string and Value a float):

ID              Value
1               0.0
1.1             0.0
1.2             0.0
1.2.1           27508.42
1.2.2           25861.82
1.3             0.0
1.3.1           0.0
1.3.1.1         0.0
1.3.1.2         0.0
1.3.1.3         30396.25

Whose structure works like this:

1
├── 1.1  
├── 1.2  
│   ├── 1.2.1  
│   └── 1.2.2  
└── 1.3  
    └── 1.3.1  
        ├── 1.3.1.1
        ├── 1.3.1.2    
        └── 1.3.1.3  

And need for the value of the 'parent' node to be the sum of the leaves. So:

ID              Value
1               83766.489    (1.1   1.2   1.3)
1.1             0.0
1.2             53370.24     (1.2.1   1.2.2)
1.2.1           27508.42
1.2.2           25861.82
1.3             30396.25     (1.3.1)
1.3.1           30396.25     (1.3.1.1   1.3.1.2   1.3.1.3)
1.3.1.1         0.0
1.3.1.2         0.0
1.3.1.3         30396.25

How can I group the IDs? Using groupby wont work since all the IDs are unique. Should I change the structure of the dataframe to better reflect the logic of the schema?

CodePudding user response:

You could find which IDs make up the "child ID" of each ID and then sum across these "child ID"s

from itertools import tee
from collections import defaultdict
d = defaultdict(list)
a, b = tee(df['ID'].values)
b = list(b)
for a_val in a:
    for b_val in b:
        if b_val.startswith(a_val):
            d[a_val].append(b_val)
d
for b_val in b:
    df.loc[df['ID'] == b_val, 'total'] = sum(df.loc[df['ID'].isin(d[b_val]), 'Value'])
print(df)
        ID     Value     total
0        1      0.00  83766.49
1      1.1      0.00      0.00
2      1.2      0.00  53370.24
3    1.2.1  27508.42  27508.42
4    1.2.2  25861.82  25861.82
5      1.3      0.00  30396.25
6    1.3.1      0.00  30396.25
7  1.3.1.1      0.00      0.00
8  1.3.1.2      0.00      0.00
9  1.3.1.3  30396.25  30396.25

CodePudding user response:

Another solution (assuming column ID is sorted):

def counter(x):
    out = []
    for id_, v in zip(x.index, x):
        s = sum(
            v
            for a, v in out
            if a.startswith(id_) and id_.count(".") == a.count(".") - 1
        )
        out.append((id_, s   v))
    return [v for _, v in out]


print(df.set_index("ID")[::-1].apply(counter)[::-1].reset_index())

Prints:

        ID     Value
0        1  83766.49
1      1.1      0.00
2      1.2  53370.24
3    1.2.1  27508.42
4    1.2.2  25861.82
5      1.3  30396.25
6    1.3.1  30396.25
7  1.3.1.1      0.00
8  1.3.1.2      0.00
9  1.3.1.3  30396.25
  • Related