Aggregates in python-CodePudding

I have a txt file with the following content:

ID1;ID2;TIME;VALUE  
1000;100;012021;12
1000;100;022021;4129
1000;100;032021;128
1000;100;042021;412
1000;100;052021;12818
1000;120;022021;4129
1000;100;062021;546
1000;100;072021;86
1000;120;052021;12818
1000;100;082021;754
1000;100;092021;2633
1000;100;102021;571
1000;120;092021;2633
1000;100;112021;2233
1000;100;122021;571
1000;120;012021;12
1000;120;032021;128
1000;120;042021;412
1000;120;062021;546
1000;120;072021;86
1000;120;082021;754
1000;120;102021;571
1000;120;112021;2233
1000;120;122021;571
1000;100;012022;12
1000;100;022022;4129
1000;120;022022;4129
1000;120;032022;128
1000;120;042022;412
1000;100;032022;128
1000;100;042022;412
1000;100;052022;12818
1000;100;062022;546
1000;100;072022;86
1000;100;082022;754
1000;120;072022;86
1000;120;082022;754
1000;120;092022;2633
1000;120;102022;571
1000;100;092022;2633
1000;100;102022;571
1000;100;112022;2233
1000;100;122022;571
1000;120;012022;12
1000;120;052022;12818
1000;120;062022;546
1000;120;112022;2233
1000;120;122022;571

I need to make aggregates of time (half year, total year), using the items from column time, which have the same ID1, ID2 and sum up the values.

The output should look like this:

I would appreciate your help! This is what I have so far for half year: #already sorted by time

data=open("file.txt").readlines()
count = 0
for line in data:
    count  = 1
    for n in range(count - 1, len(data), 6):
       subList = [data[n:n   6]]
       break

CodePudding user response：

I'm far from being a Python expert but how about something like:

    dd = defaultdict(lambda: [])
    rows = [elems for elems in [line.strip().split(';') for line in data[1:]]]
    for row in rows:
        mm = row[2][:2]
        yy = row[2][2:]
        vv = int(row[3])
        key = (row[0], row[1], yy)
        dd[key].append([mm, yy, vv])

    # print("Total of all values", sum(int(row[3]) for row in rows))

    for k, v in dd.items():
        h1 = sum(c[2] for c in v if c[0] <= '06')
        h2 = sum(c[2] for c in v if c[0] > '06')
        tt = sum(c[2] for c in v)
        # or, much more simply, tt = h1   h2

        # print(k[0], k[1], k[2], "H1:", h1, "H2:", h2, "TT:", tt)
        print(f"{k[0]};{k[1]};HY1{k[2]};{h1}")
        print(f"{k[0]};{k[1]};HY2{k[2]};{h2}")
        print(f"{k[0]};{k[1]};TY{k[2]};{tt}")

Seems to give correct results for the data supplied. Might not be efficient if you have huge amounts of data. YMMV.

CodePudding user response：

This is a task that is very well suited for the pandas library, which is designed to work with tabular data. A way to do this in pandas would be something like this:

import pandas as pd
# Read the data from the textfile (here I saved it as data.csv).
# Make sure the TIME column is not read as an integer by declaring 
# dtype={'TIME': object}, because this would omit the leading zeroes, which 
# we need for conversion to datetime object
df = pd.read_csv('data.csv', delimiter=';',  dtype={'TIME': object})
# Convert TIME column to datetime
df['TIME'] = pd.to_datetime(df['TIME'])
# Create new column with year
df['Y'] = df['TIME'].dt.year
# Create new column with halfyear (1 = first halfyear, 2 = second halfyear)
df['HY'] = df['TIME'].dt.month.floordiv(7).add(1)

After this, your table looks like this:

df.head() # Just show the first couple of rows

    ID1     ID2 TIME        VALUE   Y       HY
0   1000    100 2021-01-20  12      2021    1
1   1000    100 2021-02-20  4129    2021    1
2   1000    100 2021-03-20  128     2021    1
3   1000    100 2021-04-20  412     2021    1
4   1000    100 2021-05-20  12818   2021    1

Getting the table into the desired format takes a bit of work, but grouping and aggregating then becomes really easy. You can then also perform other grouping and aggregating operations as you please without having to code it all from hand.

To group by year and calculate the sum:

df.groupby(['ID1', 'ID2', 'Y']).sum()

ID1     ID2 Y       VALUE  
1000    100 2021    24893
1000    100 2022    24893
1000    120 2021    24893
1000    120 2022    24893

To group by halfyear and calculate the sum:

df.groupby(['ID1', 'ID2', 'Y', 'HY']).sum()

ID1     ID2 Y       HY  VALUE  
1000    100 2021    1   18045
1000    100 2021    2   6848
1000    100 2022    1   18045
1000    100 2022    2   6848
1000    120 2021    1   18045
1000    120 2021    2   6848
1000    120 2022    1   18045
1000    120 2022    2   6848