Pandas Dataframe - How to calculate the difference between the first and last row and sum it per reoccurring category


I have a series of data processing steps, as below:

  • I have two lists which contain the data I need.
  • I append the lists into a new list. [tableList]
  • I convert the list into a dataframe and export it to a CSV file. [tableDf]
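
Roughly, the pipeline looks like this (list1 and list2 stand in for my two source lists; names are placeholders):

import pandas as pd

# the two source lists are assumed to hold row tuples:
# (Category, CategoryName, time(s), Power, Vapor)
tableList = [*list1, *list2]
tableDf = pd.DataFrame(
    tableList,
    columns=['Category', 'CategoryName', 'time(s)', 'Power', 'Vapor'],
)
tableDf.to_csv('table.csv', index=False)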

Here are the simplified contents of tableList:

Category      CategoryName     time(s)      Power       Vapor
   1               A          1625448301   593233.36    3353.92
   1               A          1625449552   595156.24    3286.8
   1               A          1625450802   593833.36    3855.42
   2               B          1625452051   595233.37    3353.95
   2               B          1625453301   593535.86    3252.92
   2               B          1625454552   593473.36    3364.15
   3               C          1625455802   593754.32    3233.92
   3               C          1625457052   593153.46    3563.52
   3               C          1625458301   593854.56    3334.94
   4               D          1625459552   593345.75    3353.36
   4               D          1625460802   592313.24    3674.95
   4               D          1625460802   592313.24    3673.35
   1               A          1625463301   597313.23    3658.46
   1               A          1625464552   595913.68    3789.45
   ....

The data is divided by category, and the categories do not always occur in the same pattern.
Note: the time(s) column contains Unix timestamps.

Here's the planned result that I want to achieve:

Category      CategoryName    TotalTime(s)           Power           Vapor
       1           A          (Total time diff 1) (Power SUM 1)    (Vapor SUM 1)
       2           B          (Total time diff 2) (Power SUM 2)    (Vapor SUM 2)
       3           C          (Total time diff 3) (Power SUM 3)    (Vapor SUM 3)
       4           D          (Total time diff 4) (Power SUM 4)    (Vapor SUM 4)

The data is grouped per category; the sums of Power and Vapor can be achieved simply by using the sum function on the grouped categories. I am stuck at calculating the total time.
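
The sums part alone can be sketched like this (assuming tableDf is already loaded):

sums = tableDf.groupby(['Category', 'CategoryName'], as_index=False)[['Power', 'Vapor']].sum()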

For example, in the first occurrence of category 1, the difference between the last and first time is 2501 (1625450802 - 1625448301).

In the next occurrence of category 1, the diff between last and first is 2600. All those diff values are combined to create Total time diff 1.

I have tried using pd.diff() and the answer from this other question:

tableDf['TotalTime(s)'] = tableDf.groupby('Category')['time(s)'].transform(lambda x: x.iat[-1] - x.iat[0])

But that approach only takes the difference between the overall last and first row of category 1 (for the rows shown, 1625464552 - 1625448301 = 16251), spanning both occurrence blocks, which results in the wrong total time.

Any solution or suggestion to calculate the difference between the last and first row per occurrence of each category?

CodePudding user response:

Just to provide a convtools-based alternative option:

from convtools import conversion as c
from convtools.contrib.tables import Table


# this is an ad hoc converter function; consider generating it once and reusing
# further
converter = (
    c.chunk_by(c.item("Category"))
    .aggregate(
        {
            "Category": c.ReduceFuncs.First(c.this).item("Category"),
            "CategoryName": c.ReduceFuncs.First(c.this).item("CategoryName"),
            "TotalTime(s)": (
                c.ReduceFuncs.Last(c.this).item("time(s)")
                - c.ReduceFuncs.First(c.this).item("time(s)")
            ),
            "Power": c.ReduceFuncs.Sum(c.item("Power")),
            "Vapor": c.ReduceFuncs.Sum(c.item("Vapor")),
        }
    )
    .gen_converter()
)

column_types = {
    "time(s)": int,
    "Power": float,
    "Vapor": float,
}

# this is iterable, so can be consumed only once
prepared_rows_iter = (
    Table.from_csv("tmp4.csv", header=True)
    # casting column types
    .update(
        **{
            column_name: c.col(column_name).as_type(column_type)
            for column_name, column_type in column_types.items()
        }
    ).into_iter_rows(dict)
)

# if list of dicts is needed
result = list(converter(prepared_rows_iter))
assert result == [
    { "Category": "1", "CategoryName": "A", "TotalTime(s)": 2501, "Power": 1782222.96, "Vapor": 10496.14, },
    { "Category": "2", "CategoryName": "B", "TotalTime(s)": 2501, "Power": 1782242.5899999999, "Vapor": 9971.02, },
    { "Category": "3", "CategoryName": "C", "TotalTime(s)": 2499, "Power": 1780762.3399999999, "Vapor": 10132.380000000001, },
    { "Category": "4", "CategoryName": "D", "TotalTime(s)": 1250, "Power": 1777972.23, "Vapor": 10701.66, },
    { "Category": "1", "CategoryName": "A", "TotalTime(s)": 1251, "Power": 1193226.9100000001, "Vapor": 7447.91, },
]

# if csv file is needed
# Table.from_rows(converter(prepared_rows_iter)).into_csv("out.csv")

CodePudding user response:

Here is a solution with datar, which re-imagines pandas' APIs:

Construct data
>>> from datar.all import f, tribble, group_by, summarise, first, last, sum, relocate
[2022-03-23 10:11:46][datar][WARNING] Builtin name "sum" has been overriden by datar.
>>> 
>>> df = tribble(
...     f.Category,  f.CategoryName, f["time(s)"], f.Power,   f.Vapor,
...     1,           "A",            1625448301,   593233.36, 3353.92,
...     1,           "A",            1625449552,   595156.24, 3286.8,
...     1,           "A",            1625450802,   593833.36, 3855.42,
...     2,           "B",            1625452051,   595233.37, 3353.95,
...     2,           "B",            1625453301,   593535.86, 3252.92,
...     2,           "B",            1625454552,   593473.36, 3364.15,
...     3,           "C",            1625455802,   593754.32, 3233.92,
...     3,           "C",            1625457052,   593153.46, 3563.52,
...     3,           "C",            1625458301,   593854.56, 3334.94,
...     4,           "D",            1625459552,   593345.75, 3353.36,
...     4,           "D",            1625460802,   592313.24, 3674.95,
...     4,           "D",            1625460802,   592313.24, 3673.35,
... )

Manipulate data
>>> (
...     df 
...     >> group_by(f.Category) 
...     >> summarise(
...         Power=sum(f.Power),
...         Vapor=sum(f.Vapor),
...         CategoryName=first(f.CategoryName),
...         **{
...             "TotalTime(s)": last(f["time(s)"]) - first(f["time(s)"]),
...         }
...     ) 
...     >> relocate(f.CategoryName, f["TotalTime(s)"], _after=f.Category)
... )
   Category CategoryName  TotalTime(s)       Power     Vapor
    <int64>     <object>       <int64>   <float64> <float64>
0         1            A          2501  1782222.96  10496.14
1         2            B          2501  1782242.59   9971.02
2         3            C          2499  1780762.34  10132.38
3         4            D          1250  1777972.23  10701.66

CodePudding user response:

You can readily do this in pandas; you just need a little groupby trick to create groupings from consecutive categories, and then apply your operations:

consec_groupings = (
    df['Category'].shift()
    .ne(df['Category'])                 # True at the first row of each consecutive block
    .groupby(df['Category']).cumsum()   # occurrence counter within each category
    .rename('Consec_Category')
)
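
As a quick sanity check: for the 14 sample rows shown in the question, this grouper numbers the second block of Category 1 separately (a sketch, assuming df holds those rows):

print(consec_groupings.tolist())
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2]
#                                     ^  ^ second occurrence of Category 1

With that grouper in hand, aggregate per block: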

intermediate = (
    df.groupby(['Category', 'CategoryName', consec_groupings])
    .agg({'time(s)': ['first', 'last'], 'Power': 'sum', 'Vapor': 'sum'})
)

intermediate[('time(s)', 'delta')] = (
    intermediate[('time(s)', 'last')] - intermediate[('time(s)', 'first')]
)

print(intermediate)
                                          time(s)                   Power     Vapor time(s)
                                            first        last         sum       sum   delta
Category CategoryName Consec_Category                                                      
1        A            1                1625448301  1625450802  1782222.96  10496.14    2501
                      2                1625463301  1625464552  1193226.91   7447.91    1251
2        B            1                1625452051  1625454552  1782242.59   9971.02    2501
3        C            1                1625455802  1625458301  1780762.34  10132.38    2499
4        D            1                1625459552  1625460802  1777972.23  10701.66    1250

Then from that intermediate product, you can calculate the final output quite readily:

final = (
    intermediate[[('time(s)', 'delta'), ('Power', 'sum'), ('Vapor', 'sum')]]
    .droplevel(level=1, axis=1)
    .groupby(['Category', 'CategoryName']).sum()
)

print(final)
                       time(s)       Power     Vapor
Category CategoryName                               
1        A                3752  2975449.87  17944.05
2        B                2501  1782242.59   9971.02
3        C                2499  1780762.34  10132.38
4        D                1250  1777972.23  10701.66
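
For reference, the same idea can be condensed into a sketch that skips the intermediate first/last columns (same consec_groupings as above, not part of the original answer):

per_block = (
    df.groupby(['Category', 'CategoryName', consec_groupings])['time(s)']
    .agg(lambda s: s.iat[-1] - s.iat[0])  # last-minus-first within each consecutive block
)
total_time = per_block.groupby(['Category', 'CategoryName']).sum()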