I have a data processing flow as below:
- I have two lists which contain the data I need.
- I append the lists into a new list. [tableList]
- I convert that list into a dataframe and export it to a CSV file. [tableDf]
Here are the simplified contents of tableList:
Category CategoryName time(s) Power Vapor
1 A 1625448301 593233.36 3353.92
1 A 1625449552 595156.24 3286.8
1 A 1625450802 593833.36 3855.42
2 B 1625452051 595233.37 3353.95
2 B 1625453301 593535.86 3252.92
2 B 1625454552 593473.36 3364.15
3 C 1625455802 593754.32 3233.92
3 C 1625457052 593153.46 3563.52
3 C 1625458301 593854.56 3334.94
4 D 1625459552 593345.75 3353.36
4 D 1625460802 592313.24 3674.95
4 D 1625460802 592313.24 3673.35
1 A 1625463301 597313.23 3658.46
1 A 1625464552 595913.68 3789.45
....
The data is divided by category, and the categories do not always occur in the same pattern.
Note: the time(s) column contains Unix timestamps.
Here's the planned result that I want to achieve:
Category CategoryName TotalTime(s) Power Vapor
1 A (Total time diff 1) (Power SUM 1) (Vapor SUM 1)
2 B (Total time diff 2) (Power SUM 2) (Vapor SUM 2)
3 C (Total time diff 3) (Power SUM 3) (Vapor SUM 3)
4 D (Total time diff 4) (Power SUM 4) (Vapor SUM 4)
The data is grouped per category; the sums of Power and Vapor are easily obtained with a SUM over each category group. I am stuck at calculating the total time.
For example, in the first occurrence of category 1, the difference between the last and first time is 2501 (1625450802 - 1625448301).
In the next occurrence of category 1, the diff between last and first is 2600. All those diff values are added together to form Total time diff 1.
I have tried using pd.diff() and the answer from this other question:
tableDf['TotalTime(s)'] = tableDf.groupby('Category')['time(s)'].transform(lambda x: x.iat[-1] - x.iat[0])
But both approaches use only the very last and very first rows of category 1 as a whole, which results in a wrong total time.
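To illustrate with only the rows shown above, the transform gives a single difference spanning both occurrences instead of one difference per occurrence:

# what the transform yields for category 1 (last row minus first row overall):
1625464552 - 1625448301  # = 16251
# what I actually want (one diff per occurrence, then summed):
(1625450802 - 1625448301) + (1625464552 - 1625463301)  # = 2501 + 1251 = 3752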
Any solution or suggestion for calculating the difference between the last and first rows of each occurrence of a category?
CodePudding user response:
Just to provide a convtools-based alternative option:
from convtools import conversion as c
from convtools.contrib.tables import Table
# this is an ad hoc converter function; consider generating it once and reusing
# further
converter = (
    c.chunk_by(c.item("Category"))
    .aggregate(
        {
            "Category": c.ReduceFuncs.First(c.this).item("Category"),
            "CategoryName": c.ReduceFuncs.First(c.this).item("CategoryName"),
            "TotalTime(s)": (
                c.ReduceFuncs.Last(c.this).item("time(s)")
                - c.ReduceFuncs.First(c.this).item("time(s)")
            ),
            "Power": c.ReduceFuncs.Sum(c.item("Power")),
            "Vapor": c.ReduceFuncs.Sum(c.item("Vapor")),
        }
    )
    .gen_converter()
)
column_types = {
    "time(s)": int,
    "Power": float,
    "Vapor": float,
}
# this is iterable, so can be consumed only once
prepared_rows_iter = (
    Table.from_csv("tmp4.csv", header=True)
    # casting column types
    .update(
        **{
            column_name: c.col(column_name).as_type(column_type)
            for column_name, column_type in column_types.items()
        }
    ).into_iter_rows(dict)
)
# if list of dicts is needed
result = list(converter(prepared_rows_iter))
assert result == [
    {"Category": "1", "CategoryName": "A", "TotalTime(s)": 2501, "Power": 1782222.96, "Vapor": 10496.14},
    {"Category": "2", "CategoryName": "B", "TotalTime(s)": 2501, "Power": 1782242.5899999999, "Vapor": 9971.02},
    {"Category": "3", "CategoryName": "C", "TotalTime(s)": 2499, "Power": 1780762.3399999999, "Vapor": 10132.380000000001},
    {"Category": "4", "CategoryName": "D", "TotalTime(s)": 1250, "Power": 1777972.23, "Vapor": 10701.66},
    {"Category": "1", "CategoryName": "A", "TotalTime(s)": 1251, "Power": 1193226.9100000001, "Vapor": 7447.91},
]
# if csv file is needed
# Table.from_rows(converter(prepared_rows_iter)).into_csv("out.csv")
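Note that chunk_by intentionally keeps repeated occurrences apart, which is why category "1" appears twice in result. If the combined per-category totals from the question are the goal, the chunk rows can be folded once more with a group_by-based converter; a minimal sketch reusing the names above:

# second pass (sketch): one row per category, runs of the same category summed up
totals_converter = (
    c.group_by(c.item("Category"))
    .aggregate(
        {
            "Category": c.item("Category"),
            "CategoryName": c.ReduceFuncs.First(c.item("CategoryName")),
            "TotalTime(s)": c.ReduceFuncs.Sum(c.item("TotalTime(s)")),
            "Power": c.ReduceFuncs.Sum(c.item("Power")),
            "Vapor": c.ReduceFuncs.Sum(c.item("Vapor")),
        }
    )
    .gen_converter()
)
totals = totals_converter(result)  # category "1" becomes 2501 + 1251 = 3752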
CodePudding user response:
Here is a solution with datar, which re-imagines pandas' APIs:
Construct data
>>> from datar.all import f, tribble, group_by, summarise, first, last, sum, relocate
[2022-03-23 10:11:46][datar][WARNING] Builtin name "sum" has been overriden by datar.
>>>
>>> df = tribble(
...     f.Category, f.CategoryName, f["time(s)"], f.Power, f.Vapor,
...     1, "A", 1625448301, 593233.36, 3353.92,
...     1, "A", 1625449552, 595156.24, 3286.8,
...     1, "A", 1625450802, 593833.36, 3855.42,
...     2, "B", 1625452051, 595233.37, 3353.95,
...     2, "B", 1625453301, 593535.86, 3252.92,
...     2, "B", 1625454552, 593473.36, 3364.15,
...     3, "C", 1625455802, 593754.32, 3233.92,
...     3, "C", 1625457052, 593153.46, 3563.52,
...     3, "C", 1625458301, 593854.56, 3334.94,
...     4, "D", 1625459552, 593345.75, 3353.36,
...     4, "D", 1625460802, 592313.24, 3674.95,
...     4, "D", 1625460802, 592313.24, 3673.35,
... )
Manipulate data
>>> (
...     df
...     >> group_by(f.Category)
...     >> summarise(
...         Power=sum(f.Power),
...         Vapor=sum(f.Vapor),
...         CategoryName=first(f.CategoryName),
...         **{
...             "TotalTime(s)": last(f["time(s)"]) - first(f["time(s)"]),
...         }
...     )
...     >> relocate(f.CategoryName, f["TotalTime(s)"], _after=f.Category)
... )
  Category CategoryName  TotalTime(s)       Power     Vapor
   <int64>     <object>       <int64>   <float64> <float64>
0        1            A          2501  1782222.96  10496.14
1        2            B          2501  1782242.59   9971.02
2        3            C          2499  1780762.34  10132.38
3        4            D          1250  1777972.23  10701.66
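Note that the data constructed above stops before category 1 re-occurs, so a single group_by suffices. If repeated occurrences must first be measured separately (as in the question), a run identifier can be added before grouping. A minimal sketch, assuming datar also exposes the dplyr-style lag and base cumsum verbs:

from datar.all import mutate, lag, cumsum

per_run = (
    df
    >> mutate(run=cumsum(f.Category != lag(f.Category)))  # new run id at each category change
    >> group_by(f.Category, f.run)
    >> summarise(
        CategoryName=first(f.CategoryName),
        delta=last(f["time(s)"]) - first(f["time(s)"]),
        Power=sum(f.Power),
        Vapor=sum(f.Vapor),
    )
)
# summarise() drops the innermost grouping, so per_run is still grouped by Category
totals = per_run >> summarise(
    CategoryName=first(f.CategoryName),
    **{"TotalTime(s)": sum(f.delta)},
    Power=sum(f.Power),
    Vapor=sum(f.Vapor),
)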
CodePudding user response:
You can readily do this in pandas; you just need a little groupby trick to create groupings from consecutive categories, and then apply your operations:
consec_groupings = (
    df['Category'].shift()
    .ne(df['Category'])                # True at each row where the category changes
    .groupby(df['Category']).cumsum()  # per-category counter of its occurrences
    .rename('Consec_Category')
)
intermediate = (
    df.groupby(['Category', 'CategoryName', consec_groupings])
    .agg({'time(s)': ['first', 'last'], 'Power': 'sum', 'Vapor': 'sum'})
)
intermediate[('time(s)', 'delta')] = (
    intermediate[('time(s)', 'last')] - intermediate[('time(s)', 'first')]
)
print(intermediate)
                                         time(s)                   Power     Vapor time(s)
                                           first        last         sum       sum   delta
Category CategoryName Consec_Category
1        A            1               1625448301  1625450802  1782222.96  10496.14    2501
                      2               1625463301  1625464552  1193226.91   7447.91    1251
2        B            1               1625452051  1625454552  1782242.59   9971.02    2501
3        C            1               1625455802  1625458301  1780762.34  10132.38    2499
4        D            1               1625459552  1625460802  1777972.23  10701.66    1250
Then from that intermediate product, you can calculate the final output quite readily:
final = (
    intermediate[[('time(s)', 'delta'), ('Power', 'sum'), ('Vapor', 'sum')]]
    .droplevel(level=1, axis=1)
    .groupby(['Category', 'CategoryName']).sum()
)
print(final)
                      time(s)       Power     Vapor
Category CategoryName
1        A               3752  2975449.87  17944.05
2        B               2501  1782242.59   9971.02
3        C               2499  1780762.34  10132.38
4        D               1250  1777972.23  10701.66
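For reference, the same two-stage computation can be written a little more compactly with named aggregation; a sketch in which the hypothetical run series plays the role of Consec_Category above:

# a run id that increases at every category change (globally, which is fine
# because Category is also part of the groupby key)
run = df['Category'].ne(df['Category'].shift()).cumsum().rename('run')

per_run = df.groupby(['Category', 'CategoryName', run]).agg(
    delta=('time(s)', lambda s: s.iat[-1] - s.iat[0]),  # last - first within the run
    Power=('Power', 'sum'),
    Vapor=('Vapor', 'sum'),
)
final = per_run.groupby(['Category', 'CategoryName']).sum()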