I am struggling with a data analysis task that I would like to perform on a DataFrame in Python. This is the dataframe that I have:
df = pd.DataFrame({
    "Person": ["P1", "P1", "P1", "P1", "P1", "P1", "P1", "P1", "P1", "P1",
               "P2", "P2", "P2", "P2", "P2", "P2", "P2", "P2", "P2", "P2"],
    "Activity": ["A", "A", "A", "B", "A", "A", "A", "A", "A", "A",
                 "A", "A", "A", "B", "A", "A", "B", "A", "B", "A"],
    "Time": ["0", "0", "1", "1", "1", "3", "5", "5", "6", "6",
             "6", "6", "6", "6", "6", "6", "6", "6", "6", "6"],
})
I would like
- to find the number of groups with more than 2 consecutive repetitive activities "A" per Person and
- to calculate the average time of the consecutive repetitive "A"s, i.e. the sum of (end time minus start time) over the groups, divided by the number of groups.
I.e. the targeted resulting dataframe should look like this (AVGTime for P1 is calculated as ((1-0) + (6-1))/2 = 3):
solution = pd.DataFrame({
    "Person": ["P1", "P2"],
    "Activity": ["A", "A"],
    "Count": [2, 1],
    "AVGTime": [3, 0],
})
I understand there is kind of a close solution here: https://datascience-stackexchange-com.translate.goog/questions/41428/how-to-find-the-count-of-consecutive-same-string-values-in-a-pandas-dataframe?_x_tr_sl=en&_x_tr_tl=de&_x_tr_hl=de&_x_tr_pto=sc
However, that solution does not aggregate over a column, such as "Person" in my case. It also does not seem to perform well, given that my dataframe has about 7 million rows.
I would really appreciate any hint!
CodePudding user response:
You can process the data as a stream without creating an intermediate dataframe, so the computation fits into memory. I'd suggest trying the convtools library (I must confess: I'm the author).
Since you already have a dataframe, let's use it as an input:
import pandas as pd
from convtools import conversion as c
from convtools.contrib.tables import Table
# fmt: off
df = pd.DataFrame({
    "Person": ["P1", "P1", "P1", "P1", "P1", "P1", "P1", "P1", "P1", "P1",
               "P2", "P2", "P2", "P2", "P2", "P2", "P2", "P2", "P2", "P2"],
    "Activity": ["A", "A", "A", "B", "A", "A", "A", "A", "A", "A",
                 "A", "A", "A", "B", "A", "A", "B", "A", "B", "A"],
    "Time": ["0", "0", "1", "1", "1", "3", "5", "5", "6", "6",
             "6", "6", "6", "6", "6", "6", "6", "6", "6", "6"],
})
# fmt: on
# Transform the DataFrame into an iterable of dicts, so we don't allocate
# all rows at once the way df.to_dict("records") would
iter_rows = Table.from_rows(
    df.itertuples(index=False), header=list(df.columns)
).into_iter_rows(dict)
result = (
    # chunk by consecutive ("Person", "Activity") pairs
    c.chunk_by(c.item("Person"), c.item("Activity"))
    .aggregate(
        # each chunk gets transformed into a dict like this:
        {
            "Person": c.ReduceFuncs.First(c.item("Person")),
            "Activity": c.ReduceFuncs.First(c.item("Activity")),
            "length": c.ReduceFuncs.Count(),
            "time": (
                c.ReduceFuncs.Last(c.item("Time")).as_type(float)
                - c.ReduceFuncs.First(c.item("Time")).as_type(float)
            ),
        }
    )
    # remove short groups
    .filter(c.item("length") > 2)
    .pipe(
        # now group by ("Person", "Activity") pair to calculate the avg time
        c.group_by(c.item("Person"), c.item("Activity")).aggregate(
            {
                "Person": c.item("Person"),
                "Activity": c.item("Activity"),
                "avg_time": c.ReduceFuncs.Average(c.item("time")),
                "number_of_groups": c.ReduceFuncs.Count(),
            }
        )
    )
    # should you want to reuse this conversion multiple times, run
    # .gen_converter() to get a function and store it for further reuse
    .execute(iter_rows)
)
Result:
In [37]: result
Out[37]:
[{'Person': 'P1', 'Activity': 'A', 'avg_time': 3.0, 'number_of_groups': 2},
{'Person': 'P2', 'Activity': 'A', 'avg_time': 0.0, 'number_of_groups': 1}]
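For comparison, the same streaming idea can be sketched with only the standard library's itertools.groupby, assuming rows arrive in their original order as (person, activity, time) tuples with numeric times:

```python
from itertools import groupby

def consecutive_stats(rows):
    # rows: iterable of (person, activity, time) tuples in original order
    stats = {}  # (person, activity) -> [group_count, total_duration]
    for (person, activity), chunk in groupby(rows, key=lambda r: (r[0], r[1])):
        chunk = list(chunk)
        # keep only runs of more than 2 consecutive "A"s
        if activity == "A" and len(chunk) > 2:
            entry = stats.setdefault((person, activity), [0, 0.0])
            entry[0] += 1
            entry[1] += chunk[-1][2] - chunk[0][2]  # end time - start time
    return [
        {"Person": p, "Activity": a, "Count": n, "AVGTime": total / n}
        for (p, a), (n, total) in stats.items()
    ]

rows = [
    ("P1", "A", 0), ("P1", "A", 0), ("P1", "A", 1), ("P1", "B", 1),
    ("P1", "A", 1), ("P1", "A", 3), ("P1", "A", 5), ("P1", "A", 5),
    ("P1", "A", 6), ("P1", "A", 6), ("P2", "A", 6), ("P2", "A", 6),
    ("P2", "A", 6), ("P2", "B", 6), ("P2", "A", 6), ("P2", "A", 6),
    ("P2", "B", 6), ("P2", "A", 6), ("P2", "B", 6), ("P2", "A", 6),
]
result = consecutive_stats(rows)
```

Like the convtools version, this is a single pass that keeps only small per-group counters in memory, so it should scale to millions of rows.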
CodePudding user response:
Try:
def group_func(x):
    groups = []
    # split each person's rows into runs of identical consecutive activities
    for _, g in x.groupby((x["Activity"] != x["Activity"].shift()).cumsum()):
        if len(g) > 2 and g["Activity"].iat[0] == "A":
            groups.append(g)
    avgs = sum(g["Time"].max() - g["Time"].min() for g in groups) / len(groups)
    return pd.Series(
        ["A", len(groups), avgs], index=["Activity", "Count", "AVGTime"]
    )

df["Time"] = df["Time"].astype(int)
x = df.groupby("Person", as_index=False).apply(group_func)
print(x)
Prints:
Person Activity Count AVGTime
0 P1 A 2 3.0
1 P2 A 1 0.0
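Since groupby(...).apply calls a Python function per group, a fully vectorized variant may scale better to ~7 million rows; here is a sketch of the same logic using cumulative-sum run labels:

```python
import pandas as pd

df = pd.DataFrame({
    "Person": ["P1"] * 10 + ["P2"] * 10,
    "Activity": ["A", "A", "A", "B", "A", "A", "A", "A", "A", "A",
                 "A", "A", "A", "B", "A", "A", "B", "A", "B", "A"],
    "Time": [0, 0, 1, 1, 1, 3, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
})

# label each run of identical consecutive (Person, Activity) pairs
run_id = (
    (df["Person"] != df["Person"].shift())
    | (df["Activity"] != df["Activity"].shift())
).cumsum()

# one row per run: who, what, run length, and its start/end times
runs = df.groupby(run_id).agg(
    Person=("Person", "first"),
    Activity=("Activity", "first"),
    length=("Activity", "size"),
    start=("Time", "first"),
    end=("Time", "last"),
)

# keep only "A" runs longer than 2, then average (end - start) per person
runs = runs[(runs["Activity"] == "A") & (runs["length"] > 2)]
runs = runs.assign(duration=runs["end"] - runs["start"])
out = runs.groupby(["Person", "Activity"], as_index=False).agg(
    Count=("duration", "size"), AVGTime=("duration", "mean")
)
print(out)
```

This avoids any per-row Python-level loop, so the heavy lifting stays inside pandas/NumPy.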