pandas groupby performance / combine 2 functions

Time:03-05

I am learning Python and trying to understand best practices for querying data. Here is some dummy data (customer sales) to test with:

import pandas as pd

df = pd.DataFrame({
    'Name': ['tom', 'bob', 'bob', 'jack', 'jack', 'jack'],
    'Amount': [3, 2, 5, 1, 10, 100],
    'Date': ["01.02.2022", "02.02.2022", "03.02.2022", "01.02.2022", "03.02.2022", "05.02.2022"],
})
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')

I want to investigate 2 kinds of queries:

  1. How long has a person been our customer?
  2. What is the period between a person's first and last purchase?

How can I run the first query without writing loops manually?

This is what I have done so far for the second query:

result = df.groupby("Name").max() - df.groupby("Name").min()

Is it possible to combine these two groupby queries into one to improve the performance?

P.S. I am trying to understand pandas and the key concepts behind optimizing queries. Different approaches and explanations are highly appreciated.

User response:

You can use GroupBy.agg with a custom function to get the difference between the max and min date.

df.groupby('Name')['Date'].agg(lambda x: x.max()-x.min())

Because the Date column already has a datetime dtype, this yields a Series of Timedelta values, which are displayed by default in the form 'x days'.
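One caveat with the lambda: it is called in Python once per group, which tends to be slower on large frames than pandas' built-in reducers. A minimal sketch of an alternative, assuming the same dummy data as in the question, is to request 'min' and 'max' by name in a single agg call and subtract the resulting columns. The names bounds and period are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['tom', 'bob', 'bob', 'jack', 'jack', 'jack'],
    'Amount': [3, 2, 5, 1, 10, 100],
    'Date': ["01.02.2022", "02.02.2022", "03.02.2022",
             "01.02.2022", "03.02.2022", "05.02.2022"],
})
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')

# One agg call computes both reducers; passing the names as strings
# lets pandas use its built-in implementations instead of a Python lambda.
bounds = df.groupby('Name')['Date'].agg(['min', 'max'])

# Vectorized subtraction of the two result columns gives one Timedelta per name.
period = bounds['max'] - bounds['min']
print(period)
```

Whether this is measurably faster depends on the number of groups, so it is worth timing both versions on your real data.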

You can also save the GroupBy object in a variable and reuse it. This way, computation of the groups occurs only once:

g = df.groupby("Name")['Date']
g.max() - g.min()

output:

Name
bob    1 days
jack   4 days
tom    0 days
Name: Date, dtype: timedelta64[ns]
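
For the first query, here is a rough sketch that assumes "how long has a person been our customer" means the time elapsed between the first purchase and some reference date. The reference_date value and the column names first_purchase, last_purchase, customer_for and active_period are hypothetical, chosen only for illustration. It uses named aggregation, so both queries come out of a single groupby:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['tom', 'bob', 'bob', 'jack', 'jack', 'jack'],
    'Amount': [3, 2, 5, 1, 10, 100],
    'Date': ["01.02.2022", "02.02.2022", "03.02.2022",
             "01.02.2022", "03.02.2022", "05.02.2022"],
})
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')

# Named aggregation: each keyword becomes an output column,
# defined as (source column, reducer name).
summary = df.groupby('Name').agg(
    first_purchase=('Date', 'min'),
    last_purchase=('Date', 'max'),
)

# Hypothetical reference date; in practice this could be pd.Timestamp.now().
reference_date = pd.Timestamp('2022-02-10')

# Query 1 (assumed meaning): time since the first purchase.
summary['customer_for'] = reference_date - summary['first_purchase']
# Query 2: time between the first and last purchase.
summary['active_period'] = summary['last_purchase'] - summary['first_purchase']
print(summary)
```

If "customer duration" should mean something else in your case (for example, time until the last purchase), only the subtraction lines need to change.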