Get value counts and date difference for groupby

I have a dataframe that looks something like this:

import pandas as pd

df = pd.DataFrame(
    [[1,'A','X','1/2/22 12:00:00AM'],
    [1,'A','X','1/3/22 12:00:00AM'],
    [1,'A','X','1/1/22 12:00:00AM'],
    [1,'A','X','1/2/22 1:00:00AM'],
    [1,'B','Y','1/1/22 1:00:00AM'],
    [2,'A','Z','1/2/22 12:00:00AM'],
    [2,'A','Z','1/1/22 12:00:00AM']],
    columns=['ID', 'Category', 'Site', 'Task Completed'])

ID	Category	Site	Task Completed
1	A	X	1/2/22 12:00:00AM
1	A	X	1/3/22 12:00:00AM
1	A	X	1/1/22 12:00:00AM
1	A	X	1/2/22 1:00:00AM
1	B	Y	1/1/22 1:00:00AM
2	A	Z	1/2/22 12:00:00AM
2	A	Z	1/1/22 12:00:00AM

As you can see, there can be multiple Task Completed dates for an ID/Category/Site combo.

What I want to find is the time difference (in days) between the first (min) Task Completed date and the last (max) task completed date for every ID/Category/Site combination within the dataset. I also want to find the number of instances for each ID/Category/Site combo. The intended result would look something like this:

ID	Category	Site	Time Difference	# of instances
1	A	X	2	4
1	B	Y	0	1
2	A	Z	1	2

So far, I know how to get the time difference and the value counts separately:

# keep only the calendar date of each timestamp
df['Task Completed'] = pd.to_datetime(df['Task Completed'], utc=True).apply(lambda x: x.date())
# first and last completion date per ID/Category/Site combo
result = df.groupby(['ID', 'Category', 'Site'])['Task Completed'].agg(['max','min'])
result['diff'] = result['max']-result['min']
values = df.groupby(['ID', 'Category', 'Site'])['Task Completed'].value_counts()

But I'm not sure how to get the value counts and time differences together.

Thanks so much for your help.

CodePudding user response：

Try:

# convert the "Task Completed" column to datetime:
df["Task Completed"] = pd.to_datetime(df["Task Completed"], dayfirst=False)


x = df.groupby(["ID", "Category", "Site"], as_index=False).agg(
    **{
        "Time Difference": (
            "Task Completed",
            lambda x: (x.max() - x.min()).days,
        ),
        "# of instances": ("Task Completed", "count"),
    }
)

print(x)

Prints:

   ID Category Site  Time Difference  # of instances
0   1        A    X                2               4
1   1        B    Y                0               1
2   2        A    Z                1               2
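One detail worth checking: `.days` on a Timedelta counts whole elapsed 24-hour periods, so two timestamps on neighbouring dates but less than 24 hours apart give 0. If calendar days are what is wanted (the question's own attempt strips the time with `.date()`), a minimal sketch is to normalize the timestamps to midnight first. The two-row frame and the helper column `day` below are made up for illustration:

```python
import pandas as pd

# Hypothetical data: two tasks two hours apart, on different calendar dates
df = pd.DataFrame(
    [[1, "A", "X", "2022-01-01 23:00:00"],
     [1, "A", "X", "2022-01-02 01:00:00"]],
    columns=["ID", "Category", "Site", "Task Completed"],
)
df["Task Completed"] = pd.to_datetime(df["Task Completed"])

# Elapsed time is only two hours, so .days is 0
elapsed_days = (df["Task Completed"].max() - df["Task Completed"].min()).days

# Drop the time of day so the subtraction counts calendar days instead
df["day"] = df["Task Completed"].dt.normalize()

out = df.groupby(["ID", "Category", "Site"], as_index=False).agg(
    **{
        "Time Difference": ("day", lambda s: (s.max() - s.min()).days),
        "# of instances": ("Task Completed", "count"),
    }
)
print(out)
```

With the sample data from the question both variants give the same numbers, because the first and last timestamps in each group fall at the same time of day.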

CodePudding user response：

df = pd.DataFrame(
    [[1,'A','X','1/2/22 12:00:00AM'], 
    [1,'A','X','1/3/22 12:00:00AM'], 
    [1,'A','X','1/1/22 12:00:00AM'], 
    [1,'A','X','1/2/22 1:00:00AM'], 
    [1,'B','Y','1/1/22 1:00:00AM'],
    [2,'A','Z','1/2/22 12:00:00AM'],
    [2,'A','Z','1/1/22 12:00:00AM']], 
    columns=['ID', 'Category', 'Site', 'Task Completed'])

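df['Task Completed'] = pd.to_datetime(df['Task Completed'])
# replace each timestamp with its day number within the year (1 to 366);
# note: differences of these numbers are only valid when all dates in a group fall in the same year
df['Task Completed'] = df['Task Completed'].apply(lambda x: x.day_of_year)
# day difference between the last and first task per ID/Category/Site combo
df_res = df.groupby(['ID','Category','Site'], as_index=False)['Task Completed'].apply(lambda x : x.max()-x.min())
df_res.rename(columns = {'Task Completed':'Day Diff'}, inplace = True)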
df_res

df_res returns

    ID Category Site Day Diff
0   1    A      X    2
1   1    B      Y    0
2   2    A      Z    1
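A caveat with the day-of-year approach: it works for the sample data because every date is in 2022, but it gives misleading numbers once a group spans a year boundary. A small sketch with two made-up dates shows the difference from subtracting the timestamps directly:

```python
import pandas as pd

# Hypothetical dates on either side of New Year
ts = pd.to_datetime(pd.Series(["2021-12-31", "2022-01-02"]))

# Day-of-year values are 365 and 2, so their spread is 363
doy = ts.dt.dayofyear
doy_diff = doy.max() - doy.min()

# Subtracting the timestamps themselves gives the real gap of 2 days
true_diff = (ts.max() - ts.min()).days

print(doy_diff, true_diff)
```

This answer also does not return the number of instances per group, which the question asks for.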

CodePudding user response：

pandas' groupby is lazy; this means you can reuse it multiple times after creating it:

df["Task Completed"] = pd.to_datetime(df["Task Completed"], dayfirst=False)
out = df.groupby(['ID', 'Category', 'Site'])['Task Completed']
(out
.agg(['size']) # use a list so that a DataFrame is returned
.assign(time_difference = out.max().sub(out.min()).dt.days)
.rename(columns={'size':'# of instances'})
) 
                  # of instances  time_difference
ID Category Site                                 
1  A        X                  4                2
   B        Y                  1                0
2  A        Z                  2                1
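If the flat layout from the question is preferred over the MultiIndex, the same reused groupby can be followed by a column reorder and `reset_index()`. This is only a sketch; the explicit `format` string is my assumption about how the sample timestamps are written, and the names `g` and `flat` are arbitrary:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, "A", "X", "1/2/22 12:00:00AM"],
     [1, "A", "X", "1/3/22 12:00:00AM"],
     [1, "A", "X", "1/1/22 12:00:00AM"],
     [1, "A", "X", "1/2/22 1:00:00AM"],
     [1, "B", "Y", "1/1/22 1:00:00AM"],
     [2, "A", "Z", "1/2/22 12:00:00AM"],
     [2, "A", "Z", "1/1/22 12:00:00AM"]],
    columns=["ID", "Category", "Site", "Task Completed"],
)
# Assumed format: month/day/two-digit year, 12-hour clock with AM/PM
df["Task Completed"] = pd.to_datetime(
    df["Task Completed"], format="%m/%d/%y %I:%M:%S%p"
)

# Build the groupby once and reuse it for size, max and min
g = df.groupby(["ID", "Category", "Site"])["Task Completed"]

flat = (
    g.agg(["size"])
    .assign(**{"Time Difference": g.max().sub(g.min()).dt.days})
    .rename(columns={"size": "# of instances"})
    [["Time Difference", "# of instances"]]  # column order from the question
    .reset_index()  # turn the group keys back into ordinary columns
)
print(flat)
```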