Filter groupby by combos with either category types-CodePudding

I have a dataset that looks something like this:

df = pd.DataFrame(
[[1,'A','X','1/2/22 12:00:00AM', 'Alpha'], 
[1,'A','X','1/3/22 12:00:00AM', 'Alpha'], 
[1,'A','X','1/1/22 12:00:00AM', 'Beta'], 
[1,'A','X','1/2/22 1:00:00AM', 'Gamma'], 
[1,'B','Y','1/1/22 1:00:00AM', 'Alpha'],
[2,'A','Z','1/2/22 12:00:00AM', 'Alpha'],
[2,'A','Z','1/1/22 12:00:00AM', 'Alpha'], 
[2,'A','Z','1/1/22 12:00:00AM', 'Beta'], 
columns=['ID', 'Category', 'Site', 'Task Completed', 'Type'])

ID	Category	Site	Task Completed	Type
1	A	X	1/2/22 12:00:00AM	Alpha
1	A	X	1/3/22 12:00:00AM	Alpha
1	A	X	1/1/22 12:00:00AM	Beta
1	A	X	1/2/22 1:00:00AM	Gamma
1	B	Y	1/1/22 1:00:00AM	Alpha
2	A	Z	1/2/22 12:00:00AM	Alpha
2	A	Z	1/1/22 12:00:00AM	Alpha
2	A	Z	1/1/22 12:00:00AM	Beta

I want to find the Max - Min Task Completed date for all ID/Category/Site combos with type 'Alpha'. For the combo to be counted, they also need to have at least one other type besides 'Alpha'. I also want to count the instances of the 'Alphas' for the ID/Category/Site combos.

So, for this dataset, my intended result would look like this:

ID	Category	Site	Time Difference	# of instances
1	A	X	1	2
2	A	Z	1	2

I know how to get the counts and time difference if 'Type' is not considered:

# convert the "Task Completed" column to datetime:
df["Task Completed"] = pd.to_datetime(df["Task Completed"], dayfirst=False)


x = df.groupby(["ID", "Category", "Site"], as_index=False).agg(
    **{
        "Time Difference": (
            "Task Completed",
            lambda x: (x.max() - x.min()).days,
        ),
        "# of instances": ("Task Completed", "count"),
    }
)

print(x)

Which prints

   ID Category Site  Time Difference  # of instances
0   1        A    X                2               4
1   1        B    Y                0               1
2   2        A    Z                1               2

But I can't figure out how to consider 'type' as well.

Thank you for your help

CodePudding user response：

aaa = []

def my_func(x):
    ind = x['Type'] == 'Alpha'
    alf = x.loc[ind, 'Type'].count()
    if alf >= 2 and len(x[x['Type'] != 'Alpha']) != 0:
        difference = (x.loc[ind, 'Task Completed'].max() - x.loc[ind, 'Task Completed'].min()).days
        bbb = x[ind][['ID', 'Category', 'Site']].values[0]
        bbb = np.insert(bbb, 3, [difference, alf])
        aaa.append(bbb)



df.groupby(["ID", "Category", "Site"], as_index=False).apply(my_func)

df1 = pd.DataFrame(aaa, columns=['ID', 'Category', 'Site', 'Time Difference', '# of instances'])

print(df1)

Output

   ID Category Site  Time Difference  # of instances
0   1        A    X                1               2
1   2        A    Z                1               2

Made a function that I passed when grouping. The variable alf counts how many lines with 'Alpha' . If there are two or more rows with 'Alpha' and at least one row is not 'Alpha', then we read the data. The result is stored in the aaa list, which is inserted into a new dataframe.