I have a dataset that looks something like this:
df = pd.DataFrame(
[[1,'A','X','1/2/22 12:00:00AM', 'Alpha'],
[1,'A','X','1/3/22 12:00:00AM', 'Alpha'],
[1,'A','X','1/1/22 12:00:00AM', 'Beta'],
[1,'A','X','1/2/22 1:00:00AM', 'Gamma'],
[1,'B','Y','1/1/22 1:00:00AM', 'Alpha'],
[2,'A','Z','1/2/22 12:00:00AM', 'Alpha'],
[2,'A','Z','1/1/22 12:00:00AM', 'Alpha'],
[2,'A','Z','1/1/22 12:00:00AM', 'Beta'],
columns=['ID', 'Category', 'Site', 'Task Completed', 'Type'])
ID | Category | Site | Task Completed | Type |
---|---|---|---|---|
1 | A | X | 1/2/22 12:00:00AM | Alpha |
1 | A | X | 1/3/22 12:00:00AM | Alpha |
1 | A | X | 1/1/22 12:00:00AM | Beta |
1 | A | X | 1/2/22 1:00:00AM | Gamma |
1 | B | Y | 1/1/22 1:00:00AM | Alpha |
2 | A | Z | 1/2/22 12:00:00AM | Alpha |
2 | A | Z | 1/1/22 12:00:00AM | Alpha |
2 | A | Z | 1/1/22 12:00:00AM | Beta |
I want to find the Max - Min Task Completed date for all ID/Category/Site combos with type 'Alpha'. For the combo to be counted, they also need to have at least one other type besides 'Alpha'. I also want to count the instances of the 'Alphas' for the ID/Category/Site combos.
So, for this dataset, my intended result would look like this:
ID | Category | Site | Time Difference | # of instances |
---|---|---|---|---|
1 | A | X | 1 | 2 |
2 | A | Z | 1 | 2 |
I know how to get the counts and time difference if 'Type' is not considered:
# convert the "Task Completed" column to datetime:
df["Task Completed"] = pd.to_datetime(df["Task Completed"], dayfirst=False)
x = df.groupby(["ID", "Category", "Site"], as_index=False).agg(
**{
"Time Difference": (
"Task Completed",
lambda x: (x.max() - x.min()).days,
),
"# of instances": ("Task Completed", "count"),
}
)
print(x)
Which prints
ID Category Site Time Difference # of instances
0 1 A X 2 4
1 1 B Y 0 1
2 2 A Z 1 2
But I can't figure out how to consider 'type' as well.
Thank you for your help
CodePudding user response:
aaa = []
def my_func(x):
ind = x['Type'] == 'Alpha'
alf = x.loc[ind, 'Type'].count()
if alf >= 2 and len(x[x['Type'] != 'Alpha']) != 0:
difference = (x.loc[ind, 'Task Completed'].max() - x.loc[ind, 'Task Completed'].min()).days
bbb = x[ind][['ID', 'Category', 'Site']].values[0]
bbb = np.insert(bbb, 3, [difference, alf])
aaa.append(bbb)
df.groupby(["ID", "Category", "Site"], as_index=False).apply(my_func)
df1 = pd.DataFrame(aaa, columns=['ID', 'Category', 'Site', 'Time Difference', '# of instances'])
print(df1)
Output
ID Category Site Time Difference # of instances
0 1 A X 1 2
1 2 A Z 1 2
Made a function that I passed when grouping. The variable alf counts how many lines with 'Alpha' . If there are two or more rows with 'Alpha' and at least one row is not 'Alpha', then we read the data. The result is stored in the aaa list, which is inserted into a new dataframe.