Home > Blockchain >  Filter groupby by combos with either category types
Filter groupby by combos with either category types

Time:10-25

I have a dataset that looks something like this:

df = pd.DataFrame(
[[1,'A','X','1/2/22 12:00:00AM', 'Alpha'], 
[1,'A','X','1/3/22 12:00:00AM', 'Alpha'], 
[1,'A','X','1/1/22 12:00:00AM', 'Beta'], 
[1,'A','X','1/2/22 1:00:00AM', 'Gamma'], 
[1,'B','Y','1/1/22 1:00:00AM', 'Alpha'],
[2,'A','Z','1/2/22 12:00:00AM', 'Alpha'],
[2,'A','Z','1/1/22 12:00:00AM', 'Alpha'], 
[2,'A','Z','1/1/22 12:00:00AM', 'Beta'], 
columns=['ID', 'Category', 'Site', 'Task Completed', 'Type'])
ID Category Site Task Completed Type
1 A X 1/2/22 12:00:00AM Alpha
1 A X 1/3/22 12:00:00AM Alpha
1 A X 1/1/22 12:00:00AM Beta
1 A X 1/2/22 1:00:00AM Gamma
1 B Y 1/1/22 1:00:00AM Alpha
2 A Z 1/2/22 12:00:00AM Alpha
2 A Z 1/1/22 12:00:00AM Alpha
2 A Z 1/1/22 12:00:00AM Beta

I want to find the Max - Min Task Completed date for all ID/Category/Site combos with type 'Alpha'. For the combo to be counted, they also need to have at least one other type besides 'Alpha'. I also want to count the instances of the 'Alphas' for the ID/Category/Site combos.

So, for this dataset, my intended result would look like this:

ID Category Site Time Difference # of instances
1 A X 1 2
2 A Z 1 2

I know how to get the counts and time difference if 'Type' is not considered:

# convert the "Task Completed" column to datetime:
df["Task Completed"] = pd.to_datetime(df["Task Completed"], dayfirst=False)


x = df.groupby(["ID", "Category", "Site"], as_index=False).agg(
    **{
        "Time Difference": (
            "Task Completed",
            lambda x: (x.max() - x.min()).days,
        ),
        "# of instances": ("Task Completed", "count"),
    }
)

print(x)

Which prints

   ID Category Site  Time Difference  # of instances
0   1        A    X                2               4
1   1        B    Y                0               1
2   2        A    Z                1               2

But I can't figure out how to consider 'type' as well.

Thank you for your help

CodePudding user response:

aaa = []

def my_func(x):
    ind = x['Type'] == 'Alpha'
    alf = x.loc[ind, 'Type'].count()
    if alf >= 2 and len(x[x['Type'] != 'Alpha']) != 0:
        difference = (x.loc[ind, 'Task Completed'].max() - x.loc[ind, 'Task Completed'].min()).days
        bbb = x[ind][['ID', 'Category', 'Site']].values[0]
        bbb = np.insert(bbb, 3, [difference, alf])
        aaa.append(bbb)



df.groupby(["ID", "Category", "Site"], as_index=False).apply(my_func)

df1 = pd.DataFrame(aaa, columns=['ID', 'Category', 'Site', 'Time Difference', '# of instances'])

print(df1)

Output

   ID Category Site  Time Difference  # of instances
0   1        A    X                1               2
1   2        A    Z                1               2

Made a function that I passed when grouping. The variable alf counts how many lines with 'Alpha' . If there are two or more rows with 'Alpha' and at least one row is not 'Alpha', then we read the data. The result is stored in the aaa list, which is inserted into a new dataframe.

  • Related