Efficiently calculate anomaly detection


I have a problem and I hope you can help me. Thanks!

I have a table that looks like this:

Computer  Date        Count
A         01/01/2021  43
A         02/01/2021  64
A         03/01/2021  333
A         04/01/2021  656
B         01/01/2021  41
B         02/01/2021  436
B         03/01/2021  745
B         04/01/2021  234

I would like to run the Isolation Forest algorithm on only part of the table at a time, once per Computer.

I don't want to do it manually, like df[df['Computer'] == 'A']['Count'], for every Computer; there are around 500 different Computers. So I don't want to do this:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest

# scale the Count values for one Computer at a time
scaler = StandardScaler()
np_scaled = scaler.fit_transform(df[df['Computer'] == 'A']['Count'].values.reshape(-1, 1))
data = pd.DataFrame(np_scaled)

# train isolation forest
model = IsolationForest(contamination=0.01)
model.fit(data)

# write the labels back to the matching rows only
df.loc[df['Computer'] == 'A', 'anomaly'] = model.predict(data)

500 times (for A, B, C, and more). Is there a way to do it efficiently? Thanks!

As a result, it should look like this, but each time the anomaly check runs only on A separately, B separately, and so on:

Computer  Date        Count  anomaly
A         01/01/2021  43     1
A         02/01/2021  64     1
A         03/01/2021  333    1
A         04/01/2021  656    -1
B         01/01/2021  41     1
B         02/01/2021  436    1
B         03/01/2021  745    1
B         04/01/2021  234    1

CodePudding user response:

You could group by Computer and use transform to run the function you already have on each group; transform returns a result aligned with the original index, so it can be assigned straight to the anomaly column.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest

def train_isolation_group(group_count):
    # scale this group's Count values
    scaler = StandardScaler()
    np_scaled = scaler.fit_transform(group_count.values.reshape(-1, 1))
    data = pd.DataFrame(np_scaled)

    # train an isolation forest on this group's data only
    model = IsolationForest(contamination=0.01)
    model.fit(data)

    # one label per row in the group: 1 = normal, -1 = anomaly
    return model.predict(data)

# transform keeps the original row order, so the labels line up with df
df['anomaly'] = df.groupby('Computer')['Count'].transform(train_isolation_group)
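
As a quick check, here is a minimal sketch that rebuilds the sample table from the question and applies the grouped transform defined above (pandas and scikit-learn assumed installed; the exact 1/-1 labels can vary, since IsolationForest is randomized unless you fix random_state):

import pandas as pd

# rebuild the sample table from the question
df = pd.DataFrame({
    'Computer': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Date': ['01/01/2021', '02/01/2021', '03/01/2021', '04/01/2021'] * 2,
    'Count': [43, 64, 333, 656, 41, 436, 745, 234],
})

# per-Computer anomaly labels, aligned back to the original rows
df['anomaly'] = df.groupby('Computer')['Count'].transform(train_isolation_group)
print(df)

This runs one model per Computer but avoids writing the filtering loop yourself, which is what makes it practical for 500 groups.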