I have a problem and I hope you can help me, thanks!
I have a table that looks like this:
Computer | Data | Count |
---|---|---|
A | 01/01/2021 | 43 |
A | 02/01/2021 | 64 |
A | 03/01/2021 | 333 |
A | 04/01/2021 | 656 |
B | 01/01/2021 | 41 |
B | 02/01/2021 | 436 |
B | 03/01/2021 | 745 |
B | 04/01/2021 | 234 |
I would like to run the Isolation Forest algorithm on only part of the table at a time.
I don't want to do it manually, like `df[df['Computer'] == 'A']['Count']`, for every computer; there are around 500 different computers. So I don't want to repeat this:
```python
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
np_scaled = scaler.fit_transform(df[df['Computer'] == 'A']['Count'].values.reshape(-1, 1))
data = pd.DataFrame(np_scaled)
# train isolation forest
model = IsolationForest(contamination=0.01)
model.fit(data)
df['anomaly'] = model.predict(data)
```
500 times (for A, B, C, and more). Is there a way to do it efficiently? Thanks!
As a result it should look like this, but each time the anomaly check should run on A separately, B separately, and so on:
Computer | Data | Count | anomaly |
---|---|---|---|
A | 01/01/2021 | 43 | 1 |
A | 02/01/2021 | 64 | 1 |
A | 03/01/2021 | 333 | 1 |
A | 04/01/2021 | 656 | -1 |
B | 01/01/2021 | 41 | 1 |
B | 02/01/2021 | 436 | 1 |
B | 03/01/2021 | 745 | 1 |
B | 04/01/2021 | 234 | 1 |
CodePudding user response:
You could group by `Computer` and use `transform` to execute the function you already have over each group; `transform` returns results aligned with the original DataFrame's index, so they can be assigned directly to the `anomaly` column.
```python
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

def train_isolation_group(group_count):
    # Scale this group's counts before fitting.
    scaler = StandardScaler()
    np_scaled = scaler.fit_transform(group_count.values.reshape(-1, 1))
    data = pd.DataFrame(np_scaled)
    # Train an isolation forest on this group only.
    model = IsolationForest(contamination=0.01)
    model.fit(data)
    # Returning an array of the group's length lets transform
    # align the predictions back to the original rows.
    return model.predict(data)

df['anomaly'] = df.groupby('Computer')['Count'].transform(train_isolation_group)
```
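To see how the grouped `transform` aligns per-group results back to the original rows, here is a minimal runnable sketch on the sample data from the question. A simple per-group z-score rule (with a hypothetical cutoff of 1.3 standard deviations) stands in for the Isolation Forest so the example runs without scikit-learn; the real version would fit a model per group inside the function instead.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Computer": ["A"] * 4 + ["B"] * 4,
    "Data": ["01/01/2021", "02/01/2021", "03/01/2021", "04/01/2021"] * 2,
    "Count": [43, 64, 333, 656, 41, 436, 745, 234],
})

# Stand-in for train_isolation_group: flag values more than 1.3
# standard deviations from the group's own mean as -1 (anomaly).
def flag_outliers(group_count):
    z = (group_count - group_count.mean()) / group_count.std()
    return z.abs().gt(1.3).map({True: -1, False: 1})

# transform runs flag_outliers once per Computer and stitches the
# per-group results back together in the original row order.
df["anomaly"] = df.groupby("Computer")["Count"].transform(flag_outliers)
print(df)
```

With this cutoff, only A's count of 656 is flagged as `-1`; every row keeps its place because `transform` preserves the original index.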