How can I update a trained IsolationForest model with new datasets/dataframes in Python?


Let's say I fit the IsolationForest() algorithm from scikit-learn on a time-series-based Dataset1 (dataframe df1) and save the model using the methods mentioned here & here. Now I want to update my model for a new Dataset2 (df2).

My findings:

...learn incrementally from a mini-batch of instances (sometimes called “online learning”) is key to out-of-core learning as it guarantees that at any given time, there will be only a small amount of instances in the main memory. Choosing a good size for the mini-batch that balances relevancy and memory footprint could involve tuning.

Sadly, the IsolationForest algorithm doesn't support estimator.partial_fit(newdf).
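This is easy to confirm: the estimator simply does not define the method.

```python
from sklearn.ensemble import IsolationForest

# IsolationForest exposes fit()/predict(), but no incremental API
print(hasattr(IsolationForest(), "partial_fit"))  # False
```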

  • auto-sklearn offers refit(), but it is also not suitable for my case, based on this post.

How can I update the IF model that was trained on Dataset1 and saved, using a new Dataset2?

CodePudding user response:

You can simply reuse the .fit() call available to the estimator on the new data.

This would be preferred, especially in a time series, as the signal changes and you do not want older, non-representative data to be understood as potentially normal (or anomalous).

If old data is important, you can simply join the older training data and newer input signal data together, and then call .fit() again.

Also, as a side note: according to the scikit-learn documentation, it is better to use joblib than pickle for persisting models.

An MRE with resources below:

# Model
from sklearn.ensemble import IsolationForest

# Saving file
import joblib

# Data
import numpy as np

# Create a new model
model = IsolationForest()

# Generate some old data
df1 = np.random.randint(1,100,(100,10))
# Train the model
model.fit(df1)

# Save it off
joblib.dump(model, 'isf_model.joblib')

# Load the model
model = joblib.load('isf_model.joblib')

# Generate new data
df2 = np.random.randint(1,500,(1000,10))

# If the original data is now not important, I can just call .fit() again.
# If you are using time-series based data, this is preferred, as older data may not be representative of the current state
model.fit(df2)

# If the original data is important, I can simply join the old data to new data. There are multiple options for this:
# Pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
# Numpy: https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html

combined_data = np.concatenate((df1, df2))
model.fit(combined_data)
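If you want a middle ground between "discard all old data" and "keep everything", one common pattern for time series is a sliding-window retrain: keep only the most recent N rows when refitting, so memory stays bounded and stale data ages out. A minimal sketch (the window_size value here is illustrative, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
df1 = rng.integers(1, 100, (100, 10))   # old data
df2 = rng.integers(1, 500, (1000, 10))  # new data

# Keep only the most recent window_size rows across old + new data
window_size = 500
window = np.concatenate((df1, df2))[-window_size:]

model = IsolationForest(random_state=0)
model.fit(window)

preds = model.predict(window[:5])  # 1 = inlier, -1 = outlier
```

Choosing window_size is the "relevancy vs. memory footprint" trade-off mentioned in the quote above, and usually needs tuning against your data's rate of drift.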