ML4LM — Mastering Anomaly Detection in Production: When to use and when not to use

Hoyath
3 min read · Nov 7, 2024


Problem: Detecting Anomalies in Data

In the world of data analysis, finding anomalies, the points that differ significantly from the rest of the data, can be crucial. These outliers might indicate fraud in financial data, faults in manufacturing, or simply unexpected behavior in a system. The task is to identify these unusual points effectively and efficiently.

To understand this, imagine a group of friends with ages 20, 27, 29, 23, 67, 22, 29. If you wanted to split them into groups by age, one person (67) would stand out as unusual. A few simple splits on age would isolate that friend much sooner than the others.

Solution Idea: Random Decision Splits

Isolation Forests work by isolating anomalies quickly through a series of random decisions. Each “decision” is a split that divides the data based on randomly selected feature values. Let’s see how it would work with the group of friends example:

1. First, split at an age threshold of 25. This creates two groups: those younger than 25 (20, 23, 22) and those 25 or older (27, 29, 67, 29).
2. Next, split the older group at 35, and 67 gets isolated immediately.

This approach generalizes across the dataset by constructing many random trees with different splits. Anomalies, being few and distinct, tend to get isolated faster than normal points, leading to shorter paths on average in the trees.
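
As a minimal sketch of this idea, here is scikit-learn's IsolationForest run on the ages from the example above (the contamination value is an assumption, chosen to flag roughly one point in seven):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Ages from the friends example, as an (n_samples, 1) feature matrix.
ages = np.array([20, 27, 29, 23, 67, 22, 29]).reshape(-1, 1)

# Build a forest of random trees; contamination is the expected share
# of anomalies (assumed to be ~15% here, i.e. about one point).
forest = IsolationForest(n_estimators=100, contamination=0.15, random_state=42)
labels = forest.fit_predict(ages)  # -1 = anomaly, 1 = normal

for age, label in zip(ages.ravel(), labels):
    print(age, "anomaly" if label == -1 else "normal")
```

With these settings, 67 should be the point the forest isolates fastest and flags.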

The Methodology: Isolation Forest

To implement this, we create a forest of random trees. Each tree randomly splits features to create partitions until each data point is isolated. Points that reach isolation quickly, having shorter paths on average across the forest, are flagged as anomalies. In practice, if a new data point arrives, we can quickly determine whether it’s anomalous by running it through the forest and measuring its average path length.
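
Continuing the sketch above, scoring a new arrival is one pass through the trained forest. In scikit-learn, score_samples returns a score derived from the average path length, so lower (more negative) values mean more anomalous:

```python
# Continuing from the previous snippet: score unseen points.
# Shorter average path across the trees -> lower score -> more anomalous.
print(forest.score_samples([[70]]))  # should be low (anomalous)
print(forest.score_samples([[25]]))  # should be higher (normal)
```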

Setting the Threshold: Deciding the threshold for anomaly detection is crucial, and it is typically set from the expected proportion of anomalies. For instance, if you expect 10% of data points to be anomalous, set the threshold so that the lowest-scoring 10% of points are flagged.
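
As a hedged illustration, such a threshold can be set by hand from the scores; flagging the lowest-scoring 10% mirrors what scikit-learn's contamination parameter does internally:

```python
# Continuing from the earlier snippets.
scores = forest.score_samples(ages)

# Flag the lowest-scoring 10% of points as anomalies.
threshold = np.percentile(scores, 10)
print(ages[scores <= threshold].ravel())  # 67 should be flagged
```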

Why Use Isolation Forest?

Isolation Forests offer a unique blend of benefits that make them effective in certain scenarios:

No Labels Required: Since Isolation Forests work by splitting based on feature distribution rather than classifying known labels, they are ideal for unsupervised anomaly detection.
No Specific Feature Dependency: If you don’t have well-defined features to identify outliers, Isolation Forests can still work effectively by leveraging random splits.
Speed: Isolation Forests are fast in both training and inference, since trees are built from random splits on small subsamples, with no distance or density computations.

When Not to Use Isolation Forest

Isolation Forests have limitations and may not be the best choice in some situations:

Highly Correlated Features: When features are strongly correlated, the axis-aligned random splits may fail to isolate outliers that only stand out in a combination of features, reducing accuracy.
Rare Anomalies: If anomalies are extremely rare (say 0.01% of the data), an Isolation Forest may miss them without more aggressive tuning, for example lowering the contamination parameter or increasing the number of trees and the subsample size.
Continuous Training: Isolation Forests work best in a batch setting, not a streaming one, because trees need to be rebuilt as new data arrives.
Numerous Outliers: When outliers make up a large share of the data, they are no longer "few and different", so the assumption that anomalies isolate quickly breaks down and individual outliers may not stand out.

Online Training: Why It’s Challenging in Isolation Forests

Unlike density-based clustering methods such as DBSCAN, Isolation Forests are not well suited to online learning. Each new data point can require parts of the trees to be rebuilt, which is slow and computationally expensive, especially with large datasets. Isolation Forests also struggle with concept drift (a change in the data distribution over time), which often forces a rebuild of the entire forest.

In contrast, a density-based method like DBSCAN can handle new data more gracefully: each arriving point can join an existing cluster or seed a new one without retraining a global model (see the sketch below).
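
To make that contrast concrete, here is a minimal conceptual sketch, not scikit-learn's DBSCAN (which is itself batch-oriented): the hypothetical assign_online helper and the eps value are assumptions, illustrating how a new point can join an existing cluster if it falls within eps of a clustered point.

```python
import numpy as np

def assign_online(new_point, clustered_points, cluster_ids, eps=2.0):
    """Assign a new point to the nearest existing cluster within eps,
    or return None to mark it as a potential outlier / new cluster seed."""
    dists = np.linalg.norm(clustered_points - new_point, axis=1)
    nearest = np.argmin(dists)
    return cluster_ids[nearest] if dists[nearest] <= eps else None

points = np.array([[20.0], [22.0], [23.0], [27.0], [29.0]])  # clustered ages
ids = np.array([0, 0, 0, 1, 1])                              # their cluster labels
print(assign_online(np.array([21.0]), points, ids))  # joins cluster 0
print(assign_online(np.array([67.0]), points, ids))  # None: no nearby cluster
```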

Summary

Isolation Forests provide a simple yet effective way to detect anomalies without needing labeled data or complex feature engineering. However, they work best in stable data environments where retraining isn’t frequent, and they may struggle with highly correlated features or extremely rare anomalies. If continuous or online anomaly detection is required, consider DBSCAN or similar methods, which adapt more flexibly to new data.
