## Why I built this
Most signature-based IDS tools are reactive: they can only catch attacks they have already seen. I wanted to explore whether a combination of statistical baselines and machine learning could flag novel anomalies without relying on any known signatures.
## The approach
Aegis uses a two-layer detection strategy:
- Statistical baseline — rolling entropy, inter-arrival times, byte rate
- ML classifier — trained on labelled PCAP data using scikit-learn
The idea is that the statistical layer catches obvious outliers cheaply, and the ML layer handles subtler patterns.
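The two-layer decision can be sketched roughly as follows. The function name, the z-score input, and the threshold value are illustrative assumptions, not Aegis's actual API:

```python
def classify_window(features, z_score, clf, z_threshold=3.0):
    """Two-layer check: cheap statistical test first, ML classifier second.

    Hypothetical sketch -- `z_score` stands in for the statistical
    baseline's deviation measure, and `clf` is any fitted classifier
    exposing a scikit-learn-style predict().
    """
    # Layer 1: statistical baseline -- catches obvious outliers cheaply
    if abs(z_score) > z_threshold:
        return "anomaly"
    # Layer 2: ML classifier -- handles subtler patterns
    return "anomaly" if clf.predict([features])[0] == 1 else "normal"
```

The ordering matters: most traffic is dispatched by the cheap statistical test, so the classifier only runs on windows that pass it.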
## Processing the data
The biggest challenge was feature extraction from raw .pcap files. I used Scapy to parse packets and NumPy for sliding-window computation:
```python
import numpy as np

def extract_features(packets, window=100):
    """Slide a fixed-size window over the capture, one feature row per position."""
    features = []
    for i in range(len(packets) - window):
        window_pkts = packets[i : i + window]
        entropy = compute_entropy(window_pkts)
        # Mean packet length across the window
        byte_rate = sum(len(p) for p in window_pkts) / window
        features.append([entropy, byte_rate])  # ... remaining features elided
    return np.array(features)
```
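The snippet calls a `compute_entropy` helper it doesn't define. A minimal Shannon-entropy version over the raw bytes of the window might look like this; the choice to serialize whole packets with `bytes(p)` (which works for Scapy packets) is an assumption, not necessarily what Aegis does:

```python
import math
from collections import Counter

def compute_entropy(window_pkts):
    """Shannon entropy in bits per byte over a window of packets.

    Sketch only: assumes each packet serializes via bytes(p).
    Returns 0.0 for an empty window.
    """
    data = b"".join(bytes(p) for p in window_pkts)
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # H = -sum p_i * log2(p_i), ranges from 0 (constant) to 8 (uniform bytes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

High-entropy windows (near 8 bits/byte) often indicate encrypted or compressed payloads, which is what makes this a useful baseline feature.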
## Results
Tested against a 50MB PCAP dataset with injected anomalies:
| Method | Precision | Recall | F1 |
|---|---|---|---|
| Statistical only | 0.71 | 0.84 | 0.77 |
| ML only | 0.83 | 0.79 | 0.81 |
| Hybrid (Aegis) | 0.91 | 0.88 | 0.89 |
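Since F1 is the harmonic mean of precision and recall, the table's numbers can be sanity-checked directly:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Recomputing F1 from the table's precision/recall columns
for name, p, r in [("Statistical only", 0.71, 0.84),
                   ("ML only", 0.83, 0.79),
                   ("Hybrid (Aegis)", 0.91, 0.88)]:
    print(f"{name}: F1 = {f1(p, r):.2f}")
```

All three recomputed values match the reported column (0.77, 0.81, 0.89).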
## What I'd do differently
- Use a streaming architecture instead of batch processing
- Explore deep learning (autoencoders) for anomaly scoring
- Add a proper alert pipeline instead of just logging