## Why I built this
Most signature-based IDS tools are reactive: they can only catch attacks they have already seen. I wanted to explore whether a combination of statistical baselines and machine learning could flag novel anomalies without relying on any known signatures.
## The approach
Aegis uses a two-layer detection strategy:
- Statistical baseline — rolling entropy, inter-arrival times, byte rate
- ML classifier — trained on labelled PCAP data using scikit-learn
The idea is that the statistical layer catches obvious outliers cheaply, and the ML layer handles subtler patterns.
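The two-layer decision can be sketched roughly as follows. The function name, the z-score input, and the threshold value are illustrative assumptions, not Aegis's actual API:

```python
def classify_window(features, z_score, clf, z_threshold=3.0):
    """Two-layer check: cheap statistical test first, ML classifier second.

    Hypothetical sketch -- `z_score` stands in for the statistical
    baseline's deviation measure, and `clf` is any fitted classifier
    exposing a scikit-learn-style predict().
    """
    # Layer 1: statistical baseline -- catches obvious outliers cheaply
    if abs(z_score) > z_threshold:
        return "anomaly"
    # Layer 2: ML classifier -- handles subtler patterns
    return "anomaly" if clf.predict([features])[0] == 1 else "normal"
```

The ordering matters: most traffic is dispatched by the cheap statistical test, so the classifier only runs on windows that pass it.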
## Processing the data
The biggest challenge was feature extraction from raw .pcap files. I used Scapy to parse packets and NumPy for sliding-window computation:
```python
import numpy as np

def extract_features(packets, window=100):
    """Slide a fixed-size window over the capture, one feature row per position."""
    features = []
    for i in range(len(packets) - window):
        window_pkts = packets[i : i + window]
        entropy = compute_entropy(window_pkts)
        # Mean packet length across the window
        byte_rate = sum(len(p) for p in window_pkts) / window
        features.append([entropy, byte_rate])  # ... remaining features elided
    return np.array(features)
```
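The snippet calls a `compute_entropy` helper it doesn't define. A minimal Shannon-entropy version over the raw bytes of the window might look like this; the choice to serialize whole packets with `bytes(p)` (which works for Scapy packets) is an assumption, not necessarily what Aegis does:

```python
import math
from collections import Counter

def compute_entropy(window_pkts):
    """Shannon entropy in bits per byte over a window of packets.

    Sketch only: assumes each packet serializes via bytes(p).
    Returns 0.0 for an empty window.
    """
    data = b"".join(bytes(p) for p in window_pkts)
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # H = -sum p_i * log2(p_i), ranges from 0 (constant) to 8 (uniform bytes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

High-entropy windows (near 8 bits/byte) often indicate encrypted or compressed payloads, which is what makes this a useful baseline feature.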
## Results
Tested against a 50MB PCAP dataset with injected anomalies:
| Method | Precision | Recall | F1 |
|---|---|---|---|
| Statistical only | 0.71 | 0.84 | 0.77 |
| ML only | 0.83 | 0.79 | 0.81 |
| Hybrid (Aegis) | 0.91 | 0.88 | 0.89 |
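Since F1 is the harmonic mean of precision and recall, the table's numbers can be sanity-checked directly:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Recomputing F1 from the table's precision/recall columns
for name, p, r in [("Statistical only", 0.71, 0.84),
                   ("ML only", 0.83, 0.79),
                   ("Hybrid (Aegis)", 0.91, 0.88)]:
    print(f"{name}: F1 = {f1(p, r):.2f}")
```

All three recomputed values match the reported column (0.77, 0.81, 0.89).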
## What I'd do differently
- Use a streaming architecture instead of batch processing
- Explore deep learning (autoencoders) for anomaly scoring
- Add a proper alert pipeline instead of just logging