Most Cyber Attacks Are Encrypted Now. Training Data Wasn't.

Name: HIKARI-2021: Generating Network Intrusion Detection Dataset Based on Real and Encrypted Synthetic Attack Traffic
Creator: Ferriyan, A., Thamrin, A. H., Takeda, K.
Published: 2022
License: CC BY 4.0
Keywords: intrusion detection, network security, encrypted traffic, machine learning, cybersecurity, IDS

HIKARI-2021 bridges the critical gap between lab-grade intrusion detection datasets and the encrypted reality of modern network attacks.

Ferriyan, A., Thamrin, A. H., Takeda, K. | 2022 | 21,219 downloads

95% of web traffic now encrypted (+12% since 2020)

29K researcher views

8 attack categories covered

2.4M network flow records

Why your firewall was trained on traffic that no longer exists

Here's the dirty secret of cybersecurity AI: most intrusion detection systems are trained on unencrypted network traffic. But in 2024, over 95% of web traffic is encrypted. The models powering your organization's defenses learned to spot threats in a world that no longer exists. HIKARI-2021 was built to fix this fundamental mismatch.

Researchers at a Japanese university captured real network traffic — the mundane hum of actual internet use — then layered in synthetic attacks using encrypted channels. The result is a dataset that mirrors what security systems actually face: threats hiding inside the same TLS tunnels as legitimate traffic. Brute force attacks, web exploits, and backdoor communications, all encrypted.

The dataset has been downloaded over 21,000 times by security teams and researchers across 80+ countries. Its impact extends beyond academia — commercial IDS vendors have used HIKARI-2021 to benchmark their detection engines against encrypted threats, revealing blind spots that older datasets simply couldn't expose.

[Figure: Attack traffic distribution by category. Breakdown of malicious traffic types in the HIKARI-2021 dataset.]

[Figure: Detection accuracy, encrypted vs unencrypted. ML model performance drops significantly when traffic is encrypted.]

Traditional IDS datasets like CICIDS and NSL-KDD use only unencrypted traffic, making them increasingly irrelevant to modern networks

Detection accuracy drops 12-18 percentage points when models trained on plaintext are tested against encrypted attacks

The dataset includes both real benign traffic and synthetic attacks, avoiding the artificial patterns that plague fully synthetic datasets
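The 12-18 point figure above is a gap between two accuracy scores, not a relative percentage. As a rough sketch of how such a gap is measured, the snippet below scores one model's predictions on a plaintext test set and on an encrypted test set, then subtracts. The function names and the toy labels are assumptions made up for illustration; they are not taken from HIKARI-2021 or from any published evaluation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("labels and predictions must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty test set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def accuracy_drop_pp(plain_true, plain_pred, enc_true, enc_pred):
    """Accuracy lost, in percentage points, from plaintext to encrypted test traffic."""
    plain_acc = accuracy(plain_true, plain_pred)
    enc_acc = accuracy(enc_true, enc_pred)
    return 100.0 * (plain_acc - enc_acc)


# Toy labels (1 = attack, 0 = benign), invented purely for illustration.
plain_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
plain_pred = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # 9 of 10 correct
enc_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
enc_pred = [0, 0, 1, 0, 0, 0, 1, 0, 1, 1]    # 7 of 10 correct

drop = accuracy_drop_pp(plain_true, plain_pred, enc_true, enc_pred)
print(f"Accuracy drop: {drop:.1f} percentage points")
```

In this toy case accuracy falls from 90% to 70%, which is a 20 percentage point drop but roughly a 22% relative drop. Keeping the two units apart matters when comparing results across papers.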

🛡️

Defense Innovation

HIKARI-2021 supports a new generation of IDS models that can detect threats without decrypting traffic, preserving privacy while maintaining security. This is the direction in which enterprise security must move.
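To make "without decrypting traffic" concrete, here is a minimal sketch of the kind of flow-level summary a model can be fed when payloads are opaque. It uses only packet metadata (timestamps, sizes, direction) and never touches payload bytes. The `Packet` type, the `flow_features` function, and the feature names are hypothetical and chosen for illustration; they are not the column names used in HIKARI-2021.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Packet:
    """Metadata visible on the wire even when the payload is encrypted."""
    timestamp: float  # seconds since the start of the capture
    size: int         # bytes on the wire
    outbound: bool    # True if sent by the client


def flow_features(packets):
    """Summarise one flow using only packet metadata, never payload bytes."""
    if not packets:
        raise ValueError("a flow needs at least one packet")
    ordered = sorted(packets, key=lambda p: p.timestamp)
    # Gaps between consecutive packets; empty for a single-packet flow.
    gaps = [b.timestamp - a.timestamp for a, b in zip(ordered, ordered[1:])]
    return {
        "duration_s": ordered[-1].timestamp - ordered[0].timestamp,
        "packet_count": len(ordered),
        "bytes_out": sum(p.size for p in ordered if p.outbound),
        "bytes_in": sum(p.size for p in ordered if not p.outbound),
        "mean_packet_size": mean(p.size for p in ordered),
        "mean_interarrival_s": mean(gaps) if gaps else 0.0,
    }


# A made-up three-packet flow: request, large response, short follow-up.
flow = [
    Packet(timestamp=0.0, size=100, outbound=True),
    Packet(timestamp=0.5, size=1500, outbound=False),
    Packet(timestamp=1.5, size=200, outbound=True),
]
features = flow_features(flow)
print(features)
```

A dictionary like this would become one row of a training table, with the label (benign or an attack category) supplied separately. Which features actually separate encrypted attacks from benign traffic is an empirical question that this sketch does not answer.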

🎓

Research Impact

By providing a realistic encrypted traffic benchmark, the dataset allows fair comparison between detection approaches. Papers using HIKARI-2021 are producing more reproducible, real-world-relevant results.

⚖️

Privacy Balance

As regulations like GDPR make deep packet inspection legally fraught, security systems must learn to work with encrypted flows. This dataset supports training models for that constrained reality.