Teaching Machines to Tell Malware Families Apart

Name: A malware dataset
Creator: Catak, F. O., Yazi, A. F.
Published: 2019
License: CC BY 4.0
Keywords: malware, classification, machine learning, cybersecurity, threat detection

A curated collection of malware behavioral profiles that became the training ground for thousands of threat detection models worldwide.

Catak, F. O., Yazi, A. F. · 2019 · DOI: 10.5281/zenodo.3293593 · CC BY 4.0

Trojan

Ransomware

Backdoor

Worm

Spyware

Adware

Dropper

Virus

Rootkit

What malware does matters more than what it looks like

Traditional antivirus software works like a wanted poster — it matches files against known signatures. But modern malware mutates faster than signatures can be written. The solution: watch what programs do, not what they look like. This dataset captures the behavioral DNA of malware — the API calls, system interactions, and execution patterns that betray malicious intent.
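To make the contrast concrete, here is a minimal sketch, not taken from the dataset or its authors. It assumes a "signature" is simply a SHA-256 hash of the file bytes and a "behavior" is the set of API calls seen at run time. The byte strings, the traces, and the helper names `file_signature` and `behavior_profile` are all invented for illustration.

```python
import hashlib


def file_signature(content: bytes) -> str:
    """Static signature: a SHA-256 digest of the raw file bytes."""
    return hashlib.sha256(content).hexdigest()


def behavior_profile(api_calls: list[str]) -> frozenset[str]:
    """Behavioral fingerprint: the set of API calls observed at run time."""
    return frozenset(api_calls)


# Two made-up variants of one sample. The second carries a few junk bytes,
# as a mutating packer might add, but is assumed to behave identically.
variant_a = b"MZ\x90\x00payload"
variant_b = b"MZ\x90\x00payload\x00\x01junk"
trace_a = ["CreateFileW", "WriteFile", "RegSetValueExW"]
trace_b = ["CreateFileW", "WriteFile", "RegSetValueExW"]

# The defender has only ever seen variant A.
known_signatures = {file_signature(variant_a)}
known_behaviors = {behavior_profile(trace_a)}

signature_hit = file_signature(variant_b) in known_signatures
behavior_hit = behavior_profile(trace_b) in known_behaviors
print(signature_hit, behavior_hit)  # False True
```

The mutated file slips past the hash lookup, while the behavioral lookup still matches. Real detectors use far richer features than an exact set match; the point here is only that changing a file's bytes does not change what it does.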

Researchers at two Turkish universities built something deceptively simple: they ran thousands of malware samples in sandboxed environments, recorded every system call each one made, and organized the results by malware family. Trojans behave differently from ransomware. Worms act differently from spyware. These behavioral fingerprints are the labeled examples a machine learning model needs in order to learn to tell one family from another.
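A rough sketch of how a model might learn from traces like these, under stated assumptions: each sample is a list of call names with a family label, features are counts of adjacent call pairs (bigrams), and an unknown trace is assigned to the family whose summed profile is closest by cosine similarity. This is a toy nearest-profile classifier, not the method used by the dataset's authors, and the four training traces below are invented rather than drawn from the dataset.

```python
import math
from collections import Counter


def bigrams(calls):
    """Count adjacent pairs of calls in one execution trace."""
    return Counter(zip(calls, calls[1:]))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled_traces):
    """Sum bigram counts per family to get one profile per family."""
    profiles = {}
    for family, calls in labelled_traces:
        profiles.setdefault(family, Counter()).update(bigrams(calls))
    return profiles


def classify(profiles, calls):
    """Return the family whose profile is most similar to the trace."""
    features = bigrams(calls)
    return max(profiles, key=lambda fam: cosine(profiles[fam], features))


# Invented traces for illustration only.
training = [
    ("Ransomware", ["FindFirstFileW", "ReadFile", "CryptEncrypt", "WriteFile", "DeleteFileW"]),
    ("Ransomware", ["FindNextFileW", "ReadFile", "CryptEncrypt", "WriteFile"]),
    ("Spyware", ["GetAsyncKeyState", "GetForegroundWindow", "WriteFile", "InternetOpenW", "HttpSendRequestW"]),
    ("Spyware", ["GetAsyncKeyState", "GetForegroundWindow", "WriteFile"]),
]

profiles = train(training)
unknown = ["FindFirstFileW", "ReadFile", "CryptEncrypt", "WriteFile"]
predicted = classify(profiles, unknown)
print(predicted)  # Ransomware
```

Bigrams are used instead of single calls because order carries signal: `WriteFile` appears in both families, but `CryptEncrypt` followed by `WriteFile` appears only in the ransomware traces. A practical model would need many more samples, a held-out test set, and a stronger learner.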

With nearly 14,000 downloads, the dataset has become a standard benchmark in malware classification research. It bridges a critical gap: most organizations can't legally share malware samples, so open datasets like this one are often the only practical way for independent researchers to train and compare detection models.

[Figure: Samples per malware family. Distribution of behavioral profiles across 9 malware categories.]

We stopped looking at what malware looks like and started watching what it does. That changed everything.

Behavioral analysis catches polymorphic malware that signature-based detection misses entirely

Trojans account for nearly a quarter of all samples, reflecting their dominance in real-world threats

Open malware datasets are rare — legal and ethical barriers prevent most organizations from sharing samples publicly

🔒

Enterprise Security

Models trained on this dataset can classify unknown threats by behavior alone — no signature updates needed. This shifts the economics of defense from reactive to proactive.

🎓

Academic Access

Most malware research requires access to live samples, which universities can't easily obtain. This dataset democratizes threat research, enabling work that was previously restricted to industry labs.

🌐

Global Threat Landscape

As malware-as-a-service lowers the barrier to creating threats, behavioral classification becomes essential. The families in this dataset represent the building blocks of most modern cyberattacks.