SynthCTI: LLM-Driven Synthetic CTI Generation to enhance MITRE Technique Mapping

Álvaro Ruiz-Ródenas, Jaime Pujante Sáez, Daniel García-Algora, Mario Rodríguez Béjar, Jorge Blasco, Jose Luis Hernandez-Ramos

January 2025

Abstract

Cyber Threat Intelligence (CTI) mining extracts structured insights from unstructured threat data, enabling organizations to understand and respond to evolving adversarial behavior. A key task is mapping threat descriptions to MITRE ATT&CK techniques. However, this is often performed manually, requiring expert knowledge and substantial effort. Automated approaches face two major challenges: scarcity of high-quality labeled CTI data and class imbalance. While domain-specific Large Language Models (LLMs) such as SecureBERT have improved performance, most recent work focuses on model architecture rather than data limitations. We hypothesize that semantically guided synthetic CTI generation can mitigate such limitations. To test this hypothesis, we present SynthCTI, a data augmentation framework designed to generate high-quality synthetic CTI sentences for underrepresented MITRE ATT&CK techniques. The methodology converts CTI sentences into semantic vector representations, clusters them with HDBSCAN to identify semantically coherent groups, and extracts contextual features to construct prompts. These prompts guide an LLM to produce diverse and semantically faithful synthetic CTI sentences. We evaluate SynthCTI on two publicly available CTI datasets, CTI-to-MITRE and TRAM, using models with different capacities. Incorporating synthetic data leads to consistent macro-F1 improvements: for example, ALBERT improves from 0.35 to 0.52 (48.6% relative gain), and SecureBERT from 0.4412 to 0.6558. Smaller models augmented with SynthCTI outperform larger models trained without augmentation, demonstrating the value of data generation for CTI classification. These results confirm our hypothesis and highlight the practicality of enabling smaller organizations to adopt advanced CTI analytics without requiring high-performance computing infrastructure.
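The prompt-construction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the sentences have already been embedded and clustered, so the cluster labels below are hand-assigned stand-ins for HDBSCAN output, where -1 conventionally marks noise. The example sentences, the `min_count` threshold, the grouping by (technique, cluster) pair, the prompt wording, and the function names `underrepresented` and `build_prompts` are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def underrepresented(techniques, min_count):
    """Return technique IDs that have fewer than min_count labeled sentences."""
    counts = Counter(techniques)
    return sorted(t for t, n in counts.items() if n < min_count)


def build_prompts(sentences, techniques, clusters, targets, n_new=5, max_examples=3):
    """Build one generation prompt per (technique, cluster) group.

    clusters holds integer cluster labels (assumed to come from HDBSCAN);
    sentences labeled -1 are treated as noise and skipped.
    """
    groups = defaultdict(list)
    for text, tech, cid in zip(sentences, techniques, clusters):
        if tech in targets and cid != -1:
            groups[(tech, cid)].append(text)

    prompts = {}
    for (tech, cid), examples in sorted(groups.items()):
        shown = "\n".join(f"- {s}" for s in examples[:max_examples])
        # Hypothetical prompt template; the paper's actual prompts may differ.
        prompts[(tech, cid)] = (
            f"Write {n_new} new CTI sentences describing MITRE ATT&CK technique {tech}.\n"
            f"Stay close in meaning to these examples but vary the wording:\n{shown}"
        )
    return prompts


# Illustrative toy data (not taken from CTI-to-MITRE or TRAM).
sentences = [
    "The actor sent a spearphishing email with a malicious attachment.",
    "Victims received emails containing weaponized Word documents.",
    "A phishing link redirected users to a credential harvesting page.",
    "The lure pointed to a fake login portal hosted by the attacker.",
    "The malware launched PowerShell to download a second-stage payload.",
    "Attackers used a PowerShell script to run encoded commands.",
    "The implant executed operator commands through cmd.exe.",
]
techniques = ["T1566"] * 4 + ["T1059"] * 3
clusters = [0, 0, 1, 1, 2, 2, -1]  # stand-in for HDBSCAN labels

targets = underrepresented(techniques, min_count=4)
prompts = build_prompts(sentences, techniques, clusters, targets)

for key, prompt in prompts.items():
    print(key)
    print(prompt)
```

In this toy run only T1059 falls below the threshold, and its two clustered sentences seed a single prompt while the noise sentence is left out. Each prompt string would then be sent to an LLM, and the generated sentences added to the training set under the same technique label.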

Type

Journal article

Publication

Future Generation Computer Systems, article 108232, DOI: https://doi.org/10.1016/j.future.2025.108232