DOI: 10.17587/prin.15.555-569
Analysis of Datasets and Large Language Models for Vulnerability Detection in Imperative Programming Language Code
V. V. Shvyrov, Associate Professor, slshj@yandex.ru,
D. A. Kapustin, Associate Professor, kap-kapchik@mail.ru,
R. N. Sentyay, Senior Lecturer, sentyayroman@yandex.ru,
T. I. Shulika, Assistant, shulika-tatyana@mail.ru,
Lugansk State Pedagogical University, Lugansk, 91011, Russian Federation
Corresponding author: Denis A. Kapustin, Associate Professor, Lugansk State Pedagogical University, Lugansk, 91011, Russian Federation, E-mail: kap-kapchik@mail.ru
Received on August 27, 2024
Accepted on September 24, 2024
Large language models are machine learning models that can classify and generate both natural language text and code in various programming languages. These models have billions of parameters and are trained on vast datasets. In recent years, they have been applied successfully to a wide range of software engineering tasks. This paper presents data on publication activity in this area, obtained by statistical analysis of search results for relevant key queries. It also reviews recent publications on the use of large language models to detect vulnerabilities in program code and presents an analysis of the datasets used to train neural network models for vulnerability detection.
Keywords: CWE, Large Language Models, Programming Languages, Static Analysis, Vulnerability Dataset, Vulnerability Detection
pp. 555—569
For citation:
Shvyrov V. V., Kapustin D. A., Sentyay R. N., Shulika T. I. Analysis of Datasets and Large Language Models for Vulnerability Detection in Imperative Programming Language Code, Programmnaya Ingeneria, 2024, vol. 15, no. 11, pp. 555—569. DOI: 10.17587/prin.15.555-569. (in Russian).
References:
- Russell R., Kim L., Hamilton L. et al. Automated vulnerability detection in source code using deep representation learning, Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA 2018). IEEE, 2018, pp. 757—762. DOI: 10.1109/ICMLA.2018.00120.
- Zhou Y., Liu S., Siow J. et al. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. 2019. arXiv:1909.03496.
- Li Z., Zou D., Xu S. et al. SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities, IEEE Transactions on Dependable and Secure Computing, 2022, vol. 19, no. 4, pp. 2244—2258. DOI: 10.1109/TDSC.2021.3051525.
- Bhandari G. P., Naseer A., Moonen L. CVEfixes: automated collection of vulnerabilities and their fixes from open-source software, Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering, 2021, pp. 30—39. DOI: 10.1145/3475960.3475985.
- Chakraborty S., Krishna R., Ding Y., Ray B. Deep Learning Based Vulnerability Detection: Are We There Yet? IEEE Transactions on Software Engineering, 2022, vol. 48, no. 9, pp. 3280—3296. DOI: 10.1109/TSE.2021.3087402.
- Sun Y., Wong A. K., Kamel M. S. Classification of Imbalanced Data: a Review, International journal of pattern recognition and artificial intelligence, 2009, vol. 23, no. 04, pp. 687—719. DOI: 10.1142/S0218001409007326.
- Cheshkov A., Zadorozhny P. A., Levichev R. Evaluation of ChatGPT Model for Vulnerability Detection. ArXiv, abs/2304.07232, 2023. DOI: 10.48550/arXiv.2304.07232.
- Fu M., Tantithamthavorn C. K., Nguyen V., Le T. ChatGPT for Vulnerability Detection, Classification, and Repair: How Far Are We? 2023 30th Asia-Pacific Software Engineering Conference (APSEC), 2023, pp. 632—636. DOI: 10.1109/APSEC60848.2023.00085.
- Fan J., Li Y., Wang S., Nguyen T. N. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries, Proceedings of the 17th International Conference on Mining Software Repositories (MSR), 2020, pp. 508—512. DOI: 10.1145/3379597.3387501.
- Fu M., Tantithamthavorn C. K., Le T. et al. AIBugHunter: A Practical tool for predicting, classifying and repairing software vulnerabilities, Empirical Software Engineering, 2023, vol. 29, article 4, pp. 1—33. DOI: 10.1007/s10664-023-10346-3.
- Fu M., Nguyen V., Tantithamthavorn C. K. et al. VulExplainer: A Transformer-Based Hierarchical Distillation for Explaining Vulnerability Types, IEEE Transactions on Software Engineering, 2023, vol. 49, no. 10, pp. 4550—4565. DOI: 10.1109/TSE.2023.3305244.
- Vaswani A., Shazeer N. M., Parmar N., Uszkoreit J. et al. Attention is All you Need, Neural Information Processing Systems, 2017, pp. 5998—6008.
- Feng Z., Guo D., Tang D., Duan N. et al. CodeBERT: A Pre-Trained Model for Programming and Natural Languages, Findings of the Association for Computational Linguistics: EMNLP, 2020, pp. 1536—1547. DOI: 10.18653/v1/2020.findings-emnlp.139.
- Fu M., Tantithamthavorn C. K. LineVul: A Transformer-based Line-Level Vulnerability Prediction, 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022, pp. 608—620. DOI: 10.1145/3524842.3528452.
- CWE, available at: https://cwe.mitre.org/ (date of access 21.08.2024).
- Hugging Face. The AI community building the future, available at: https://huggingface.co/ (date of access 21.08.2024).
- The latest in Machine Learning. Papers With Code, available at: https://paperswithcode.com/ (date of access 21.08.2024).
- CVE, available at: https://cve.mitre.org/ (date of access 21.08.2024).
- Lin G., Wen S., Han Q. et al. Software Vulnerability Detection Using Deep Neural Networks: A Survey, Proceedings of the IEEE, 2020, vol. 108, no. 10, pp. 1825—1848. DOI: 10.1109/JPROC.2020.2993293.
- Wu B., Zou F. Code Vulnerability Detection Based on Deep Sequence and Graph Models: A Survey, Security and Communication Networks, 2022. DOI: 10.1155/2022/1176898.
- Hou X., Zhao Y., Liu Y. et al. Large Language Models for Software Engineering: A Systematic Literature Review, 2023. ArXiv, abs/2308.10620. DOI: 10.48550/arXiv.2308.10620.
- Chan A., Kharkar A., Moghaddam R. Z., Mohylevskyy Y. et al. Transformer-based Vulnerability Detection in Code at EditTime: Zero-shot, Few-shot, or Fine-tuning? 2023. ArXiv, abs/2306.01754. DOI: 10.48550/arXiv.2306.01754.
- Thapa C., Jang S. I., Ahmed M. E. et al. Transformer-Based Language Models for Software Vulnerability Detection, Proceedings of the 38th Annual Computer Security Applications Conference, 2022, pp. 481—496. DOI: 10.1145/3564625.3567985.
- Zou D., Wang S., Xu S., Li Z., Jin H. μVulDeePecker: A Deep Learning-Based System for Multiclass Vulnerability Detection, IEEE Transactions on Dependable and Secure Computing, 2022, vol. 18, no. 5, pp. 2224—2236. DOI: 10.1109/TDSC.2019.2942930.
- Zhou X., Cao S., Sun X., Lo D. Large Language Model for Vulnerability Detection and Repair: Literature Review and the Road Ahead, 2024. ArXiv, abs/2404.02525. DOI: 10.48550/arXiv.2404.02525.
- Hanif H., Maffeis S. VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection, 2022 International Joint Conference on Neural Networks (IJCNN), 2022, pp. 1—8. DOI: 10.1109/IJCNN55064.2022.9892280.
- Lu S., Guo D., Ren S., Huang J. et al. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation, 2021. ArXiv, abs/2102.04664.
- Purba M. D., Ghosh A., Radford B. J., Chu B. Software Vulnerability Detection using Large Language Models, 2023 IEEE 34th International Symposium on Software Reliability Engineering Workshops (ISSREW), 2023, pp. 112—119. DOI: 10.1109/ISSREW60843.2023.00058.
- Li R., Allal L. B., Zi Y. et al. StarCoder: may the source be with you! 2023. ArXiv, abs/2305.06161. DOI: 10.48550/arXiv.2305.06161.
- Luo Z., Xu C., Zhao P., Sun Q. et al. WizardCoder: Empowering Code Large Language Models with Evol-Instruct, 2023. ArXiv, abs/2306.08568. DOI: 10.48550/arXiv.2306.08568.
- Shestov A., Cheshkov A., Levichev R., Mussabayev R. et al. Finetuning Large Language Models for Vulnerability Detection, 2024. ArXiv, abs/2401.17010. DOI: 10.48550/arXiv.2401.17010.
- Steenhoek B., Rahman M. M., Roy M. K. et al. A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection, 2024. ArXiv, abs/2403.17218. DOI: 10.48550/arXiv.2403.17218.
- Casey B., Santos J. C., Perry G. A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks, 2024. ArXiv, abs/2403.10646. DOI: 10.48550/arXiv.2403.10646.
- Mester A., Bodo Z. Malware Classification Based on Graph Convolutional Neural Networks and Static Call Graph Features, International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, 2022, pp. 528—539. DOI: 10.1007/978-3-031-08530-7_45.
- Cao S., Sun X., Bo L., Wei Y., Li B. BGNN4VD: Constructing Bidirectional Graph Neural-Network for Vulnerability Detection. Information and Software Technology, 2021, vol. 136, article 106576. DOI: 10.1016/J.INFSOF.2021.106576.
- Siow J., Liu S., Xie X., Meng G., Liu Y. Learning Program Semantics with Code Representations: An Empirical Study, 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2022, pp. 554—565. DOI: 10.48550/arXiv.2203.11790.
- Renjith G., Aji S. Vulnerability Analysis and Detection Using Graph Neural Networks for Android Operating System, International Conferences on Information Science and System, 2021, pp. 57—72. DOI: 10.1007/978-3-030-92571-0_4.
- Wu H., Zhang Z., Wang S. et al. Peculiar: Smart Contract Vulnerability Detection Based on Crucial Data Flow Graph and Pre-training Techniques, 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE), 2021, pp. 378—389. DOI: 10.1109/ISSRE52982.2021.00047.
- Zhou L., Huang M., Li Y. et al. GraphEye: A Novel Solution for Detecting Vulnerable Functions Based on Graph Attention Network, 2021 IEEE Sixth International Conference on Data Science in Cyberspace (DSC), 2021, pp. 381—388. DOI: 10.1109/dsc53577.2021.00060.
- Partenza G., Amburgey T., Deng L. et al. Automatic Identification of Vulnerable Code: Investigations with an AST-Based Neural Network, 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), 2021, pp. 1475—1482. DOI: 10.1109/COMPSAC51774.2021.00219.
- Velickovic P., Cucurull G., Casanova A. et al. Graph Attention Networks, 2018. ArXiv, abs/1710.10903. DOI: 10.17863/CAM.48429.
- Do H., Elbaum S. G., Rothermel G. Supporting Controlled Experimentation with Testing Techniques: An Infrastructure and its Potential Impact, Empirical Software Engineering, 2005, vol. 10, no. 4, pp. 405—435. DOI: 10.1007/s10664-005-3861-2.
- Okun V., Delaitre A., Black P. E. Report on the Static Analysis Tool Exposition (SATE) IV, NIST Special Publication, 2013, vol. 500. DOI: 10.6028/NIST.SP.500-297.
- OWASP. Benchmark Project, available at: https://owasp.org/www-project-benchmark/ (date of access 21.08.2024).
- Software Assurance Reference Dataset, available at: https://samate.nist.gov/SARD/ (date of access 21.08.2024).
- Wang C., Li Z., Peng Y., Gao S. et al. REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes, 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2023, pp. 1952—1962. DOI: 10.1109/ASE56229.2023.00199.
- Siddiq M., Santos J. SecurityEval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques, Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security, 2022, pp. 29—33. DOI: 10.1145/3549035.3561184.
- Chen Y., Ding Z., Alowain L. et al. DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection, Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses, 2023, pp. 654—668. DOI: 10.1145/3607199.3607242.
- Nikitopoulos G., Dritsa K., Louridas P., Mitropoulos D. CrossVul: a cross-language vulnerability dataset with commit data, Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 1565—1569. DOI: 10.1145/3468264.3473122.
- Ding Y., Fu Y., Ibrahim O. et al. Vulnerability Detection with Code Language Models: How Far Are We? 2024. ArXiv, abs/2403.18624. DOI: 10.48550/arXiv.2403.18624.
- Choi M., Jeong S., Oh H., Choo J. End-to-End Prediction of Buffer Overruns from Raw Source Code via Neural Memory Networks, 2017. ArXiv, abs/1703.02458. DOI: 10.24963/ijcai.2017/214.
- Wartschinski L., Noller Y., Vogel T. et al. VUDENC: Vulnerability Detection with Deep Learning on a Natural Codebase for Python, 2022. ArXiv, abs/2201.08441. DOI: 10.1016/j.infsof.2021.106809.
- Reis S., Abreu R. A ground-truth dataset of real security patches, 2021. ArXiv, abs/2110.09635.
- Ni C., Shen L., Yang X. et al. MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representations, 2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR), 2024, pp. 738—742. DOI: 10.1145/3643991.3644886.
- GitHub Inc. 2021. CodeQL for research, available at: https://securitylab.github.com/tools/codeql (date of access 21.08.2024).