MACHINE LEARNING APPROACHES TO CODE SIMILARITY MEASUREMENT: A SYSTEMATIC REVIEW

Mrs. Deepthi Repalle; Aswini; Sindhuja; Poojitha

doi:10.64751/ajmimc.2025.v4.n4(1).pp15-21

Keywords:

Machine learning, code similarity measurement, systematic review, source code analysis, code clone detection, plagiarism detection, code quality assurance, abstract syntax tree (AST), deep learning, hybrid models, code representation, software engineering, code recommendation, malware detection, vulnerability analysis, dataset benchmarking, BigCloneBench, neural networks, Siamese networks, cross-language code similarity, scalability, code review automation, performance evaluation, software maintenance

Abstract

This systematic review analyzes how machine learning (ML) techniques have been applied to measure the degree of resemblance between code segments. Code similarity measurement underpins software engineering tasks such as code quality assurance, plagiarism detection, vulnerability analysis, and malware detection. The review surveyed 84 primary studies and identified 51 different ML algorithms used across 15 application areas. The abstract syntax tree (AST) emerged as the most prevalent code representation, and BigCloneBench was the most frequently used benchmark dataset. Beyond cataloging existing methodologies, the review highlights the integration of deep learning and hybrid ML models that combine syntactic and semantic code features to improve the accuracy and scalability of similarity detection. It notes a trend toward more sophisticated models, such as neural networks and Siamese networks, which better capture intricate code patterns and cross-language similarities. The review also addresses open challenges, notably the scarcity of diverse, high-quality public datasets and the limitations of current ML models in handling large-scale and multi-language code bases, and it underscores the need for scalable tools that can process large, complex code repositories efficiently. The findings further touch on emerging applications, including code recommendation systems and automated code review, which benefit from improved similarity assessment. Proposed directions for future research include enhancing the interpretability of ML models, handling semantic context more effectively, and expanding code similarity studies beyond conventional clone detection to novel tasks in software maintenance and security.
Overall, this systematic review serves as a detailed roadmap for researchers and practitioners who aim to apply or advance ML techniques in code similarity measurement, with attention to both theory and practical impact.
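
To make the AST-based representation mentioned in the abstract concrete, the following is a minimal Python sketch and not a method taken from the reviewed studies. It assumes Python source as input and uses a deliberately simple stand-in for a learned model: each snippet is parsed with the standard `ast` module, reduced to a bag of AST node types, and compared by multiset Jaccard overlap. The function names `node_type_counts` and `ast_similarity` are hypothetical and chosen only for illustration.

```python
import ast
from collections import Counter


def node_type_counts(source: str) -> Counter:
    """Parse Python source and count how often each AST node type occurs."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def ast_similarity(source_a: str, source_b: str) -> float:
    """Multiset Jaccard similarity between the AST node-type bags of two snippets.

    This is an illustrative baseline (an assumption of this sketch), not an
    ML model: it ignores identifiers, literals, and tree shape.
    """
    counts_a = node_type_counts(source_a)
    counts_b = node_type_counts(source_b)
    intersection = sum((counts_a & counts_b).values())  # element-wise minimum
    union = sum((counts_a | counts_b).values())         # element-wise maximum
    return intersection / union if union else 1.0


# Two snippets with the same structure but renamed identifiers,
# and a third, structurally different snippet.
snippet_a = "def add(x, y):\n    return x + y\n"
snippet_b = "def plus(first, second):\n    return first + second\n"
snippet_c = "for i in range(3):\n    print(i)\n"

score_renamed = ast_similarity(snippet_a, snippet_b)
score_different = ast_similarity(snippet_a, snippet_c)
```

Because only node types are counted, the renamed pair scores as identical while the loop snippet scores lower. The learned approaches surveyed in the review, such as Siamese networks, would replace this hand-written overlap measure with embeddings trained on labeled clone pairs.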
