Source Code Fragments Similarities Detection Using Abstract Syntax Trees

Institution:	Slovak University of Technology
Technologies used:	C#.NET
Inputs:	source codes
Outputs:	similar/duplicated source code fragments interconnected by information tags

Addressed problem

Copying parts of source code is a widely used form of code reuse. Programmers do it even though most of them know why they should not, and many have already suffered the consequences of duplicated code. These include the additional effort needed to maintain the code and a large number of errors that arise when duplicated code is modified in some places but not in others. However, only a handful of tools search for duplicated code, and even those are rarely used in practice. We believe the reasons may be their high computational and memory requirements or unsatisfactory results. We therefore propose a method for detecting duplicated code that uses abstract syntax trees. Our goal is to create a practically useful tool for incremental detection of duplicated code.

Description

In our solution, we compare source code using abstract syntax trees, mainly because they abstract away from the textual representation of the code while preserving the information about its structure. Abstract syntax trees have proved effective in static source code analysis, for example in many source code editors and integrated development environments.
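The point about abstracting away from the textual representation can be illustrated with a minimal sketch. It uses Python and its standard `ast` module purely for illustration; it is not the project's C# implementation, and the helper name `structural_dump` and the decision to blank out identifier names are assumptions made for this example.

```python
import ast


def structural_dump(source: str) -> str:
    """Return a textual form of the AST with identifier names blanked out.

    Hypothetical helper for illustration only. ast.dump() omits line and
    column attributes by default, so formatting and comments do not matter.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = "_"        # variable references
        elif isinstance(node, ast.arg):
            node.arg = "_"       # parameter names
        elif isinstance(node, ast.FunctionDef):
            node.name = "_"      # function names
    return ast.dump(tree)


# Two fragments that differ in names, blank lines and comments only.
a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
b = (
    "def sum_all(values):\n\n"
    "    acc = 0\n"
    "    for v in values:  # accumulate\n"
    "        acc += v\n"
    "    return acc\n"
)
# A fragment with a genuinely different structure.
c = "def total(xs):\n    return len(xs)\n"

same_structure = structural_dump(a) == structural_dump(b)
different_structure = structural_dump(a) != structural_dump(c)
```

A purely textual comparison would report `a` and `b` as different, while the tree-based comparison treats them as the same structure and still distinguishes `c`.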

We chose incremental detection of duplicates because we consider it very important for the practical applicability of tools that find duplicated code. Its main advantage is that when source code changes, only the changed parts need to be examined, not the whole code base (which can be computationally intensive, especially for large software projects). Incremental detection requires maintaining an index of the source code. This can be seen as a disadvantage compared with one-time methods, which do not require such an index.

The input of the method is the list of changes in the abstract syntax trees that occurred between two versions of the source code. The output is a list of groups of duplicated code.
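The input and output described above, together with the index that incremental detection needs, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the method from the cited publications: the class `DuplicateIndex`, the helper `functions`, the `("add" | "remove", fragment_id, node)` change format and the use of whole functions as fragments are all hypothetical, and only structurally identical fragments are grouped.

```python
import ast
import hashlib
from collections import defaultdict


def fingerprint(func: ast.FunctionDef) -> str:
    """Hash a function's parameters and body, ignoring its name and positions."""
    text = ast.dump(func.args) + "".join(ast.dump(stmt) for stmt in func.body)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class DuplicateIndex:
    """Hypothetical index mapping structural fingerprints to fragment ids."""

    def __init__(self):
        self._by_hash = defaultdict(set)  # fingerprint -> set of fragment ids
        self._by_id = {}                  # fragment id -> fingerprint

    def remove(self, fragment_id):
        fp = self._by_id.pop(fragment_id, None)
        if fp is not None:
            self._by_hash[fp].discard(fragment_id)
            if not self._by_hash[fp]:
                del self._by_hash[fp]

    def add(self, fragment_id, node):
        self.remove(fragment_id)  # a re-added fragment replaces its old entry
        fp = fingerprint(node)
        self._by_id[fragment_id] = fp
        self._by_hash[fp].add(fragment_id)

    def apply(self, changes):
        """Apply changes given as ("add" | "remove", fragment_id, node)."""
        for kind, fragment_id, node in changes:
            if kind == "remove":
                self.remove(fragment_id)
            else:
                self.add(fragment_id, node)

    def groups(self):
        """Return groups of fragment ids that share a fingerprint."""
        return sorted(sorted(ids) for ids in self._by_hash.values() if len(ids) > 1)


def functions(path, source):
    """Yield (fragment_id, subtree) per top-level function; stands in for a real AST diff."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            yield f"{path}:{node.name}", node


index = DuplicateIndex()

# Version 1: index every fragment once.
a_v1 = "def f(x):\n    return x * 2\n"
b_v1 = "def g(x):\n    return x * 2\n\ndef h(x):\n    return x + 1\n"
index.apply(("add", fid, node) for fid, node in functions("a.py", a_v1))
index.apply(("add", fid, node) for fid, node in functions("b.py", b_v1))
groups_v1 = index.groups()

# Version 2: only g changed, so only g is re-examined.
g_v2 = "def g(x):\n    return x + 1\n"
index.apply(("add", fid, node) for fid, node in functions("b.py", g_v2))
groups_v2 = index.groups()
```

In this sketch the second update touches a single index entry, yet the reported groups change from `f` and `g` to `g` and `h`, which is the behaviour an incremental detector needs. The cost is the memory held by the index between runs.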

References

J. Súkeník, P. Lacko: Duplicated code detection, In: WIKT 2012 : 7th Workshop on Intelligent and Knowledge Oriented Technologies Proceedings, November 22-23, 2012 Smolenice, Slovakia. - Bratislava : STU Press, 2012. (In Slovak)

J. Súkeník: Source Code Analysis Using Abstract Syntax Trees, Diploma thesis, Bratislava: FIIT STU, 2013 (In Slovak)
