The knowledge acquisition methods are devoted to intelligent processing of artifacts generated within the software development domain, especially source code and documentation, but also “side products” of developer activity such as notes or web search logs (generated by developers when they search for online help). The key result of this processing is a resource metadata layer in the form of information tags, which serve as input for the other methods.
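To make the notion of an information tag concrete, the following minimal sketch shows one possible representation as a data structure. The field names (resource_uri, anchor, tag_type, confidence) are illustrative assumptions for this example, not the actual schema of our metadata layer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class InformationTag:
    """One possible shape of an information tag (illustrative fields)."""
    resource_uri: str          # artifact the tag describes (e.g., a source file)
    anchor: tuple[int, int]    # offsets delimiting the tagged entity
    tag_type: str              # e.g., "author", "tier", "similarity"
    value: object              # type-specific payload
    confidence: float = 1.0    # automated methods may attach a certainty score
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Hypothetical example: an authorship tag over a method spanning offsets 120..480
tag = InformationTag(
    resource_uri="repo://project/src/OrderService.java",
    anchor=(120, 480),
    tag_type="author",
    value={"developer": "alice", "share": 0.8},
    confidence=0.9,
)
```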
Considering metadata creation, we rely on several approaches. For describing individual source code entities, we devised methods for source code marking, tier detection and authorship detection. Other automated methods are devoted to identifying similarities between code entities (usable especially for recommendation), based on abstract syntax trees and on document segmentation. Using a crowd-based method, we also interconnect metadata-layer entities.
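As an illustration of how structural similarity between code entities can be derived from abstract syntax trees, the sketch below compares two fragments by their AST node-type profiles using weighted Jaccard similarity. This is a simplified stand-in, not our actual method: parsing with Python’s ast module and the particular similarity measure are assumptions made for the example.

```python
import ast
from collections import Counter

def node_type_profile(source: str) -> Counter:
    """Count AST node types in a fragment (a coarse structural fingerprint)."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

def ast_similarity(src_a: str, src_b: str) -> float:
    """Weighted Jaccard similarity of two node-type profiles, in [0, 1]."""
    a, b = node_type_profile(src_a), node_type_profile(src_b)
    shared = sum((a & b).values())   # multiset intersection
    total = sum((a | b).values())    # multiset union
    return shared / total if total else 1.0

# Renamed identifiers do not change the structure, so similarity is 1.0
print(ast_similarity(
    "def add(x, y):\n    return x + y",
    "def plus(a, b):\n    return a + b",
))
```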
Equally important as metadata creation is metadata maintenance. In the context of a software project, this is a permanent as well as non-trivial process. Because software artifacts such as source code change over time (and can change at very high rates), the metadata describing them must be updated or discarded as quickly as possible (ideally upon each commit). Yet a brute-force re-creation of the information tags is not feasible (due to limited time and manpower), so intelligent methods had to be developed to perform the upkeep in a less resource-consuming manner. To this end, we followed two approaches: an automated one, which detects possibly invalid metadata by analyzing micro-changes of the resource, and a crowd-based one, which uses the manpower of a crowd of currently available developers to perform validation tasks in cases that the automated approach marks as “uncertain”.
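The automated maintenance step can be pictured as a triage over existing tags after a commit: tags anchored to untouched code are kept, tags clearly affected by a change are discarded for re-creation, and borderline cases are routed to the crowd. The sketch below assumes tags anchored to line spans and a certainty score attached to the automated verdict; both are hypothetical simplifications, not the actual micro-change analysis.

```python
def overlaps(tag_span, change_span):
    """True if a tag's line span intersects a changed span."""
    return tag_span[0] <= change_span[1] and change_span[0] <= tag_span[1]

def triage_tags(tags, changed_spans, certainty_threshold=0.7):
    """Split tags into kept / discarded / crowd-review buckets.

    tags: iterable of (tag_id, (start_line, end_line), certainty) tuples,
    where certainty is the automated check's confidence in its verdict.
    changed_spans: line spans touched by the latest commit.
    """
    keep, discard, crowd = [], [], []
    for tag_id, span, certainty in tags:
        if not any(overlaps(span, ch) for ch in changed_spans):
            keep.append(tag_id)       # untouched code: tag stays valid
        elif certainty < certainty_threshold:
            crowd.append(tag_id)      # “uncertain”: delegate to developers
        else:
            discard.append(tag_id)    # clearly affected: discard or recompute
    return keep, discard, crowd
```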
Finally, end-user search-in-source-code functionality is provided by a method that considers code fragment popularity as the main result ranking criterion.
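One simple way to realize such popularity-based ranking is to blend a textual relevance score with a log-damped popularity signal, as in the sketch below; the blend weight alpha and the popularity measure (e.g., how often a fragment was viewed or copied) are illustrative choices, not the method’s actual parameters.

```python
import math

def rank_results(candidates, alpha=0.5):
    """Order search hits by a blend of text relevance and fragment popularity.

    candidates: dicts with 'fragment', 'relevance' (e.g., a TF-IDF score)
    and 'popularity' (a usage count); alpha balances the two signals.
    """
    def score(c):
        return alpha * c["relevance"] + (1 - alpha) * math.log1p(c["popularity"])
    return sorted(candidates, key=score, reverse=True)

hits = [
    {"fragment": "parseConfig()", "relevance": 0.82, "popularity": 3},
    {"fragment": "loadConfig()",  "relevance": 0.74, "popularity": 57},
]
# The slightly less relevant but far more popular fragment is ranked first
print([h["fragment"] for h in rank_results(hits)])
```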
It is worth noting that the presented approaches may be based not only on the actual representation of various domain artifacts, but also on data about their usage or processing. However, they do not consider any explicit user model of an individual, which is why they are presented here.