Institution: | Slovak University of Technology |
Technologies used: | C#.NET |
Inputs: | Set of text documents in .TXT format (already preprocessed) |
Outputs: | Segment graph containing different types of relations between text documents |
Addressed problem
Document similarity can be viewed at different levels. For instance, a user searching for data on the internet is normally interested in resources that are topically similar to the target domain. In contrast, when identifying plagiarism in an academic environment we are no longer interested in topic similarity; we want to identify paraphrasing or copy-paste patterns in the text. Both cases nevertheless have something in common: in both we identify relations of different types among documents. In our opinion, successful identification of such relations is crucial for similarity detection in text data.

A common problem in identifying relations among texts is determining their type. For instance, a text document can be a detailed description of a section belonging to another document. Two different documents can describe a similar topic throughout their whole content. Document d1 can paraphrase document d2, or can plagiarize it. In each of these cases a different relation exists between the document pair. Moreover, long text documents are heterogeneous: they usually discuss several topics in their different parts. To better model this topical heterogeneity, we propose to segment documents into subtopics and then identify different types of relations between these subtopics rather than between whole documents.
Description
We propose a new model for identifying relationships between text documents written in natural language, based on the construction of a segment graph. We model a text document as a set of topics: one global topic and several local topics. The global topic describes the overall content of a text document, while a local topic (segment) represents a subtopic of the document. A text document can contain several local topics or none, but it always contains exactly one global topic. Every local and global topic is represented by a set of keywords. The segmentation into local and global segments (as we call the topics from now on) is based on observing the distribution of keywords across the document's content. The relations between text documents are determined by the types of the segments that are related. For instance, when two local segments are related, we speak of an LL (local-to-local) relation. The LL relation models a situation in which two different documents share a part of their content. Following this idea, other relations such as GG (global-to-global) or LG (local-to-global) are possible. The final output of the method is a segment graph that contains local and global segments (vertices) and the relations that connect them (edges). From the relations in this graph we can extract similar parts of documents, and from the types of these relations (the type depends on which types of segments a relation connects) we can identify topic hierarchies.
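The structure of the segment graph can be illustrated with a minimal sketch. It is written in Python for brevity and is not the C#.NET implementation described above. The names `Segment`, `jaccard`, `relation_type` and `build_segment_graph` are hypothetical. The use of Jaccard overlap of keyword sets as the relatedness test, and the threshold of 0.3, are assumptions made only for this illustration; the description does not state how relatedness between segments is measured.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Segment:
    """A vertex of the segment graph (hypothetical representation)."""
    doc_id: str
    kind: str            # "G" = global segment, "L" = local segment
    keywords: frozenset  # keywords that represent the segment's topic


def jaccard(a, b):
    """Keyword overlap of two sets; an assumed stand-in for the real measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relation_type(s, t):
    """Label an edge by the kinds of segments it connects: GG, LG or LL."""
    # Sorting in reverse puts "L" first, so a mixed pair is always "LG".
    return "".join(sorted((s.kind, t.kind), reverse=True))


def build_segment_graph(segments, threshold=0.3):
    """Return edges (s, t, type, score) between segments of different documents."""
    edges = []
    for s, t in combinations(segments, 2):
        if s.doc_id == t.doc_id:
            continue  # relations are sought between documents, not within one
        score = jaccard(s.keywords, t.keywords)
        if score >= threshold:  # assumed cut-off, not taken from the source
            edges.append((s, t, relation_type(s, t), score))
    return edges


# Two toy documents, each with one global and one local segment.
segments = [
    Segment("d1", "G", frozenset({"similarity", "graph", "segments"})),
    Segment("d1", "L", frozenset({"plagiarism", "paraphrase", "copy"})),
    Segment("d2", "G", frozenset({"similarity", "graph", "keywords"})),
    Segment("d2", "L", frozenset({"plagiarism", "paraphrase", "detection"})),
]

for s, t, rel, score in build_segment_graph(segments):
    print(f"{s.doc_id}/{s.kind} -- {rel} ({score:.2f}) -- {t.doc_id}/{t.kind}")
```

In this toy input the two global segments share keywords, giving a GG edge (the documents cover a similar overall topic), and the two local segments share keywords, giving an LL edge (the documents share a part of their content). An LG edge would arise when a local segment of one document matches the global segment of another, which is the case that hints at a topic hierarchy.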