Paragraph-based representation of texts: A complex networks approach

Abstract

An interesting model for representing texts as a graph (also called a network) is the word adjacency (co-occurrence) representation, which is known to capture mainly syntactic features of texts. In this study, we propose a novel network model based on the similarity between the content of the paragraphs of a text. Using this representation, we characterized the networks with measurements developed in network science. We evaluated these measurements according to their ability to discriminate between real and shuffled texts and to capture information about the content similarity of chunks of text. When comparing real and shuffled texts, the results revealed that real texts tend to have a more well-defined community structure, a characteristic that can be related to the organization of subjects in real texts. The network-based measurements found to discriminate real from shuffled texts were used as features in a classifier, which obtained an accuracy of 98.72%. For comparison with a more sophisticated approach, we used doc2vec-based features in the classifier, yielding an accuracy of 70.8%. Finally, the proposed network-based features were employed to analyze the Voynich manuscript, which was found to be compatible with real texts according to the considered characteristics.
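
To make the idea of a paragraph-based network concrete, the following is a minimal sketch and not the method used in the paper. The abstract does not state how paragraph similarity is measured or how edges are selected, so the bag-of-words cosine similarity, the blank-line paragraph split, the tokenizer, and the `threshold` parameter are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def paragraph_vectors(text):
    """Split text on blank lines and return one word-count vector per paragraph."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Assumption: lowercase alphabetic tokens, no stopword removal or lemmatization.
    return [Counter(re.findall(r"[a-z]+", p.lower())) for p in paragraphs]


def cosine(a, b):
    """Cosine similarity between two word-count vectors (assumed similarity measure)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def paragraph_network(text, threshold=0.3):
    """Return (number of nodes, weighted edges) for a paragraph similarity network.

    Nodes are paragraph indices; an edge (i, j) is kept when the similarity
    reaches `threshold` (an illustrative cutoff, not a value from the paper).
    """
    vectors = paragraph_vectors(text)
    edges = {}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            similarity = cosine(vectors[i], vectors[j])
            if similarity >= threshold:
                edges[(i, j)] = similarity
    return len(vectors), edges


sample = (
    "the cat sat on the mat\n\n"
    "the cat sat on the rug\n\n"
    "stock prices fell sharply today"
)
n_nodes, edges = paragraph_network(sample)
print(n_nodes, edges)
```

In this toy input only the first two paragraphs share vocabulary, so they are the only pair linked. Shuffling the words of a text before splitting it into paragraphs would give the kind of shuffled counterpart against which the abstract compares real texts.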

Publication
Information Processing & Management
Date