ABSTRACT: Analysis of prevalent document management practices shows the popular use of categories (e.g., folders) to organize documents for subsequent searches and retrievals. The coherence and distinctiveness of existing document categories can diminish considerably as new documents arrive over time. The complexity of and effort required for document-category management favor an automated approach supported by appropriate document-clustering techniques. A review of the extant literature shows a predominant focus on document content analysis in automated document-category management, an approach that cannot preserve the user's document-grouping preferences. This research develops two advanced evolution-based techniques that preserve users' preferences in the management of document categories. The first technique (CE2), which supports the automated evolution of a set of flat (i.e., nonhierarchical) document categories, extends a promising evolution-based technique (category evolution, CE) by addressing fundamental limitations inherent in its use of holistic measures. The second technique, category hierarchy evolution (CHE), builds on CE2 to support scenarios in which document categories are organized in a hierarchical structure. We empirically evaluate each technique in various category evolution scenarios created from two document corpora (news documents from Reuters and research articles from the ACM Digital Library) and compare it with associated salient benchmark techniques; CE2 and CHE outperform their respective benchmarks. Both techniques are reasonably robust and appear more effective when the quality (coherence) of the previously created categories has not deteriorated excessively.
According to our results, the evolution-based approach is viable, appealing, and capable of preserving user preferences in automatic reorganizations of document categories.
Key words and phrases: category evolution, category hierarchy evolution, document-category management, document clustering, hierarchical agglomerative clustering, text mining