SOM-based Clustering of Multilingual Documents Using an Ontology

Minh Hai Pham
2008
Book chapter

Abstract

Clustering similar documents is a difficult task in text data mining. The difficulties stem mainly from the way documents are translated into numerical vectors. In this chapter, we present a method that uses a Self-Organizing Map (SOM) to cluster medical documents. The originality of the method is that it relies not on the words shared by documents but on concepts taken from an ontology. Our goal is to cluster various medical documents into thematically consistent groups (e.g., grouping all the documents related to cardiovascular diseases). Before the SOM algorithm is applied, documents go through several preprocessing steps. First, textual data are extracted from the documents, which can be in either PDF or HTML format. Documents are then indexed using two kinds of indexing units: stems and concepts. After indexing, each document is represented numerically by a vector whose dimensions correspond to indexing units and whose entries store the weight of each indexing unit within that document. These vectors are given as inputs to a SOM, which arranges the corresponding documents on a two-dimensional map. We have compared the results for two indexing schemes: stem-based indexing and conceptual indexing. We show that using an ontology for document clustering has several advantages. First, it makes it possible to cluster documents written in several languages, since concepts are language-independent. This is especially helpful in the medical domain, where research articles are written in different languages. Second, the use of concepts helps reduce the size of the vectors, which in turn reduces processing time.
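The pipeline described in the abstract (conceptual indexing, weight vectors, then a SOM on a two-dimensional grid) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the chapter: the tiny `TERM_TO_CONCEPT` table stands in for a real medical ontology, the weight is a plain L2-normalised concept frequency, and the SOM uses a Gaussian neighbourhood with linearly decaying learning rate and radius on a 3x3 grid.

```python
import math
import random
from collections import Counter

# Hypothetical toy "ontology": English and French terms mapped to
# language-independent concept identifiers (illustrative only).
TERM_TO_CONCEPT = {
    "heart": "C_HEART", "coeur": "C_HEART",
    "artery": "C_ARTERY", "artere": "C_ARTERY",
    "lung": "C_LUNG", "poumon": "C_LUNG",
    "asthma": "C_ASTHMA", "asthme": "C_ASTHMA",
}
CONCEPTS = sorted(set(TERM_TO_CONCEPT.values()))


def concept_vector(text):
    """Index a document by concepts and return an L2-normalised weight vector."""
    counts = Counter(TERM_TO_CONCEPT[t] for t in text.lower().split()
                     if t in TERM_TO_CONCEPT)
    vec = [float(counts[c]) for c in CONCEPTS]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, vec):
    """Return the (row, col) grid position whose weight vector is closest to vec."""
    return min(weights, key=lambda pos: sq_dist(weights[pos], vec))


def train_som(vectors, rows=3, cols=3, epochs=200, seed=0):
    """Train a small SOM; returns a dict mapping (row, col) to a weight vector."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = 0.5 * (1.0 - frac)                              # decaying learning rate
        radius = max(rows, cols) / 2.0 * (1.0 - frac) + 0.5  # shrinking neighbourhood
        for vec in rng.sample(vectors, len(vectors)):        # shuffled pass over documents
            br, bc = best_matching_unit(weights, vec)
            for (r, c), w in weights.items():
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius ** 2)) # Gaussian neighbourhood
                for i in range(dim):
                    w[i] += lr * h * (vec[i] - w[i])
    return weights


# Four toy documents: two topics, each written in English and in French.
docs = {
    "en_cardio": "heart artery heart",
    "fr_cardio": "coeur artere coeur",
    "en_pulmo": "lung asthma lung",
    "fr_pulmo": "poumon asthme poumon",
}
vectors = {name: concept_vector(text) for name, text in docs.items()}
som = train_som(list(vectors.values()))
placement = {name: best_matching_unit(som, vec) for name, vec in vectors.items()}
print(placement)
```

Because the English and French versions of a document map to the same concepts, they receive identical vectors and therefore land on the same map unit, which is the language-independence property the abstract points to. The vectors also have one dimension per concept rather than one per stem in each language, which is the size reduction the abstract mentions.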

Official source

https://infoscience.epfl.ch/record/125871?ln=fr

Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap

New Multi-Keyword Ciphertext Search Method for Sensor Network Cloud Platforms

Robust and Efficient Data Clustering with Signal Processing on Graphs
