Publication

SOM-based Clustering of Multilingual Documents Using an Ontology

Minh Hai Pham
2008
Book chapter
Abstract

Clustering similar documents is a difficult task in text data mining. The difficulties stem especially from the way documents are translated into numerical vectors. In this chapter, we present a method that uses a Self-Organizing Map (SOM) to cluster medical documents. The originality of the method is that it does not rely on the words shared by documents, but rather on concepts taken from an ontology. Our goal is to cluster various medical documents into thematically consistent groups (e.g., grouping all the documents related to cardiovascular diseases). Before the SOM algorithm is applied, documents have to go through several preprocessing steps. First, textual data have to be extracted from the documents, which can be in either PDF or HTML format. Documents are then indexed using two kinds of indexing units: stems and concepts. After indexing, documents can be represented numerically by vectors whose dimensions correspond to indexing units. Each vector stores the weights of the indexing units within the document it represents. The vectors are given as inputs to a SOM, which arranges the corresponding documents on a two-dimensional map. We have compared the results for two indexing schemes: stem-based indexing and conceptual indexing. We show that using an ontology for document clustering has several advantages. It becomes possible to cluster documents written in several languages, since concepts are language-independent. This is especially helpful in the medical domain, where research articles are written in different languages. Another advantage is that the use of concepts helps reduce the size of the vectors, which, in turn, reduces processing time.
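The pipeline described in the abstract (concept indexing, vector representation, SOM placement) can be illustrated with a minimal sketch. This is not the chapter's implementation: the tiny term-to-concept table, the concept identifiers, the normalised-frequency weighting, and all function names and parameters below are assumptions made for illustration. Text extraction from PDF or HTML and stemming are left out, and accents are dropped from the French terms to keep tokenisation trivial.

```python
import math
import random

# Hypothetical toy "ontology": English and French terms mapped to shared concept IDs.
TERM_TO_CONCEPT = {
    "heart": "C_HEART", "coeur": "C_HEART",
    "artery": "C_ARTERY", "artere": "C_ARTERY",
    "lung": "C_LUNG", "poumon": "C_LUNG",
    "asthma": "C_ASTHMA", "asthme": "C_ASTHMA",
}
CONCEPTS = sorted(set(TERM_TO_CONCEPT.values()))


def concept_vector(text):
    """Return a length-normalised concept-frequency vector (one dimension per concept)."""
    counts = dict.fromkeys(CONCEPTS, 0)
    for token in text.lower().split():
        concept = TERM_TO_CONCEPT.get(token)
        if concept is not None:
            counts[concept] += 1
    vec = [float(counts[c]) for c in CONCEPTS]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, vec):
    """Grid position of the map unit whose weight vector is closest to vec."""
    return min(weights, key=lambda pos: sq_dist(weights[pos], vec))


def train_som(vectors, rows=2, cols=2, epochs=200, lr0=0.5, radius0=1.0, seed=0):
    """Train a small rectangular SOM with a Gaussian neighbourhood."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # decays linearly from 1 towards 0
        lr = lr0 * frac                      # learning rate shrinks over time
        radius = max(radius0 * frac, 0.1)    # neighbourhood radius shrinks too
        for vec in vectors:
            bmu = best_matching_unit(weights, vec)
            for pos, w in weights.items():
                grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * h * (vec[i] - w[i])
    return weights


docs = [
    "the heart and the artery",      # English, cardiovascular
    "le coeur et l artere",          # French, cardiovascular
    "asthma affects the lung",       # English, respiratory
    "l asthme touche le poumon",     # French, respiratory
]
vectors = [concept_vector(d) for d in docs]
som = train_som(vectors)
for doc, vec in zip(docs, vectors):
    print(best_matching_unit(som, vec), doc)
```

Because the English and French sentences on the same topic map to identical concept vectors, they necessarily land on the same map unit, which is the language-independence argument made in the abstract. The vectors here have only four dimensions (one per concept) rather than one per distinct word form, which also illustrates the claimed reduction in vector size.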
