Concept

Information extraction

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, video, and documents, can also be seen as information extraction.

Due to the difficulty of the problem, current approaches to IE (as of 2010) focus on narrowly restricted domains. An example is the extraction of corporate mergers from newswire reports, expressed as a formal relation, from an online news sentence such as: "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."

A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow automated reasoning about the logical form of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context.

Information extraction is part of a greater puzzle: devising automatic methods for text management beyond transmission, storage, and display. The discipline of information retrieval (IR) has developed automatic methods, typically of a statistical flavor, for indexing large document collections and classifying documents. A complementary approach is natural language processing (NLP), which has made considerable progress on modelling human language processing, given the magnitude of the task. In terms of both difficulty and emphasis, IE deals with tasks that lie between IR and NLP.

In terms of input, IE assumes the existence of a set of documents in which each document follows a template, i.e., describes one or more entities or events in a manner similar to other documents but differing in the details.
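The merger example above can be made concrete with a minimal, pattern-based sketch. It assumes a hypothetical `Acquisition` record and a hand-written regular expression that only covers phrasings like "X announced their acquisition of Y" with company names ending in Inc., Corp., or Ltd.; none of these names come from the source, and practical IE systems typically rely on far more robust methods.

```python
import re
from typing import NamedTuple, Optional


class Acquisition(NamedTuple):
    """Hypothetical structured record for one acquisition event."""
    acquirer: str
    target: str


# Assumed naming convention: capitalized words followed by a legal suffix.
COMPANY = r"(?:[A-Z][A-Za-z&]* )+(?:Inc|Corp|Ltd)\."

# Assumed template: "<acquirer> announced its/their acquisition of <target>"
PATTERN = re.compile(
    rf"(?P<acquirer>{COMPANY}) announced (?:its|their) acquisition of "
    rf"(?P<target>{COMPANY})"
)


def extract_acquisition(sentence: str) -> Optional[Acquisition]:
    """Return the acquisition described by the sentence, or None."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return Acquisition(match.group("acquirer"), match.group("target"))


sentence = ("Yesterday, New York based Foo Inc. announced their "
            "acquisition of Bar Corp.")
print(extract_acquisition(sentence))
# Acquisition(acquirer='Foo Inc.', target='Bar Corp.')
```

Once the sentence is reduced to a record like this, ordinary computation (filtering, joining, counting) becomes possible, which is the broad goal described above. The narrowness of the pattern also illustrates why such approaches stay within restricted domains.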

Official source

https://en.wikipedia.org/wiki/Information_extraction

About this result

This page is automatically generated and may contain information that is not correct, complete, up to date, or relevant to your search. The same applies to every other page on this site. Please verify the information against EPFL's official sources.

Related courses (10)

CS-423: Distributed information systems

This course introduces the foundations of information retrieval, data mining and knowledge bases, which constitute the foundations of today's Web-based distributed information systems.

HUM-369: Digital humanities

The Digital Humanities are a discipline at the intersection of information science and the humanities and social sciences. In this course, students discover this field of research ...

ENV-540: Image processing for Earth observation

This course covers optical remote sensing from satellites and airborne platforms. The different systems are presented. The students will acquire skills in image processing and machine/deep learning to

Related people (1)

Daniel Gatica-Perez

Related units (2)

DHI - Gestion

Laboratoire de l'IDIAP

Related concepts (15)

Named-entity recognition

Named-entity recognition is a subtask of information extraction from document corpora. It consists of finding textual objects (that is, a word or a group of words) that can be categorized into classes such as names of persons, names of organizations or companies, place names, quantities, distances, values, dates, etc. As an example, consider the following text, as labeled by a named-entity recognition system used in the MUC evaluation campaign: Henri bought 300 shares of the company AMD in 2006.
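To illustrate what labeling such a sentence might look like, here is a minimal sketch that tags an English rendering of the example using a tiny hand-written gazetteer and two regular expressions. The gazetteer contents, the label names (`PERSON`, `ORGANIZATION`, `QUANTITY`, `DATE`), and the function `tag_entities` are illustrative assumptions, not the scheme or system used in the MUC campaign.

```python
import re

# Hypothetical, hand-written name lists; a real system would learn these
# or look them up in a much larger resource.
GAZETTEER = {"PERSON": {"Henri"}, "ORGANIZATION": {"AMD"}}

YEAR = re.compile(r"\b(?:19|20)\d{2}\b")   # four-digit years, 1900-2099
NUMBER = re.compile(r"\b\d+\b")            # any other integer


def tag_entities(text):
    """Return (surface form, label) pairs in order of appearance."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(rf"\b{re.escape(name)}\b", text):
                found.append((m.start(), m.group(), label))
    year_starts = set()
    for m in YEAR.finditer(text):
        found.append((m.start(), m.group(), "DATE"))
        year_starts.add(m.start())
    for m in NUMBER.finditer(text):
        if m.start() not in year_starts:  # do not tag a year twice
            found.append((m.start(), m.group(), "QUANTITY"))
    return [(surface, label) for _, surface, label in sorted(found)]


text = "Henri bought 300 shares of the company AMD in 2006."
print(tag_entities(text))
# [('Henri', 'PERSON'), ('300', 'QUANTITY'), ('AMD', 'ORGANIZATION'), ('2006', 'DATE')]
```

A lookup-based tagger like this cannot handle unseen names or ambiguous tokens, which is one reason statistical sequence models are commonly used for the task instead.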

Text mining

Text mining, or "knowledge extraction" from texts, is a specialization of data mining and belongs to the field of artificial intelligence. In French, the technique (fouille de textes) is often referred to by the English loanword "text mining". It denotes a set of computational processes that extract knowledge, according to a criterion of novelty or similarity, from texts produced by humans for humans.

Unstructured information

Unstructured information, or unstructured data, is data represented or stored without a predefined format. Such information is always intended for humans. It typically consists of text or multimedia documents, but may also contain dates, numbers, and facts. This lack of format leads to irregularities and ambiguities that can make the data difficult to understand, in contrast to data stored in spreadsheets or databases, for example, which is structured information.

Related lectures (32)

Entity and information extraction

Explores knowledge extraction from text, covering key concepts such as keyphrase extraction and named-entity recognition.

Information extraction: Algorithms and techniques

Explores information extraction algorithms and techniques, including the Viterbi algorithm, named-entity recognition, and distant supervision.

Semantic Web & information extraction

Explores the Semantic Web, ontologies, information extraction, keyphrases, named entities, and knowledge bases.

Related publications (28)

An Ordinal Latent Variable Model of Conflict Intensity

Robert West

Measuring the intensity of events is crucial for monitoring and tracking armed conflict. Advances in automated event extraction have yielded massive data sets of "who did what to whom" micro-records that enable data-driven approaches to monitoring confl ...

Assoc Computational Linguistics-Acl, 2023

Lausanne Historical Censuses Dataset HTR 35k

Lucas Arnaud André Rappo, Rémi Guillaume Petitpierre, Marion Kramer

This training dataset includes a total of 34,913 manually transcribed text segments. It is dedicated to the handwritten text recognition (HTR) of historical sources, typically tabular records, such as censuses. This dataset is based on a sample of 83 pages ...

Zenodo, 2023

From scattered sources to comprehensive technology landscape : A recommendation-based retrieval approach

Karl Aberer, Chi Thang Duong

Mapping the technology landscape is crucial for market actors to take informed investment decisions. However, given the large amount of data on the Web and its subsequent information overload, manually retrieving information is a seemingly ineffective and ...

Elsevier, 2023
