Concept

Information extraction

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, video and documents, can also be seen as information extraction.

Due to the difficulty of the problem, current approaches to IE (as of 2010) focus on narrowly restricted domains. An example is the extraction of corporate mergers from newswire reports, expressed as a formal relation, from an online news sentence such as: "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."

A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow automated reasoning about the logical form of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context.

Information extraction is part of a larger puzzle that deals with the problem of devising automatic methods for text management, beyond its transmission, storage and display. The discipline of information retrieval (IR) has developed automatic methods, typically of a statistical flavor, for indexing large document collections and classifying documents. A complementary approach is that of natural language processing (NLP), which has solved the problem of modelling human language processing with considerable success when taking into account the magnitude of the task. In terms of both difficulty and emphasis, IE deals with tasks in between IR and NLP. In terms of input, IE assumes the existence of a set of documents in which each document follows a template, i.e. describes one or more entities or events in a manner that is similar to those in other documents but differing in the details.
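As a rough illustration of the narrow-domain, template-driven extraction described above, the following sketch turns the example news sentence into a structured record. It is a minimal pattern-based sketch, not a description of how any particular IE system works: the `Acquisition` record, the `COMPANY` pattern, and the list of corporate suffixes are all assumptions made for this example.

```python
import re
from typing import NamedTuple, Optional


class Acquisition(NamedTuple):
    """Hypothetical structured record for one acquisition event."""
    acquirer: str
    target: str


# Assumed heuristic: a company name is one or more capitalized words
# followed by a common corporate suffix such as "Inc." or "Corp.".
COMPANY = r"(?:[A-Z][\w&-]*\s)+?(?:Inc|Corp|Ltd|LLC)\.?"

# Template for sentences of the form
# "<company> announced its/their acquisition of <company>".
ACQUISITION_PATTERN = re.compile(
    rf"(?P<acquirer>{COMPANY})\s+announced\s+(?:its|their)\s+"
    rf"acquisition\s+of\s+(?P<target>{COMPANY})"
)


def extract_acquisition(sentence: str) -> Optional[Acquisition]:
    """Return an Acquisition if the sentence fits the template, else None."""
    match = ACQUISITION_PATTERN.search(sentence)
    if match is None:
        return None
    return Acquisition(match.group("acquirer"), match.group("target"))


sentence = (
    "Yesterday, New York based Foo Inc. announced their "
    "acquisition of Bar Corp."
)
print(extract_acquisition(sentence))
```

A single hand-written template like this only covers one phrasing, which is one reason such approaches stay restricted to narrow domains.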

Official source

https://en.wikipedia.org/wiki/Information_extraction

About this result

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.

Related courses (10)

CS-423: Distributed information systems

This course introduces the foundations of information retrieval, data mining and knowledge bases, which constitute the foundations of today's Web-based distributed information systems.

HUM-369: Digital humanities

Digital Humanities is a discipline at the crossroads of information science and the humanities and social sciences. In this course, students discover this field of research

ENV-540: Image processing for Earth observation

This course covers optical remote sensing from satellites and airborne platforms. The different systems are presented. The students will acquire skills in image processing and machine/deep learning to

Related publications (28)

An Ordinal Latent Variable Model of Conflict Intensity

Robert West

Measuring the intensity of events is crucial for monitoring and tracking armed conflict. Advances in automated event extraction have yielded massive data sets of "who did what to whom" micro-records that enable data-driven approaches to monitoring confl ...

Assoc Computational Linguistics-Acl, 2023

Lausanne Historical Censuses Dataset HTR 35k

Lucas Arnaud André Rappo, Rémi Guillaume Petitpierre, Marion Kramer

This training dataset includes a total of 34,913 manually transcribed text segments. It is dedicated to the handwritten text recognition (HTR) of historical sources, typically tabular records, such as censuses. This dataset is based on a sample of 83 pages ...

Zenodo, 2023

From scattered sources to comprehensive technology landscape : A recommendation-based retrieval approach

Karl Aberer, Chi Thang Duong

Mapping the technology landscape is crucial for market actors to take informed investment decisions. However, given the large amount of data on the Web and its subsequent information overload, manually retrieving information is a seemingly ineffective and ...

Elsevier, 2023

Related people (1)

Daniel Gatica-Perez

Related units (2)

DHI - Administration

L'IDIAP Laboratory

Related lectures (32)

Entity & Information Extraction

Explores knowledge extraction from text, covering key concepts like keyphrase extraction and named entity recognition.

Information Extraction: Algorithms and Techniques

Explores algorithms and techniques for information extraction, including Viterbi algorithm, named entities recognition, and distant supervision.

Semantic Web & Information Extraction

Explores Semantic Web, ontologies, information extraction, key phrases, named entities, and knowledge bases.

Related concepts (15)

Named-entity recognition

Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. Most research on NER/NEE systems has been structured as taking an unannotated block of text, such as this one: Jim bought 300 shares of Acme Corp.
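To make the locate-and-classify idea concrete, here is a minimal rule-based sketch applied to the sentence quoted above. It is only a toy under stated assumptions: the `PERSON_GAZETTEER` set, the two regular expressions, and the label names are hypothetical choices for this example, and practical NER systems are generally built on statistical or learned models rather than a handful of rules.

```python
import re
from typing import List, Tuple

# Hypothetical toy gazetteer of known person names.
PERSON_GAZETTEER = {"Jim"}

# Assumed heuristic: capitalized words followed by a corporate suffix.
ORGANIZATION = re.compile(r"(?:[A-Z][a-z]+\s)+(?:Corp|Inc|Ltd)\.?")
# Integers or decimals standing alone as a token.
QUANTITY = re.compile(r"\b\d+(?:\.\d+)?\b")
# Any capitalized word; only kept if it appears in the gazetteer.
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")


def tag_entities(text: str) -> List[Tuple[str, str]]:
    """Locate spans in the text and classify each into a category."""
    found = []  # (start offset, span text, label)
    for m in ORGANIZATION.finditer(text):
        found.append((m.start(), m.group(), "ORGANIZATION"))
    for m in QUANTITY.finditer(text):
        found.append((m.start(), m.group(), "QUANTITY"))
    for m in CAPITALIZED.finditer(text):
        if m.group() in PERSON_GAZETTEER:
            found.append((m.start(), m.group(), "PERSON"))
    found.sort()  # report entities in order of appearance
    return [(span, label) for _, span, label in found]


for span, label in tag_entities("Jim bought 300 shares of Acme Corp."):
    print(f"{span!r}: {label}")
```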

Text mining

Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al.

Unstructured data

Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.