Summary
Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words. Document retrieval is sometimes referred to as, or as a branch of, text retrieval. Text retrieval is a branch of information retrieval where the information is stored primarily in the form of text. Text databases became decentralized thanks to the personal computer. Text retrieval is a critical area of study today, since it is the fundamental basis of all internet search engines. Document retrieval systems find information to given criteria by matching text records (documents) against user queries, as opposed to expert systems that answer questions by inferring over a logical knowledge database. A document retrieval system consists of a database of documents, a classification algorithm to build a full text index, and a user interface to access the database. A document retrieval system has two main tasks: Find relevant documents to user queries Evaluate the matching results and sort them according to relevance, using algorithms such as PageRank. Internet search engines are classical applications of document retrieval. The vast majority of retrieval systems currently in use range from simple Boolean systems through to systems using statistical or natural language processing techniques. There are two main classes of indexing schemata for document retrieval systems: form based (or word based), and content based indexing. The document classification scheme (or indexing algorithm) in use determines the nature of the document retrieval system. Form based document retrieval addresses the exact syntactic properties of a text, comparable to substring matching in string searches. The text is generally unstructured and not necessarily in a natural language, the system could for example be used to process large sets of chemical representations in molecular biology.
About this result
This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.
Related courses (4)
CS-423: Distributed information systems
This course introduces the foundations of information retrieval, data mining and knowledge bases, which constitute the foundations of today's Web-based distributed information systems.
EE-724: Human language technology: applications to information access
The Human Language Technology (HLT) course introduces methods and applications for language processing and generation, using statistical learning and neural networks.
CS-431: Introduction to natural language processing
The objective of this course is to present the main models, formalisms and algorithms necessary for the development of applications in the field of natural language information processing. The concept
Show more
Related lectures (39)
Latent Dirichlet Allocation
Covers Latent Dirichlet Allocation, a state-of-the-art method for concept extraction using a probabilistic generative model.
Projects Presentation & Logistics
Covers the presentation of 4 projects in the course and related logistics.
Vector Space Semantics (and Information Retrieval)
Explores the Vector Space model, Bag of Words, tf-idf, cosine similarity, Okapi BM25, and Precision and Recall in Information Retrieval.
Show more
Related publications (89)