In information retrieval, tf–idf (also TF*IDF, TFIDF, TF–IDF, or Tf–idf), short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval searches, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.

tf–idf has been one of the most popular term-weighting schemes. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can also be used successfully for stop-word filtering in various subject fields, including text summarization and classification.

One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model. Suppose we have a set of English text documents and wish to rank them by which document is more relevant to the query "the brown cow". A simple way to start is to eliminate documents that do not contain all three words "the", "brown", and "cow", but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document; the number of times a term occurs in a document is called its term frequency. However, when document lengths vary greatly, adjustments are often made, such as normalizing the raw count by the length of the document.

The first form of term weighting is due to Hans Peter Luhn (1957), which may be summarized as: the weight of a term that occurs in a document is simply proportional to the term frequency.
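To make the ranking function above concrete, the following Python sketch scores a small toy corpus against the query "the brown cow" by summing tf–idf over the query terms. The corpus, the whitespace tokenizer, and the particular tf–idf variant (length-normalized term frequency, logarithmic inverse document frequency) are illustrative assumptions, not the only possible definition.

import math

# Toy corpus and query; both are illustrative assumptions.
documents = [
    "the quick brown fox jumps over the lazy dog",
    "the brown cow grazes in the brown field",
    "a cow and a dog walk past the barn",
]
query = ["the", "brown", "cow"]

tokenized = [doc.split() for doc in documents]
N = len(tokenized)

def tf(term, doc_tokens):
    # Term frequency, normalized by document length to adjust for varying lengths.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    # Inverse document frequency: log of corpus size over the number of documents containing the term.
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(N / df) if df else 0.0

def score(doc_tokens):
    # Simple ranking function: the sum of tf-idf over the query terms.
    return sum(tf(t, doc_tokens) * idf(t) for t in query)

for i in sorted(range(N), key=lambda d: score(tokenized[d]), reverse=True):
    print(f"{score(tokenized[i]):.4f}  {documents[i]}")

Because "the" occurs in every toy document, its idf is zero and it contributes nothing to the scores, which is exactly the behaviour tf–idf is designed to produce for very common words.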

Related courses (5)
CS-423: Distributed information systems
This course introduces the foundations of information retrieval, data mining and knowledge bases, which constitute the foundations of today's Web-based distributed information systems.
CS-401: Applied data analysis
This course teaches the basic techniques, methodologies, and practical skills required to draw meaningful insights from a variety of data, with the help of the most acclaimed software tools in the data ...
EE-724: Human language technology: applications to information access
The Human Language Technology (HLT) course introduces methods and applications for language processing and generation, using statistical learning and neural networks.
Related lectures (41)
Vector Space Retrieval Exercise
Covers TF-IDF computation, document vectors, cosine similarity, and precision formulas.
Handling Text: Document Retrieval, Classification, Sentiment Analysis
Explores document retrieval, classification, sentiment analysis, TF-IDF matrices, nearest-neighbor methods, matrix factorization, regularization, LDA, contextualized word vectors, and BERT.
Information Retrieval Basics
Introduces the basics of information retrieval, covering text-based and Boolean retrieval, vector space retrieval, and similarity computation.
Related publications (30)

Crosslingual Document Embedding as Reduced-Rank Ridge Regression

Martin Jaggi, Robert West, Martin Josifoski, Ivan Paskov

There has recently been much interest in extending vector-based word representations to multiple languages, such that words can be compared across languages. In this paper, we shift the focus from words to documents and introduce a method for embedding doc ...
2019

New Multi-Keyword Ciphertext Search Method for Sensor Network Cloud Platforms

Jiyong Zhang, Yue Wang, Hongyu Yang

This paper proposed a multi-keyword ciphertext search, based on an improved-quality hierarchical clustering (MCS-IQHC) method. MCS-IQHC is a novel technique, which is tailored to work with encrypted data. It has improved search accuracy and can self-adapt ...
MDPI, 2018

Using Networks to Visualize Publications

Dario Rodighiero

Retrieval systems are often shaped as lists organized in pages. However, the majority of users look at the first page, ignoring the other ones. This presentation concerns an alternative way to present the results of a query using network visualizations. ...
2018
Related concepts (4)
Latent semantic analysis
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis).
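As a rough sketch of the idea, LSA is commonly implemented as a truncated singular value decomposition of a term–document matrix (often tf–idf weighted). The tiny matrix, the term list, and the use of raw counts below are illustrative assumptions; the example only shows how documents and terms end up in a shared low-dimensional "concept" space.

import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents); values are made up.
# In practice the matrix is usually tf-idf weighted before the decomposition.
terms = ["cow", "brown", "milk", "engine", "fuel"]
X = np.array([
    [2.0, 1.0, 0.0],  # cow
    [1.0, 2.0, 0.0],  # brown
    [1.0, 1.0, 0.0],  # milk
    [0.0, 0.0, 2.0],  # engine
    [0.0, 0.0, 1.0],  # fuel
])

# Truncated SVD: keep only the k strongest singular values ("concepts").
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
term_concepts = U[:, :k] * s[:k]            # each term as a point in concept space
doc_concepts = (np.diag(s[:k]) @ Vt[:k]).T  # each document as a point in the same space

print(np.round(term_concepts, 2))
print(np.round(doc_concepts, 2))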
Automatic summarization
Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data. Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.
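One very simple extractive approach, sketched below, treats each sentence as a miniature document, scores it by the average tf–idf of its words, and keeps the highest-scoring sentences. The sample text, the period-based sentence splitting, and the scoring rule are all illustrative assumptions; production summarizers are considerably more sophisticated.

import math

# Hypothetical input text; splitting on "." is a deliberate simplification.
text = ("Tf-idf weights words by how distinctive they are. "
        "Common words such as the and a receive low weights. "
        "Distinctive words receive high weights. "
        "A simple extractive summarizer keeps the highest scoring sentences.")
sentences = [s.strip() for s in text.split(".") if s.strip()]
tokenized = [s.lower().split() for s in sentences]
N = len(tokenized)

def idf(term):
    # Sentences play the role of documents here.
    df = sum(1 for sent in tokenized if term in sent)
    return math.log(N / df)

def sentence_score(tokens):
    # Average idf over the sentence's tokens, which equals the sum of
    # (length-normalized term frequency) * idf over the sentence's terms.
    return sum(idf(t) for t in tokens) / len(tokens)

# Keep the two highest scoring sentences, restored to their original order.
top = sorted(sorted(range(N), key=lambda i: sentence_score(tokenized[i]), reverse=True)[:2])
print(". ".join(sentences[i] for i in top) + ".")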
Document classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science.
