The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model has also been used for computer vision.

The bag-of-words model is commonly used in methods of document classification, where the occurrence of each word, or its frequency, is used as a feature for training a classifier. An early reference to "bag of words" in a linguistic context can be found in Zellig Harris's 1954 article "Distributional Structure". The bag-of-words model is one example of a vector space model.

The following example models text documents using bag-of-words. Here are two simple text documents:

(1) John likes to watch movies. Mary likes movies too.
(2) Mary also likes to watch football games.

Based on these two text documents, a list of words is constructed for each document:

"John","likes","to","watch","movies","Mary","likes","movies","too"
"Mary","also","likes","to","watch","football","games"

Representing each bag-of-words as a JSON object and assigning it to a JavaScript variable:

BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
BoW2 = {"Mary":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};

Each key is a word, and each value is the number of occurrences of that word in the given text document. The order of elements is free, so, for example, {"too":1,"Mary":1,"movies":2,"John":1,"watch":1,"likes":2,"to":1} is equivalent to BoW1. This is also what we expect from a strict JSON object representation, in which the members of an object are unordered.

Note: if another document is like a union of these two,

(3) John likes to watch movies. Mary likes movies too. Mary also likes to watch football games.

its JavaScript representation will be:

BoW3 = {"John":1,"likes":3,"to":2,"watch":2,"movies":2,"Mary":2,"too":1,"also":1,"football":1,"games":1};

So, as we see in the bag algebra, the "union" of two documents in the bag-of-words representation is, formally, the disjoint union, summing the multiplicities of each element.
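
The construction above can be sketched in a few lines of JavaScript. This is a minimal illustration and not a reference implementation: the function names (tokenize, bagOfWords, bagSum, toVector) are hypothetical, and the tokenizer simply assumes that a word is a run of letters, that case is preserved, and that punctuation is dropped. Counting is done in a Map so that words such as "constructor" do not collide with properties inherited by plain objects.

```javascript
// Assumption: a "word" is a maximal run of Unicode letters; case is kept.
function tokenize(text) {
  return text.match(/\p{L}+/gu) || [];
}

// Build a bag-of-words: each key is a word, each value its number of occurrences.
function bagOfWords(text) {
  const counts = new Map();
  for (const word of tokenize(text)) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return Object.fromEntries(counts);
}

// "Union" of two bags in the sense used above: multiplicities are summed.
function bagSum(a, b) {
  const counts = new Map(Object.entries(a));
  for (const [word, n] of Object.entries(b)) {
    counts.set(word, (counts.get(word) || 0) + n);
  }
  return Object.fromEntries(counts);
}

// Turn a bag into a count vector over a fixed vocabulary (one entry per word),
// which is the form a classifier would typically consume as features.
function toVector(bag, vocabulary) {
  return vocabulary.map((word) =>
    Object.prototype.hasOwnProperty.call(bag, word) ? bag[word] : 0
  );
}

const doc1 = "John likes to watch movies. Mary likes movies too.";
const doc2 = "Mary also likes to watch football games.";

const BoW1 = bagOfWords(doc1);
const BoW2 = bagOfWords(doc2);
const BoW3 = bagSum(BoW1, BoW2); // same counts as bagOfWords(doc1 + " " + doc2)

// Use the words of the combined document as the shared vocabulary.
const vocabulary = Object.keys(BoW3);
const vec1 = toVector(BoW1, vocabulary);
const vec2 = toVector(BoW2, vocabulary);

console.log(BoW3);
console.log(vocabulary, vec1, vec2);
```

Summing the two bags gives the same counts as building a bag from the concatenated text, which is the property the note on document (3) describes. The vector step fixes an order for the vocabulary only so that documents can be compared position by position; the bags themselves remain unordered.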