Multimodal learning, in the context of machine learning, is a type of deep learning that uses a combination of data modalities, as often arise in real-world applications. An example of multimodal data is data that combines text (typically represented as a feature vector) with imaging data consisting of pixel intensities and annotation tags. Because these modalities have fundamentally different statistical properties, combining them is non-trivial, and specialized modelling strategies and algorithms are required.
Many models and algorithms have been implemented to retrieve and classify a single type of data, e.g. images (such as pictures) or text (such as messages). However, data usually come in different modalities, each of which carries different information. A modality here is a type or channel of data, such as text, images, or audio. For example, it is very common to caption an image to convey information not present in the image itself. Similarly, it is sometimes more straightforward to use an image to convey information that may not be obvious from text. As a result, if different words appear with similar images, these words likely describe the same thing. Conversely, if the same word is used to describe seemingly dissimilar images, these images may represent the same object.
Thus, when dealing with multimodal data, it is important to use a model that can jointly represent the information, so that it captures the correlation structure between the modalities. Moreover, the model should be able to recover missing modalities given observed ones (e.g. predicting a possible image object from a text description). The Multimodal Deep Boltzmann Machine model satisfies both requirements.
A Boltzmann machine is a type of stochastic neural network invented by Geoffrey Hinton and Terry Sejnowski in 1985.
Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield nets.
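The stochastic update that separates a Boltzmann machine from a Hopfield net can be illustrated with a minimal sketch. It assumes binary units taking values in {0, 1}, symmetric weights with a zero diagonal, and a temperature parameter; the function names (energy, gibbs_sweep) and the three-unit network are hypothetical and chosen only for illustration, not taken from any particular library or paper.

```python
import math
import random


def energy(state, weights, biases):
    """Energy of a binary state: E(s) = -sum_{i<j} w_ij s_i s_j - sum_i b_i s_i."""
    n = len(state)
    pairwise = sum(
        weights[i][j] * state[i] * state[j]
        for i in range(n)
        for j in range(i + 1, n)
    )
    bias_term = sum(biases[i] * state[i] for i in range(n))
    return -pairwise - bias_term


def gibbs_sweep(state, weights, biases, rng, temperature=1.0):
    """Update every unit once, in place, sampling each from its conditional."""
    n = len(state)
    for i in range(n):
        # Net input to unit i from all other units (no self-connection).
        net = biases[i] + sum(weights[i][j] * state[j] for j in range(n) if j != i)
        # Stochastic rule: the unit turns on with probability sigmoid(net / T).
        p_on = 1.0 / (1.0 + math.exp(-net / temperature))
        state[i] = 1 if rng.random() < p_on else 0
    return state


# Hypothetical 3-unit network: symmetric weights, zero diagonal.
weights = [
    [0.0, 2.0, -1.0],
    [2.0, 0.0, 0.5],
    [-1.0, 0.5, 0.0],
]
biases = [0.1, -0.2, 0.0]

rng = random.Random(0)  # fixed seed so runs are repeatable
state = [rng.randint(0, 1) for _ in range(3)]
for _ in range(100):
    gibbs_sweep(state, weights, biases, rng)
```

A Hopfield-style update would replace the sampling step with a deterministic threshold (turn the unit on whenever the net input is positive). Sampling from the sigmoid instead lets the network visit states in proportion to their Boltzmann probability, with lower-energy states being more likely.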