A very complex task in computer vision projects is finding the similarity between pictures. A naive approach, such as taking the mean difference at each pixel, really just measures the variation in color usage between pictures rather than any real semantic characteristics of the pictures.
Auto-encoders work through an encoder and a decoder network. The encoder takes an input picture and maps it through one or more hidden layers until eventually reaching a hidden-layer representation with lower dimensionality than the original picture, such as 2 or 5 neurons. The decoder then maps the activations of the neurons in this low-dimensional layer back to the original picture. The difference between the original picture and the picture reconstructed by the encoder → decoder pass is known as the reconstruction loss, and it is used to train the network.
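The idea above can be sketched with a tiny linear auto-encoder in plain NumPy (a toy sketch only: the data, layer sizes, and learning rate are made up for illustration, and a real network would use non-linearities and a deep-learning framework):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 "pictures" of 64 pixels each, already flattened.
X = rng.normal(size=(100, 64))

# Encoder maps 64 pixels -> 2 bottleneck neurons; decoder maps 2 -> 64.
W_enc = rng.normal(scale=0.1, size=(64, 2))
W_dec = rng.normal(scale=0.1, size=(2, 64))

def reconstruction_loss(X, W_enc, W_dec):
    Z = X @ W_enc          # low-dimensional codes (the bottleneck)
    X_hat = Z @ W_dec      # reconstruction from the bottleneck
    return np.mean((X - X_hat) ** 2)

loss_before = reconstruction_loss(X, W_enc, W_dec)

# A few steps of plain gradient descent on the reconstruction loss.
lr = 0.01
for _ in range(500):
    Z = X @ W_enc
    err = Z @ W_dec - X                      # reconstruction error
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

loss_after = reconstruction_loss(X, W_enc, W_dec)
```

Training pushes the reconstruction loss down, which forces the 2-neuron bottleneck to keep the most useful information about each picture.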
How can this be used for picture similarity?
We can take two pictures, pass them through the auto-encoder, and inspect the activations in the low-dimensional representation to determine the similarity between the pictures.
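Concretely, similarity can be measured as distance between bottleneck activations. A minimal sketch, assuming a hypothetical trained linear encoder (the weights and picture sizes here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical trained encoder weights: 64 pixels -> 2 latent dimensions.
W_enc = rng.normal(size=(64, 2))

def encode(picture):
    # Bottleneck activations for one flattened picture.
    return picture @ W_enc

def latent_distance(a, b):
    # Euclidean distance in the low-dimensional representation.
    return float(np.linalg.norm(encode(a) - encode(b)))

base = rng.normal(size=64)
similar = base + rng.normal(scale=0.01, size=64)   # slightly perturbed copy
different = rng.normal(size=64)                    # unrelated picture
```

A near-duplicate of a picture should sit much closer to it in latent space than an unrelated picture does, which is what makes this usable as a similarity metric.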
Visualizing these activations is easy when the representations are 2- or 3-dimensional, but with higher dimensions we need dimensionality-reduction techniques for visualization, such as t-SNE. t-SNE creates a lower-dimensional mapping that preserves the neighborhood distances from the higher-dimensional space. We can use this to derive a visualization of the auto-encoded feature space and thus the visual differences between pictures.
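A short sketch of that step using scikit-learn's t-SNE implementation (the bottleneck codes here are synthetic stand-ins for real encoder outputs, and the perplexity value is an arbitrary choice):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)

# Pretend these are 8-dimensional bottleneck codes for 60 pictures,
# drawn from two loose groups of visually similar pictures.
codes = np.vstack([
    rng.normal(loc=0.0, size=(30, 8)),
    rng.normal(loc=5.0, size=(30, 8)),
])

# t-SNE maps the 8-D codes down to 2-D while trying to preserve
# which points are neighbors in the original space.
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(codes)
```

The resulting 2-D `embedding` can be scattered-plotted directly; pictures that the auto-encoder considers similar should land near each other.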
We can enhance the functionality of the auto-encoder for images by adding convolution layers in the encoder and de-convolution layers in the decoder. The de-convolution layers can be thought of as working similarly to the generator network in a DCGAN (Deep Convolutional Generative Adversarial Network).
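A minimal convolutional auto-encoder sketch, written in PyTorch as an assumption (the same structure works in any framework; the channel counts and 28×28 input size are illustrative choices):

```python
import torch
from torch import nn

# Encoder: convolution layers downsample the picture into a compact code.
encoder = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
    nn.ReLU(),
)

# Decoder: de-convolution (transposed convolution) layers upsample back,
# much like the generator in a DCGAN.
decoder = nn.Sequential(
    nn.ConvTranspose2d(16, 8, kernel_size=3, stride=2,
                       padding=1, output_padding=1),        # 7x7 -> 14x14
    nn.ReLU(),
    nn.ConvTranspose2d(8, 1, kernel_size=3, stride=2,
                       padding=1, output_padding=1),        # 14x14 -> 28x28
    nn.Sigmoid(),
)

x = torch.rand(4, 1, 28, 28)             # a batch of 4 grayscale "pictures"
recon = decoder(encoder(x))              # reconstruction
loss = nn.functional.mse_loss(recon, x)  # reconstruction loss
```

The decoder mirrors the encoder so the reconstruction comes out at the original resolution, and the same reconstruction loss drives training.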
Let's apply this to visual similarity
There are many interesting applications of image similarity metrics, such as picture recommendations and search. However, one very interesting application is in clustering and labeling unlabeled data. Many deep learning projects have massive amounts of unlabeled data and difficulty figuring out how to label it. Auto-encoder based clustering is a great solution to this problem.
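A sketch of that clustering step, assuming we already have bottleneck codes for a batch of unlabeled pictures (the codes below are synthetic, and the choice of k-means with two clusters is just one reasonable option):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Hypothetical 2-D bottleneck codes for 100 unlabeled pictures,
# which happen to form two natural groups.
codes = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=4.0, scale=0.3, size=(50, 2)),
])

# Cluster the latent space; each cluster id becomes a provisional label
# that a human can then review and name.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(codes)
pseudo_labels = kmeans.labels_
```

Because the auto-encoder has already compressed each picture to its most informative features, clustering in latent space groups visually similar pictures together, giving you cheap candidate labels for otherwise unlabeled data.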
If you have any suggestions, leave a comment.