Artificial Intelligence to advance machine vision
By: Jobin Wilson, Muhammad Arif
Flytxt Data Science R&D Team
The human brain is capable of amazing feats, such as understanding a scene from a single visual frame. It takes only a few tens of milliseconds for the brain to recognize the category of an object or environment. Further, humans can learn and remember a diverse set of places and patterns, and solve complex problems such as planning and navigation that involve vision, perception and cognition. The neural architecture of the human brain has inspired researchers to simulate such abilities on machines, using artificial intelligence to solve challenging problems. Consequently, deep learning has emerged as a powerful tool for problems involving machine vision and perception. Through artificial intelligence, machines have come closer to human ability in several cognitive tasks, such as identifying and recognizing objects and environments.
Image classification is one of the hallmark tasks of computer vision. It allows a context to be defined for object recognition, which has diverse applications. The classical problem in computer vision is that of determining whether or not image data contains some specific object, feature, or activity of interest.
The Data Science R&D team at Flytxt has released an end-to-end scene recognition pipeline consisting of feature extraction, encoding, pooling and classification. The primary objective of this work is to clearly outline a practical implementation of a basic scene recognition pipeline with reasonable accuracy, built in Python with open-source libraries and using conventional computer vision techniques rather than deep learning.
Scene recognition approach with local and global descriptors
The approach used by the Flytxt R&D team combines global and local feature descriptors from each image into a hybrid feature descriptor for that image. DAISY features associated with key points within the image serve as the local feature descriptors (similar to SIFT features), and the histogram of oriented gradients (HOG) computed over the entire image serves as the global descriptor.
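To make the two descriptor types concrete, here is a minimal sketch of extracting both from one grayscale image. It assumes scikit-image as the open-source library, samples DAISY on a regular grid (each grid point standing in for a key point), and uses illustrative parameter values; none of these choices are confirmed as the ones used in the Flytxt pipeline. The helper name `extract_descriptors` is hypothetical.

```python
import numpy as np
from skimage.feature import daisy, hog


def extract_descriptors(image):
    """Return (local DAISY descriptors, global HOG descriptor) for a 2-D grayscale image.

    Hypothetical helper; all parameter values below are illustrative assumptions.
    """
    # DAISY is sampled on a regular grid; each grid point acts as a key point.
    local = daisy(image, step=8, radius=8, rings=2, histograms=4, orientations=8)
    # Flatten the grid into a list of descriptors: (n_keypoints, descriptor_dim).
    local = local.reshape(-1, local.shape[-1])

    # A single HOG vector summarises gradient orientations over the whole image.
    global_desc = hog(image, orientations=9, pixels_per_cell=(16, 16),
                      cells_per_block=(2, 2), feature_vector=True)
    return local, global_desc


# Synthetic stand-in for a grayscale scene image.
rng = np.random.default_rng(0)
image = rng.random((64, 64))

daisy_descs, hog_desc = extract_descriptors(image)
print("DAISY descriptors:", daisy_descs.shape)
print("HOG descriptor:", hog_desc.shape)
```

The local output is a variable-length set of vectors per image, while the global output is a single fixed-length vector. That difference is why the local descriptors need an encoding step before classification.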
As images vary in viewpoint, scale, orientation, illumination and the degree of occlusion of objects, extracting robust features (such as DAISY, SIFT or HOG) to represent images is critical for building an effective image classification model. Each image yields multiple DAISY descriptors, and because the number of key points varies across images, so does the number of descriptors. We use the bag-of-visual-words concept to encode each image as a histogram of dimensionality K (where K is the vocabulary size, i.e. the number of possible “visual words”). Clustering is used to group DAISY features to form the “visual words” for encoding. Since the training data could contain many images, the total number of DAISY descriptors could be very large (in the millions). We use the Mini-Batch K-Means algorithm to reduce the cost of clustering and make encoding fast. The histogram corresponding to each image is augmented with its HOG descriptor using a pooling procedure, to generate the final feature vector for that image. The associated class label (e.g. living room, store) is already available, since the training dataset is pre-labelled.
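The encoding step can be sketched as follows, using scikit-learn's `MiniBatchKMeans` on synthetic descriptors. The vocabulary size, descriptor dimensions and helper names (`build_vocabulary`, `encode_image`) are assumptions made for illustration, and the pooling step is approximated here by simple concatenation of the normalised histogram with the global descriptor, which may differ from the procedure in the actual implementation.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

K = 8            # vocabulary size (illustrative; real vocabularies are much larger)
LOCAL_DIM = 16   # stand-in for the DAISY descriptor length
GLOBAL_DIM = 12  # stand-in for the HOG descriptor length


def build_vocabulary(descriptor_sets, k=K, seed=0):
    """Cluster local descriptors from all training images into k visual words."""
    stacked = np.vstack(descriptor_sets)
    model = MiniBatchKMeans(n_clusters=k, batch_size=256, n_init=3, random_state=seed)
    return model.fit(stacked)


def encode_image(local_descs, global_desc, vocabulary):
    """Encode one image as a visual-word histogram joined with its global descriptor."""
    words = vocabulary.predict(local_descs)  # nearest visual word per descriptor
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    hist /= hist.sum()  # normalise so differing key-point counts are comparable
    # Assumption: "pooling" is approximated by plain concatenation.
    return np.concatenate([hist, global_desc])


# Synthetic training set: each image has a different number of local descriptors.
rng = np.random.default_rng(0)
local_sets = [rng.random((n, LOCAL_DIM)) for n in (30, 45, 60, 50, 70)]
global_descs = [rng.random(GLOBAL_DIM) for _ in local_sets]

vocabulary = build_vocabulary(local_sets)
feature = encode_image(local_sets[0], global_descs[0], vocabulary)
print("final feature vector length:", feature.shape[0])
```

Whatever the number of key points in an image, the resulting vector always has K plus the global descriptor length entries, which is what lets a standard classifier consume it.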
A multi-class SVM (each class corresponds to a scene category such as living room or store) is trained and cross-validated to assess model quality on the fifteen scene categories dataset. The average accuracy of the model was 76.4% for a 40%–60% random split of images into training and testing datasets respectively.
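The training and evaluation step might look like the sketch below, which uses scikit-learn on synthetic feature vectors rather than the fifteen scene categories dataset. The linear kernel, the value of `C`, the 3-fold cross-validation and the stratified split are all assumptions for illustration, and the accuracy this toy data produces says nothing about the 76.4% figure reported above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the hybrid feature vectors: three well-separated classes.
rng = np.random.default_rng(0)
n_classes, per_class, dim = 3, 40, 20
centers = rng.normal(scale=5.0, size=(n_classes, dim))
X = np.vstack([c + rng.normal(size=(per_class, dim)) for c in centers])
y = np.repeat(np.arange(n_classes), per_class)

# 40% of the images for training, the remaining 60% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.4, stratify=y, random_state=0
)

# SVC handles the multi-class case internally; kernel and C are assumed values.
clf = SVC(kernel="linear", C=1.0)

# Cross-validate on the training portion only, then fit and score on held-out data.
cv_scores = cross_val_score(clf, X_train, y_train, cv=3)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

print("cross-validation accuracy per fold:", cv_scores)
print("held-out test accuracy:", test_accuracy)
```

Repeating the split with different `random_state` values and averaging the test accuracy is one way to obtain an average figure over random splits.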
A detailed description of the approach is available here. A full implementation of the proposed model is also available here.