Identifying dog breeds using Keras

Modern deep learning architectures are becoming increasingly effective in various fields of artificial intelligence. One of these fields is image classification. In this post, we're going to see if we can achieve an accurate classification of images by applying out-of-the-box ImageNet pre-trained deep models using the Keras library.

Data set

The dataset is taken from the Dog Breed Identification competition hosted on Kaggle, a platform for data science and machine learning competitions. It contains approximately 10,000 labeled images, each depicting a dog from one of 120 breeds, and about the same amount of testing (i.e. unlabeled) data. The images have different resolutions and zoom levels, may show more than one dog, and were taken in various lighting conditions. Below are a few examples from the dataset.

The dog breeds represented in the dataset are more or less balanced (i.e. each of them has a comparable number of observations), with about 85 samples per breed on average (roughly 10,200 labeled images spread over 120 breeds). Below is the distribution of dog breeds in the dataset.

As our brief analysis shows, the analyzed dataset is not too sophisticated for modern deep learning architectures and has quite a simple structure. Therefore, we can expect good results and high accuracy for all breeds represented in the dataset.

Bottleneck features 

One of the most straightforward strategies to apply pre-trained deep learning models is to use them as feature extractors. Before modern neural architectures were developed, image features were extracted using manually crafted filters, like the Sobel operator used in Canny edge detectors. The operator is a three-by-three matrix which "slides" over a source image and converts each square patch of adjacent pixels into a single value, as shown below.

Figure 3. Schematic representation of the Sobel operator applied to a single image channel (source)
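As a minimal sketch of this sliding-window idea (plain Python, no libraries), the horizontal Sobel kernel can be applied to one image channel as follows. The function name `apply_kernel` and the toy image are illustrative and not taken from the original analysis.

```python
# Horizontal Sobel kernel: responds to left-to-right intensity changes.
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]


def apply_kernel(channel, kernel):
    """Slide a 3x3 kernel over a 2D list of pixel values.

    Each output value is the weighted sum of one 3x3 patch (no kernel
    flipping, no padding), so the result is two rows and two columns
    smaller than the input.
    """
    height, width = len(channel), len(channel[0])
    output = []
    for row in range(height - 2):
        output_row = []
        for col in range(width - 2):
            total = 0
            for i in range(3):
                for j in range(3):
                    total += kernel[i][j] * channel[row + i][col + j]
            output_row.append(total)
        output.append(output_row)
    return output


# Toy 4x5 channel: dark on the left, bright on the right (a vertical edge).
image = [
    [0, 0, 10, 10, 10],
    [0, 0, 10, 10, 10],
    [0, 0, 10, 10, 10],
    [0, 0, 10, 10, 10],
]
edges = apply_kernel(image, SOBEL_X)
print(edges)  # [[40, 40, 0], [40, 40, 0]]
```

The response is large where the patch straddles the edge and zero where the patch is uniformly bright.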

Nowadays it is possible to derive features automatically, from the data itself. For this purpose, we can use convolutional layers. Each convolution layer is a stack of square matrices with randomly initialized values. These values are updated during the training process and eventually converge to specific filters well tuned to the dataset you have.

The image below shows a schematic representation of a deep learning classifier as a sequence of feature blocks (i.e. layers), transforming an image from "raw" pixels into more and more abstract representations.

As the picture above shows, a deep learning network without the top (orange) layers generates a set of high-level features for each image passed through the network. These features are called bottleneck features. The deep model automatically infers these abstract image features, and we can use them with any "classical" machine learning algorithm to predict targets.

Note that all models and feature extractors mentioned in this post were executed on a single 1080Ti GPU. On a CPU, it could take several hours to extract features from relatively large datasets (thousands of images).

Therefore, to extract features from a set of images, one needs to load a deep network with pre-trained weights but without its top layers, and then make "predictions" for these images. The following snippet shows a possible implementation of the FeatureExtractor class, a convenient wrapper that loads a neural network and its weights into memory and uses it to convert images into bottleneck features:

Line 15 creates a Keras model without top layers, but with pre-loaded ImageNet weights. Then, lines 22-25 iterate through all available images and convert them into arrays of features. The arrays are then saved to persistent storage in line 29. Note that we do not load all available pictures into memory at once; instead, we create a generator that reads files from the disk in chunks. Such an approach allows dealing with huge datasets even if you don't have enough RAM on your machine, because images consume a lot of memory when converted from JPEG or PNG formats into numerical arrays.
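A minimal sketch of such a wrapper might look as follows. It is not the author's implementation: it assumes that `build_fn` accepts Keras-style keyword arguments (`include_top`, `weights`, `pooling`) and that `source(folder)` lazily yields batches of images as arrays. `FakeModel`, `fake_build_fn` and `fake_source` are hypothetical stand-ins so that the example runs without Keras or image files.

```python
import numpy as np


class FeatureExtractor:
    """Sketch of a wrapper that converts images into bottleneck features."""

    def __init__(self, build_fn, preprocess_fn, source):
        # Assumption: build_fn takes Keras-style keyword arguments and
        # returns an object with a predict() method.
        self.model = build_fn(include_top=False, weights="imagenet", pooling="avg")
        self.preprocess_fn = preprocess_fn
        # Assumption: source(folder) yields batches of shape
        # (batch, height, width, channels), read chunk by chunk from disk.
        self.source = source

    def __call__(self, folder, output_file=None):
        chunks = []
        for batch in self.source(folder):
            batch = self.preprocess_fn(np.asarray(batch, dtype="float32"))
            chunks.append(self.model.predict(batch))
        features = np.concatenate(chunks)
        if output_file is not None:
            np.save(output_file, features)  # persist features for later training
        return features


# Hypothetical stand-ins for a real network and a real file iterator.
class FakeModel:
    def predict(self, batch):
        return batch.mean(axis=(1, 2))  # mimics global average pooling


def fake_build_fn(include_top, weights, pooling):
    return FakeModel()


def fake_source(folder):
    for _ in range(3):  # three chunks of two 8x8 RGB "images"
        yield np.full((2, 8, 8, 3), 255.0)


extractor = FeatureExtractor(fake_build_fn, lambda x: x / 255.0, fake_source)
features = extractor("unused_folder")
print(features.shape)  # (6, 3)
```

Only one chunk of images is held in memory at a time; the much smaller feature arrays are what gets accumulated and saved.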

Then, one can use the FeatureExtractor class as shown below:

from keras.applications import inception_v3

extractor = FeatureExtractor(
    build_fn=inception_v3.InceptionV3,
    preprocess_fn=inception_v3.preprocess_input,
    source=create_files_iterator_factory())

extractor(folder_name, output_file)

Bootstrapped SGD

Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions. Simply speaking, the algorithm takes arrays of image features and their labels and trains a multi-class classifier to predict dog breeds.

The SGD classifier has one drawback, though: it is not very robust. You can run the training algorithm several times, and each time it may give you a different prediction accuracy, because the algorithm is stochastic: it visits the training samples in a random order and, depending on the implementation, may also start from randomly initialized parameters. The final solution changes depending on how lucky these random choices are, so to speak.

To make the training process more stable and repeatable, and to get the best possible accuracy, we're going to extend SGD with bagging, an approach that trains an ensemble of SGD classifiers and gives a final prediction by averaging the responses of the separate estimators. The image below gives a schematic representation of this.

Figure 5. Applying bagging technique to the SGD classifier

Also note that in our case, as the scheme shows, we're not only splitting the original dataset into subsets but also taking different subsets of features to train each classifier. Another improvement is to apply a variance threshold transformer that filters out bottleneck features whose values are almost always close to zero (and which therefore have near-zero variance), because feature vectors extracted by deep learning networks seem to be quite sparse.

The next snippet shows how to create an SGD ensemble classifier and how to compute prediction metrics reflecting the quality of our model:

Line 4 creates a single instance of the SGD classifier with a couple of regularization parameters and permission to use all available CPUs to train the model. (You can read more about the classifier's configuration in its documentation.) Line 10 creates an ensemble of classifiers. Lines 15-20 train the classifier and compute several performance metrics.
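A rough sketch of such an ensemble with scikit-learn might look as follows. It is not the author's code: the synthetic data stands in for real bottleneck features, and all hyperparameter values (number of estimators, subset sizes, regularization strength, variance threshold) are assumptions chosen for illustration. The `modified_huber` loss is used here only because it lets the classifier output probabilities that the ensemble can average.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for bottleneck features: 40 informative columns plus
# 10 all-zero columns that mimic always-inactive (sparse) features.
X, y = make_blobs(n_samples=600, n_features=40, centers=5, random_state=0)
X = np.hstack([X, np.zeros((X.shape[0], 10))])
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

# A single regularized SGD classifier that may use all available CPUs.
sgd = SGDClassifier(loss="modified_huber", penalty="elasticnet",
                    alpha=1e-4, n_jobs=-1, random_state=0)

# Each ensemble member sees 80% of the samples and 50% of the features.
ensemble = BaggingClassifier(sgd, n_estimators=10, max_samples=0.8,
                             max_features=0.5, random_state=0)

# Drop near-constant features before the ensemble is trained.
model = make_pipeline(VarianceThreshold(threshold=1e-4), ensemble)
model.fit(X_train, y_train)

valid_accuracy = accuracy_score(y_valid, model.predict(X_valid))
kept_features = int(model.named_steps["variancethreshold"].get_support().sum())
print(kept_features, valid_accuracy)
```

Fixing `random_state` makes a single run repeatable, while the bagging itself reduces the run-to-run variance of the individual SGD classifiers.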

SGD benchmark

There are many pre-trained deep learning architectures available. Do they perform equally well when used as feature extractors on our dataset? Let's try a few of them and see. 

The following architectures were chosen to extract features which were used to train SGD classifiers:

  1. InceptionV3
  2. InceptionResNetV2
  3. Xception

All these architectures are available in Keras and are variations of Google's Inception architecture, which has shown good results on ImageNet. Each classifier was trained on 9200 samples and validated on 1022 images. The table below shows the prediction results on the training and validation subsets.

Network            Accuracy (train)  Accuracy (valid)  Loss (train)  Loss (valid)
InceptionV3        94.02%            88.55%            0.3171        0.4714
Xception           95.40%            90.80%            0.2847        0.4103
InceptionResNetV2  94.53%            92.47%            0.1989        0.3027


Not bad! These results won't put you into the top leaderboard position but having 92.47% accuracy on a dataset with 120 classes seems like a decent result, taking into account how simply it was achieved using modern deep learning frameworks and models.

Fine-tuning pre-trained models

The training of an ensemble of SGD classifiers on bottleneck features has shown that these features achieve reasonably good prediction results. However, could we improve the classifier's accuracy with some fine-tuning of the original models by re-training top dense layers from scratch? Also, can we somehow pre-process our training set to make the model more robust to overfitting, and to improve its generalization capabilities?

The purpose of the fine-tuning process is to adjust a pre-trained model to your data. In most cases, the model you're going to re-use was trained on a dataset with a different number of classes. Therefore, you need to (at least) replace the top classification layer of the network. Other layers can be "locked", or frozen, i.e. their weights do not receive updates while the new top layer is being trained. Figure 6 below shows a schematic representation of the process described.

Figure 6. Schematic representation of the fine-tuning process.

The process of training the new top layer is not too different from the previous approach of using a network as a feature extractor; we only use a different type of classifier (a one-layer fully-connected network). However, we can improve the network's generalization capabilities by extending the training process with data augmentation techniques: each fine-tuned network is trained on slightly modified copies of images from the original dataset (a bit rotated, zoomed in, etc.), generated on the fly. With the previous approach, we would have had to store all augmented images somewhere before showing them to the training algorithm.
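To illustrate why augmentation during training needs no extra storage, here is a minimal NumPy sketch of a generator that produces a freshly modified batch on every step. It only performs random horizontal flips; the function name `augmented_batches` and the toy arrays are hypothetical, and a real pipeline would add shifts, zooms and rotations.

```python
import numpy as np


def augmented_batches(images, labels, batch_size, rng):
    """Yield batches forever; every batch is augmented on the fly."""
    n = len(images)
    while True:
        idx = rng.choice(n, size=batch_size, replace=False)
        batch = images[idx].copy()
        flip = rng.random(batch_size) < 0.5      # flip about half of the images
        batch[flip] = batch[flip, :, ::-1, :]    # mirror along the width axis
        yield batch, labels[idx]


# Toy data: six 2x3 single-channel "images", labeled by their own index.
images = np.arange(6 * 2 * 3, dtype="float32").reshape(6, 2, 3, 1)
labels = np.arange(6)

generator = augmented_batches(images, labels, batch_size=4,
                              rng=np.random.default_rng(0))
batch, batch_labels = next(generator)
print(batch.shape)  # (4, 2, 3, 1)
```

The augmented copies exist only for the duration of one training step, so the dataset on disk stays unchanged.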

Fine-tuning benchmark

To test fine-tuning accuracy, we're going to use the same architectures with an additional one:

  1. ResNet50
  2. InceptionV3
  3. Xception
  4. InceptionResNetV2

Each model was trained for 100 epochs with early stopping and with 128 samples per batch using the same optimizer, SGD with Nesterov momentum enabled:

from keras.optimizers import SGD
sgd = SGD(lr=0.001, momentum=0.99, nesterov=True)

The following data augmentation parameters were chosen:

from keras.preprocessing.image import ImageDataGenerator
transformer = ImageDataGenerator(
    width_shift_range=0.2,
    height_shift_range=0.2,
    zoom_range=0.2,
    rotation_range=30,
    vertical_flip=False,
    horizontal_flip=True)

And, finally, the model training process could be implemented as follows:

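from keras.models import Model
from keras.layers import Dense

# create_model is the helper used in this post; it is assumed here to build
# the base network with global pooling (e.g. pooling='avg'), so that
# base.output is a single feature vector per image.
base = create_model(include_top=False)
for layer in base.layers:
    layer.trainable = False  # freeze pre-trained weights; train only the new top layer
x = Dense(120, activation='softmax')(base.output)
model = Model(inputs=base.inputs, outputs=x)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])

train_gen = create_training_generator()
valid_gen = create_validation_generator()
model.fit_generator(train_gen, validation_data=valid_gen)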

The table below shows the performance of the trained models. The 'Public Score' column contains the loss value which was reported after submitting classification results to Kaggle.

Network            Accuracy (valid)  Loss (valid)  Public Score
ResNet50           73.39%            0.9239        0.94090
InceptionV3        88.26%            0.3446        0.34328
Xception           90.31%            0.3132        0.34168
InceptionResNetV2  91.59%            0.2561        0.28052


Well, this is not a big improvement over the results achieved previously, but we have fine-tuned only a single fully-connected layer, which is not that different from the "shallow" classifiers we trained on bottleneck features. Nevertheless, data augmentation combined with a single-layer dense network on top of the pre-trained InceptionResNetV2 has produced the lowest validation loss (0.2561) among all classifiers discussed in this post, even though its validation accuracy (91.59%) is slightly below the 92.47% of the SGD ensemble.
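For reference, the loss values in these tables are multi-class log loss (categorical cross-entropy), which is also the metric behind the Kaggle public score: the average negative logarithm of the probability assigned to the correct class. A minimal sketch in plain Python follows; the clipping constant `eps` and the toy predictions are assumptions for illustration.

```python
import math


def multiclass_log_loss(y_true, probabilities, eps=1e-15):
    """Average negative log-probability of the correct class."""
    total = 0.0
    for label, probs in zip(y_true, probabilities):
        # Clip to avoid log(0) when a model is confidently wrong.
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)


# Two toy samples with three classes each.
loss = multiclass_log_loss([0, 2], [[0.8, 0.1, 0.1], [0.25, 0.25, 0.5]])

# A model that spreads probability uniformly over 120 breeds.
uniform_loss = multiclass_log_loss([7], [[1.0 / 120] * 120])
print(round(loss, 4), round(uniform_loss, 4))
```

A uniform guess over 120 breeds scores ln(120), about 4.79, so a loss around 0.3 means the correct breed receives a probability of roughly exp(-0.3), about 0.74, in the geometric-mean sense.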

Example: predicted breeds

So far we have only talked about loss, accuracy, and model training, so let's see what actual predictions look like on a few brand-new images from the Internet which are not present in any of the data subsets. The image below shows the five most probable model predictions for each test image.

Finally, we get a (probabilistic) answer to the question asked in the post's heading picture: the dog it shows is likely to be an American Staffordshire Terrier.

Note that the model is not at all certain about the breed of the cat in the bottom right picture. This means that we could use our model as a kind of "dog detector", provided, of course, that we want to detect dogs similar to the ones in the training dataset.
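Both ideas, listing the five most probable breeds and flagging images the model is unsure about, can be sketched in a few lines of plain Python. The breed names, probabilities and the 0.5 confidence threshold below are made up for illustration and do not come from the trained models.

```python
def top_k(probabilities, class_names, k=5):
    """Return the k most probable (class name, probability) pairs."""
    ranked = sorted(zip(probabilities, class_names), reverse=True)
    return [(name, prob) for prob, name in ranked[:k]]


def looks_like_known_dog(probabilities, threshold=0.5):
    """Crude "dog detector": is any single breed confidently predicted?"""
    return max(probabilities) >= threshold


breeds = ["beagle", "boxer", "collie", "pug", "whippet", "basenji"]
dog_probs = [0.02, 0.85, 0.03, 0.05, 0.04, 0.01]   # one clear winner
cat_probs = [0.18, 0.15, 0.17, 0.16, 0.17, 0.17]   # probability spread evenly

print(top_k(dog_probs, breeds))
print(looks_like_known_dog(dog_probs), looks_like_known_dog(cat_probs))  # True False
```

A flat distribution like the second one is what the cat picture produces: no breed stands out, so the image is rejected by the threshold.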

Conclusion

Out-of-the-box models (as feature extractors or with fine-tuned top layers and data augmentation) have demonstrated good results in classifying images from the dog breeds dataset while requiring very little effort to be trained and to be used.

I believe that one could get much better results with the aforementioned networks by adding more top layers and regularization techniques, testing various optimization algorithms, "unfreezing" more hidden layers, etc. Nevertheless, I would say that it is a good idea to start with a simple base model like the one shown in this post to set a "lower bound" on accuracy/loss metrics before trying more sophisticated solutions.

One of the drawbacks of this analysis is that the selected dataset was built by taking canine class images from ImageNet. This means that we are running our networks on data which they have probably already seen. Therefore, the reported results could be somewhat optimistic compared with running pre-trained models on brand-new data.

The analysis discussed in this post can be found in this repository.