Exploring Automated Image Captioning With Attention Mechanisms

Introduction

In this blog post, we'll dive into automated image captioning with attention mechanisms. Automated image captioning is the task of generating a textual description of an image, and attention mechanisms have become a standard tool for improving neural networks on tasks like this one, because they let a model focus on the most relevant parts of its input.

One popular approach to image captioning combines Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs): the CNN extracts features from the image, and the RNN generates the textual description from those features. On its own, however, this encoder-decoder setup compresses the whole image into a single representation and cannot focus on specific parts of the image while generating each word. Attention mechanisms solve this problem by letting the model weigh different regions of the image according to their relevance to the word currently being generated, as sketched below.
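
To make the idea concrete, here is a minimal sketch of Bahdanau-style (additive) attention written as a custom Keras layer. The class name and unit sizes are illustrative choices for this post, not part of the Keras API: the layer scores every image region against the decoder's current state, normalizes the scores with a softmax, and returns the weighted sum of the regions (the context vector).

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: score(region, state) = v^T tanh(W1*region + W2*state)."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects each image region
        self.W2 = tf.keras.layers.Dense(units)  # projects the decoder state
        self.V = tf.keras.layers.Dense(1)       # collapses to one score per region

    def call(self, features, hidden):
        # features: (batch, num_regions, feature_dim); hidden: (batch, hidden_dim)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis)))
        weights = tf.nn.softmax(scores, axis=1)              # one weight per region
        context = tf.reduce_sum(weights * features, axis=1)  # weighted image summary
        return context, weights

Later in the post we will rely on Keras's built-in AdditiveAttention layer, which implements the same scoring rule.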

In this tutorial, we will be using TensorFlow and Keras to implement an image captioning model with attention mechanisms. We will use the MS-COCO dataset for training and evaluation.

Preparing the Data

Before we build the model, let's prepare our data. The MS-COCO dataset consists of images paired with human-written captions. We will first preprocess the images, then tokenize the captions to build a vocabulary.

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the pre-trained InceptionV3 model without its classification head.
# We keep the 8x8 spatial feature map (no global pooling) so the attention
# layer can weigh individual image regions later on.
inception = InceptionV3(weights='imagenet', include_top=False)

Image Preprocessing

def preprocess_image(img_path):
    img = load_img(img_path, target_size=(299, 299))
    img = img_to_array(img)
    img = preprocess_input(img)
    return img

def extract_features(img_path):
    img = preprocess_image(img_path)
    features = inception.predict(img.reshape((1, *img.shape)))
    # Flatten the 8x8 spatial grid into 64 region vectors of 2048 features each,
    # so the attention layer can weigh individual image regions.
    return features.reshape((-1, features.shape[3]))
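
As a quick sanity check, you can run the feature extractor on a single image; the file path below is just a placeholder for an image from your MS-COCO download.

sample_features = extract_features('train2014/COCO_train2014_000000000009.jpg')
print(sample_features.shape)  # (64, 2048): 64 image regions, 2048 features each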

Text Preprocessing

def tokenize_captions(captions, num_words=5000):
    tokenizer = Tokenizer(num_words=num_words, oov_token='<unk>',
                          filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')
    tokenizer.fit_on_texts(captions)
    # Reserve index 0 for the padding token.
    tokenizer.word_index['<pad>'] = 0
    tokenizer.index_word[0] = '<pad>'
    return tokenizer

def generate_input_output_pairs(images, captions, tokenizer, max_len):
    input_images, input_captions, output_captions = [], [], []
    for img, cap in zip(images, captions):
        tokenized_cap = tokenizer.texts_to_sequences([cap])[0]
        for i in range(1, len(tokenized_cap)):
            # Truncate or pad the captions to maintain uniformity in length
            input_cap = pad_sequences([tokenized_cap[:i]], maxlen=max_len, padding='post')[0]
            output_cap = tokenized_cap[i]
            input_images.append(img)
            input_captions.append(input_cap)
            output_captions.append(output_cap)
    return input_images, input_captions, output_captions
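
Putting these helpers together might look like the following. Here all_image_paths and all_captions are placeholder names for parallel lists loaded from the MS-COCO annotation files; they are not defined above. Wrapping every caption in '<start>' and '<end>' markers is a common refinement (and is assumed by the decoding sketch at the end of this post) so the decoder has a seed word and an explicit stopping signal.

# all_image_paths and all_captions are assumed to be parallel lists built
# from the MS-COCO annotations; loading them is not shown here.
all_captions = ['<start> ' + cap + ' <end>' for cap in all_captions]
image_features = [extract_features(path) for path in all_image_paths]

tokenizer = tokenize_captions(all_captions)
max_len = max(len(seq) for seq in tokenizer.texts_to_sequences(all_captions))

input_images, input_captions, output_captions = generate_input_output_pairs(
    image_features, all_captions, tokenizer, max_len)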

Building the Model

Now that our data is ready, let's build the image captioning model with an attention mechanism. Keras does not ship a layer named BahdanauAttention, but its AdditiveAttention layer implements exactly this Bahdanau-style additive scoring, so that is what we will use.

from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, Dense, Embedding, LSTM,
                                     Concatenate, AdditiveAttention)

vocab_size = len(tokenizer.word_index) + 1

# Image branch: 64 spatial regions of 2048-dim InceptionV3 features,
# projected down to the same dimensionality as the word embeddings.
img_input = Input(shape=(64, 2048))
img_features = Dense(256, activation='relu')(img_input)

# Caption branch: embed the partial caption and encode it with an LSTM.
caption_input = Input(shape=(None,))
embedded_caption = Embedding(vocab_size, 256)(caption_input)
caption_encoding = LSTM(256, return_sequences=True)(embedded_caption)

# Bahdanau-style (additive) attention: the caption states query the image regions.
context_vector, attention_weights = AdditiveAttention()(
    [caption_encoding, img_features], return_attention_scores=True)

# Fuse the attended image context with the caption encoding and predict the next word.
x = Concatenate()([caption_encoding, context_vector])
x = LSTM(512)(x)
output = Dense(vocab_size, activation='softmax')(x)

model = Model(inputs=[img_input, caption_input], outputs=output)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
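
If you later want to inspect which image regions the model attends to while generating each word, you can define a second model that reuses the same layers but outputs the attention scores instead. This is optional and purely for visualization:

# Shares weights with `model`; outputs one weight per (caption position, image region) pair.
attention_model = Model(inputs=[img_input, caption_input], outputs=attention_weights)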

Training the Model

Now that the model is constructed, it's time to train it on the preprocessed data. The lists produced earlier are converted to NumPy arrays, since that is what model.fit expects, and the validation_split argument holds out 20% of the pairs so we can monitor performance on examples the model has not trained on.

model.fit([np.array(input_images), np.array(input_captions)],
          np.array(output_captions),
          epochs=30, batch_size=64, validation_split=0.2)
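
After training, a simple greedy decoding loop can caption an unseen image. The sketch below relies on the assumptions made earlier in this post (region features of shape (64, 2048), captions wrapped in '<start>' and '<end>' markers, and the tokenizer and max_len from the preprocessing step); the image path is again just a placeholder.

def generate_caption(img_path):
    features = extract_features(img_path)[np.newaxis, ...]  # shape (1, 64, 2048)
    words = ['<start>']
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([' '.join(words)])[0]
        seq = pad_sequences([seq], maxlen=max_len, padding='post')
        preds = model.predict([features, seq], verbose=0)[0]
        next_word = tokenizer.index_word.get(int(np.argmax(preds)), '<unk>')
        if next_word in ('<end>', '<pad>'):  # stop at the end marker (or padding)
            break
        words.append(next_word)
    return ' '.join(words[1:])

print(generate_caption('val2014/COCO_val2014_000000000042.jpg'))  # placeholder path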

Conclusion

Congratulations, you have implemented an image captioning model with attention mechanisms using TensorFlow and Keras. This model can now be used to generate textual descriptions for unseen images. Keep experimenting with different model architectures and dataset configurations to improve its performance and explore new possibilities. Happy coding!