Convolutional Neural Networks (CNNs) are typically used in computer vision. Kim (2014) was one of the first papers to apply CNNs to NLP, focusing on the sentence classification task. To apply CNNs, or any other type of neural network, to NLP tasks, we must first translate our inputs (sentences / documents) into a matrix, where each row vector represents a token (usually a word, but it could be a character). These row vectors are usually word embeddings learned from word2vec, GloVe or fastText. For a 7-word sentence using 5-dimensional embeddings, we would have a 7×5 matrix as our input.
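As a minimal sketch of this step, the snippet below stacks one embedding row per token to form the 7×5 input matrix. The vocabulary, sentence and embedding values are hypothetical; in practice the embeddings would be looked up from a pretrained word2vec, GloVe or fastText table.

```python
import numpy as np

# Hypothetical 5-dimensional embedding table; real values would come
# from pretrained word2vec, GloVe or fastText vectors.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "a", "soft", "mat"]
embeddings = {word: rng.standard_normal(5) for word in vocab}

def sentence_to_matrix(sentence):
    """Stack one embedding row per token: an n-word sentence -> n x 5 matrix."""
    return np.stack([embeddings[w] for w in sentence.split()])

X = sentence_to_matrix("the cat sat on a soft mat")
print(X.shape)  # (7, 5): one row per word, one column per embedding dimension
```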
Once we have our input matrix, we apply convolution filters to it so that our model can detect features. When applied to NLP tasks, the filters usually move in one dimension only, with only the height of the filters varying. Because each row vector represents a word, it only makes sense for the width of each filter to match the width of the input matrix; in our example, all filters have a width of 5. This is in contrast to CNNs applied to images, where the filters move in two dimensions and both the width and height of a filter can vary, allowing the filter to focus on different parts of an image.

In the figure above, there are various region sizes (filter heights) of 2, 3 and 4. The different filter sizes can be viewed as different n-grams captured by the CNN; for example, a filter size of 2 allows the model to capture bigrams. This is useful because sentences often contain phrases in which multiple words combine to express a different meaning. Every filter performs a convolution on the sentence matrix and generates a feature map, and because the region sizes differ, the feature maps differ in size. In our figure example, we have two filters per region size, meaning that each region size generates two feature maps, totalling 6 feature maps. On each feature map, we perform max-pooling to extract the most important feature, creating 6 univariate feature vectors. These 6 values are then concatenated to form a single feature vector, which serves as input to a fully-connected layer, which usually includes regularisation techniques (such as dropout) and a softmax activation function in the output layer.
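The pipeline above can be sketched end to end in NumPy: convolve with two filters per region size (2, 3, 4), max-pool each of the 6 feature maps, concatenate into a single 6-dimensional vector, and feed a softmax output layer. The filter weights, output weights and 3-class setup are all illustrative assumptions, not trained values.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((7, 5))  # 7-word sentence, 5-dim embeddings

region_sizes = [2, 3, 4]
# Two filters per region size; every filter spans the full width of 5.
filters = {h: rng.standard_normal((2, h, 5)) for h in region_sizes}

def convolve(X, f):
    """Slide a (h x 5) filter down the sentence matrix, one output per position."""
    h = f.shape[0]
    return np.array([np.sum(X[i:i + h] * f) for i in range(X.shape[0] - h + 1)])

pooled = []
for h in region_sizes:
    for f in filters[h]:
        fmap = convolve(X, f)      # feature map of length 7 - h + 1
        pooled.append(fmap.max())  # max-pooling: keep the most important feature

features = np.array(pooled)        # concatenated single feature vector, shape (6,)

# Fully-connected output layer with softmax over 3 hypothetical classes.
W = rng.standard_normal((3, 6))
b = np.zeros(3)
logits = W @ features + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(features.shape, round(probs.sum(), 6))  # (6,) 1.0
```

Note that max-pooling also makes the final feature vector's length independent of sentence length, which is what lets a fixed-size fully-connected layer sit on top.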