
BERT – A Perfect Gift For The New Year
It has been six decades since the first NLP algorithms were tested, and it is safe to say that our ability to represent complex words and sentences in forms that capture their underlying meanings and syntactic relationships only keeps getting better. The data science team at WatermelonBlock has embraced experimentation: we have been implementing deep text mining algorithms to steadily improve accuracy and build an efficient data pipeline, one that can crunch the massive natural language corpora we acquire from the web while managing the bias-variance trade-off with ease. Our partners at IBM work with us to leverage IBM PowerAI servers for running these experiments and getting the desired results.
The team has been getting its hands dirty with some of the most remarkable natural language processing models ever developed, in order to serve our customers with a smart yet generalized real-time insights application. One such model is ‘BERT’, the trending open-source NLP model recently released by Google. Our initial tests of the BERT framework for our application showed a 20% reduction in classification error, and it took just 4 hours to train the ensemble model on an unstructured dataset of over 1.4 million examples.
It seemed too good to be true, but it is one of the best results we have recorded so far, a perfect New Year gift that any data science team would crave. In fact, we have already fallen in love with the framework and can’t wait to deploy it to production.
So what’s BERT and how does it work?
BERT, short for Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations that delivers state-of-the-art results on a wide array of natural language processing tasks, such as inter-lingual machine translation and fact-checking. There are a couple of concepts one needs to be aware of in order to understand what BERT is, so let’s start by looking at the ideas behind the model’s foundation.
The Innards of BERT:
BERT is essentially a stack of transformer encoders built around a self-attention mechanism.
If you don’t know what an attention mechanism is, this awesome article by WildML will get you up to speed! A transformer, in turn, is a deep learning architecture that speeds up training by replacing recurrence with attention, so the whole input sequence can be processed in parallel. Although the transformer consists of two significant components, an encoding component and a decoding component connected to one another, BERT uses only the encoding component for its operations.
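To make the attention idea concrete, here is a minimal NumPy sketch of scaled dot-product attention, the operation at the heart of self-attention. The function name and the toy shapes are ours, purely for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal self-attention: each row of Q attends over the rows of K/V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # weighted sum of the value vectors

# Toy example: 4 tokens with 8-dimensional embeddings (sizes are arbitrary here).
tokens = np.random.randn(4, 8)
attended = scaled_dot_product_attention(tokens, tokens, tokens)
print(attended.shape)  # (4, 8) -- one context-aware vector per token
```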
The following diagram shows a simple black-box view of translating a French sentence into English using a transformer.

A transformer stack | Source: Illustrated BERT
The encoding component is a stack of encoders (for the sake of this example we stack 6 of them on top of each other, but one can definitely experiment with other arrangements). Each encoder has a self-attention layer and a feed-forward network layer, while each decoder has the same two layers plus an additional encoder-decoder attention layer.

Encoder & Decoder | Source: Illustrated BERT
The encoder’s inputs initially flow through the self-attention layer – a neural layer that helps the encoder look at other words in the input sentence as it encodes a specific word.
The outputs of the self-attention layer are then fed to a feed-forward network, which transforms each encoded word vector independently before passing it on to the next encoder in the stack.
The decoder has both those layers, but between them sits an encoder-decoder attention layer that helps the decoder focus on the relevant parts of the given input sentence.
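For readers who prefer code, the encoder block described above can be sketched in a few lines of PyTorch. This is a rough illustration under our own assumptions (PyTorch as the framework, 512-dimensional embeddings, 8 attention heads), not the exact architecture used in production:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder layer: self-attention followed by a position-wise feed-forward network."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (seq_len, batch, d_model)
        attn_out, _ = self.self_attn(x, x, x)  # every position attends to every other position
        x = self.norm1(x + attn_out)           # residual connection + layer norm
        x = self.norm2(x + self.ff(x))         # feed-forward applied at each position
        return x

# Stack six of these blocks, mirroring the toy 6-encoder setup above.
encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])
out = encoder(torch.randn(10, 1, 512))         # 10 tokens, batch of 1
```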
Functional Architecture of BERT:
There are two types of BERT models that have been developed:
1. BERT Base Version: smaller in size and computationally affordable, though it may fall short on the most complex text mining tasks.
2. BERT Large Version: larger in size and computationally expensive, but crunches text data to deliver the best results.

BERT Variants | Source: Illustrated BERT
Both BERT models have ‘N’ encoder layers (also called transformer blocks) that form a massive data encoding stack: N = 12 for the base version and N = 24 for the large version. They also differ in the size of their feed-forward networks (768 and 1024 hidden units respectively) and in the number of self-attention heads (12 and 16 respectively).
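For quick reference, the two configurations can be written down in plain Python. The dictionary keys are our own naming; the approximate parameter counts come from the original BERT release:

```python
# The two published BERT configurations (key names are ours, for illustration only).
bert_base = {"encoder_layers": 12, "hidden_size": 768, "attention_heads": 12}    # ~110M parameters
bert_large = {"encoder_layers": 24, "hidden_size": 1024, "attention_heads": 16}  # ~340M parameters
```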
For explanatory purposes, let’s use a mini version of BERT with 6 encoder layers, feed-forward networks of 512 hidden units, and 8 self-attention heads, and use it to classify spam texts from genuine ones.

Encoder Stack | Source: Illustrated BERT
Like a typical transformer encoder, BERT takes in a sequence of words as input, which keeps flowing up the stack. Each layer applies self-attention, passes its results through a feed-forward network, and hands them off to the next encoder. The first input token is a special classification token, [CLS]; the encoded vector produced at its position is the one that eventually gets handed to the classification head (a feed-forward network followed by a softmax function).
As far as the outputs are concerned, each position in the transformer stack gives out a vector of the model’s hidden size (768 in the case of BERT Base).

Encoded Output Flow | Source: Illustrated BERT
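As a hedged sketch of that output flow, the snippet below pulls the per-position vectors out of a pre-trained BERT Base model using the Hugging Face transformers library. The library and the sample sentence are our own choices and are not prescribed by the article:

```python
import torch
from transformers import BertTokenizer, BertModel

# Assumes a recent version of the `transformers` library.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# The tokenizer automatically prepends the [CLS] token to the input sequence.
inputs = tokenizer("Congratulations! You have won a free cruise.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.last_hidden_state   # shape: (1, seq_len, 768)
cls_vector = hidden_states[:, 0, :]         # the vector at the [CLS] position
print(cls_vector.shape)                     # torch.Size([1, 768])
```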
For our spam classification task, the output at a single position is enough, namely the first one, which corresponds to the [CLS] token. That vector is then sent to a classifier of our choice, which is usually a single-layer neural network: a feed-forward layer that produces the final logits, followed by a softmax function. Softmax is a handy activation here because it converts the raw logits into a proper probability distribution, with all the probabilities summing to one.

Probabilistic Classification | Source: Illustrated BERT
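Here is a minimal sketch of that classification head, assuming the 768-dimensional [CLS] vector from the previous step (the layer size and the random stand-in input are purely illustrative):

```python
import torch
import torch.nn as nn

# Single-layer classifier on top of the [CLS] vector: 768 inputs -> 2 classes (spam / genuine).
classifier = nn.Linear(768, 2)

cls_vector = torch.randn(1, 768)            # stand-in for the real [CLS] output
logits = classifier(cls_vector)
probs = torch.softmax(logits, dim=-1)       # logits -> probabilities that sum to one
print(probs, probs.sum())                   # e.g. tensor([[0.61, 0.39]])  tensor(1.)
```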
If there are more target labels, for example to automatically sort emails into categories such as “social post” or “promotional email” as well, it is enough to adjust the classifier network to have more output neurons (in this case, 4) that finally pass through the softmax function.
The probability values obtained from the classifier give the likelihood of the email being spam or otherwise. Spam classification is only a glimpse of what BERT can do well; more complex tasks like fake news detection, social sentiment classification, and token review monitoring are all fair game.
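Extending the head to the four-category case mentioned above only changes the output dimension; here is a hedged sketch with an illustrative label set of our own choosing:

```python
import torch
import torch.nn as nn

labels = ["spam", "genuine", "social", "promotional"]   # illustrative label set
classifier = nn.Linear(768, len(labels))                # 4 output neurons instead of 2

probs = torch.softmax(classifier(torch.randn(1, 768)), dim=-1)
print(dict(zip(labels, probs.squeeze().tolist())))      # one probability per category, summing to 1
```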
Stay tuned to this technical blog series for the exciting upcoming articles we have in store, and to learn more about WatermelonBlock’s engineering culture!
References:
1. “The Illustrated Transformer”, Jay Alammar, 2018, https://jalammar.github.io/illustrated-transformer/
2. “Self-Attention Mechanisms in Natural Language Processing”, https://dzone.com/articles/self-attention-mechanisms-in-natural-language-proc
3. “Open Sourcing BERT: State-Of-The-Art Pre-Training for Natural Language Processing”, Google AI, 2018