BERT: Working with Long Inputs

Ahmed Salem Elhady
6 min read · Jan 18, 2021

An analysis of methods for handling long documents with the BERT model

In this article, we investigate methods for using long input documents with the BERT model. BERT, short for Bidirectional Encoder Representations from Transformers, has proven itself as one of the most popular off-the-shelf pre-trained language models and improves a wide range of downstream NLP tasks.

For all its strengths, the model suffers from one notable limitation: it can handle input sequences of at most 512 tokens. This restriction burdens NLP tasks where the input is necessarily long, such as phone-call transcript analysis for Customer Satisfaction Prediction, or Document Topic Identification.

In this article, we gather information from different sources into a dedicated investigation of methods that tackle this limitation, and we analyze and compare them.

Understanding the Problem

Before we point any fingers at BERT itself, let’s understand why the problem occurs. BERT’s architecture is inherited from the Transformer, whose main building blocks are self-attention, feed-forward layers, residual connections, and layer normalization.

Figure 1. Transformer Model Architecture

When it comes to long inputs, the model suffers from a couple of problems:

  1. The memory complexity of self-attention is quadratic in the length of the input sequence n, i.e. O(n²). This makes fine-tuning the model on long inputs heavy in resource requirements (see the short sketch after this list).
  2. Pre-trained BERT’s positional encodings are learned on sequences of limited length, so the model is unlikely to generalize well to positions beyond those seen during pre-training.
  3. When training BERT, the authors noticed a degradation in performance once input sequences grew beyond 512 tokens. This is due to the autoregressive nature of the Transformer itself.
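
To make the first point concrete, here is a small illustrative PyTorch sketch (our own, not from any of the referenced sources) showing how quickly the raw attention-score matrix of a single layer grows with sequence length:

```python
import torch

def attention_scores_memory_mb(seq_len, num_heads=12, head_dim=64, batch_size=1):
    """Memory (in MB) taken by the raw attention-score matrix of a single layer."""
    q = torch.randn(batch_size, num_heads, seq_len, head_dim)
    k = torch.randn(batch_size, num_heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1)  # shape: [batch, heads, seq_len, seq_len]
    return scores.numel() * scores.element_size() / 1e6

for n in (128, 512, 2048):
    print(n, f"{attention_scores_memory_mb(n):.1f} MB")
```

Going from 512 to 2048 tokens multiplies this buffer by roughly 16, and that is for a single layer of a single example, before gradients and optimizer state are counted during fine-tuning.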

It is often argued that fine-tuning the model on longer inputs should overcome the second problem, but the first problem implicitly limits what can be done to solve the second. We will address both problems, with greater emphasis on the first.

Long Inputs with BERT

Now let’s discuss the methods people usually follow to use long inputs with BERT.

Trimming Input Sequence

The first and most common approach is to pad and truncate input sequences to a fixed maximum length. Padding is not a problem: you simply mask out the padded positions in the output. Selecting which block of the document to keep, however, is tricky, especially for long documents.
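
As a concrete illustration, here is a minimal sketch of this trimming approach using the Hugging Face transformers library (our own example; the article itself does not prescribe an implementation):

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

long_document = "..."  # a document that may be far longer than 512 tokens

# Truncate to BERT's 512-token limit and pad shorter inputs up to it.
enc = tokenizer(
    long_document,
    truncation=True,        # anything beyond max_length is simply dropped
    padding="max_length",
    max_length=512,
    return_tensors="pt",
)

with torch.no_grad():
    # The attention mask ensures padded positions are ignored by the model.
    logits = model(**enc).logits
prediction = logits.argmax(dim=-1)
```

Everything beyond the 512-token limit is simply discarded, which is exactly why the choice of which block to keep matters.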

Take document topic identification as an example. For coherent documents, such as research papers or single-topic news articles, selecting a block of text at random should not cause problems, since the whole document is written around a well-defined topic. For incoherent documents, such as phone-call transcripts, the conversation may contain stretches in the middle that are irrelevant to the main topic. The same issue arises in Customer Satisfaction Prediction, where the input is a long customer call transcript: the customer may be angry at the beginning and satisfied by the end, or vice versa.

Averaging Segment Outputs’ Votes

The second technique is to divide the long document into overlapping fixed-length segments and combine their classification votes. This approach is attractive because it addresses the weakness of the previous one: different segments carry different information, so combining votes from all of them incorporates information from the entire document.

Figure 2. Segmenting a document into k chunks

The method divides the document into k segments, each no longer than BERT’s maximum input length (MAX_SEQ_LENGTH), runs each segment through BERT, and collects its classification logits. The k votes are then combined to produce the final classification, usually by averaging the logits or by majority voting. Figure 2 shows an example of document segmentation.
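
A rough sketch of this segment-and-vote procedure, again using Hugging Face transformers (segment length, stride, and the averaging rule are illustrative choices on our part):

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def classify_by_voting(document, seg_len=510, stride=255):
    """Split a long document into overlapping segments and average their logits."""
    ids = tokenizer(document, add_special_tokens=False)["input_ids"]
    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id

    segment_logits = []
    for start in list(range(0, len(ids), stride)) or [0]:
        chunk = [cls_id] + ids[start:start + seg_len] + [sep_id]
        input_ids = torch.tensor([chunk])
        with torch.no_grad():
            logits = model(input_ids).logits  # shape: [1, num_labels]
        segment_logits.append(logits)

    avg_logits = torch.cat(segment_logits, dim=0).mean(dim=0)  # average vote
    return avg_logits.argmax().item()
```

Majority voting would instead take the argmax per segment and pick the most frequent label; either way, BERT itself is only used for inference here.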

The problem with this approach is that you cannot fine-tune BERT end-to-end on your task, since the voting step is not differentiable. Furthermore, you lose information shared between the segments of the document, whether attention spans or other contextual information encoded in the output vectors. These issues can noticeably impair overall performance.

Recurrence/Transformer Over BERT

The idea of segmenting the document is interesting in itself and can be modified into a better solution for long documents. Such a modification was introduced by Pappagari et al. [1], who propose two models built on top of BERT. The two models are quite similar, and we discuss both of them.

The architecture follows the same document segmentation as the previous approach: it divides the document into k segments and passes each one through BERT. The pooled output (vector H in Figure 3) and the logits (vector P in Figure 3) are used as the representation of each segment and are fed into either an LSTM, giving Recurrence over BERT (RoBERT), or a lightweight transformer, giving Transformer over BERT (ToBERT). This architecture not only allows BERT to be fine-tuned on the task, but also improves memory complexity.

Figure 3. The BERT model
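
Below is a minimal RoBERT-style sketch in PyTorch (our own illustration, not the authors’ implementation): each segment is encoded with BERT, the pooled outputs H are fed to an LSTM, and the final hidden state is classified. The paper additionally experiments with using the segment logits P as features, and ToBERT simply replaces the LSTM with a small transformer encoder over the segment representations.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class RoBERT(nn.Module):
    """Recurrence over BERT: an LSTM on top of per-segment pooled outputs."""
    def __init__(self, num_labels, lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden, batch_first=True)
        self.classifier = nn.Linear(lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # input_ids: [batch, k_segments, seg_len] -- the k segments of each document
        b, k, seg_len = input_ids.shape
        flat_ids = input_ids.view(b * k, seg_len)
        flat_mask = attention_mask.view(b * k, seg_len)

        # H: pooled [CLS] representation of each segment, shape [b * k, hidden]
        pooled = self.bert(flat_ids, attention_mask=flat_mask).pooler_output
        segments = pooled.view(b, k, -1)   # back to [b, k, hidden]

        _, (h_n, _) = self.lstm(segments)  # recurrence over the segment sequence
        return self.classifier(h_n[-1])    # document-level logits
```

Because the whole pipeline is differentiable, the document-level loss can back-propagate into BERT, which is what allows task-specific fine-tuning.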

With the full sequence of length n divided into k segments, each BERT pass costs O((n/k)²) in self-attention, so the k passes together cost O(n²/k). On top of that:

  • RoBERT: the LSTM over the k segment representations adds O(k), giving an overall memory complexity of O(n²/k + k).
  • ToBERT: the segment-level transformer adds O(k²), giving an overall memory complexity of O(n²/k + k²).

In both cases the dominant term is roughly a factor of k smaller than the O(n²) cost of running self-attention over the full sequence, provided k stays small relative to n.
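
As a back-of-the-envelope illustration (the numbers below are hypothetical, not taken from the paper), splitting a 4096-token document into 8 segments reduces the dominant attention cost by roughly the factor k:

```python
# Hypothetical document of n = 4096 tokens split into k = 8 segments of 512 tokens each.
n, k = 4096, 8

full_attention = n ** 2              # one self-attention pass over the whole sequence
segment_passes = k * (n // k) ** 2   # k independent BERT passes over n/k-token segments
robert_total = segment_passes + k        # + LSTM over the k segment representations
tobert_total = segment_passes + k ** 2   # + small transformer over the k representations

print(full_attention)  # 16777216
print(robert_total)    # 2097160 -> about k times cheaper
print(tobert_total)    # 2097216 -> about k times cheaper
```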

Comparing RoBERT and Averaging Votes

We replicated the experiment introduced by Pappagari et al. [1] to compare RoBERT with the averaging techniques.

Customer Satisfaction Prediction

The experiment in the paper compares three averaging methods; we used only two of them:

  1. Majority voting
  2. Averaging pooled output

Our results came very close to those reported in the paper, as seen in Table 1. They show that RoBERT outperforms the direct averaging methods because it can exploit end-to-end fine-tuning of the parameters.

Table 1. CSAT results

Long Document Topic Identification

We also experimented with the suggested architecture for topic identification on long documents, following Armand Oliver’s blog post. We used the consumer call transcripts from the US Consumer Finance Complaints dataset as the long input documents and the product as the topic class, fine-tuning BERT-Large rather than BERT-Base as in the blog.

We achieved significantly better results than with the approaches discussed earlier, reaching almost 89% accuracy after 2 epochs of fine-tuning and ~91% after 5 epochs. Increasing the number of epochs further did not yield notable improvements.

Fisher Topic Identification

This is another topic identification experiment. The authors [1] report results on the Fisher Phase 1 corpus, comparing BERT output averaging (averaging the segments’ predictions) with ToBERT across different document-length ranges. They plot the average accuracy per range for both approaches to visualize ToBERT’s performance gain, as shown in Figure 4. Across all ranges, ToBERT outperforms the averaging technique, and its classification performance is also more consistent as document length increases.

Figure 4. ToBERT vs. Average Voting on Fisher corpus for Topic Identification

Side Notes on The Experiments

The authors reported that ToBERT exploits the pre-trained BERT features better than RoBERT; however, we believe this claim should be further justified experimentally.

Conclusion

Our investigation concludes that the best way to use BERT on long inputs depends on the nature of your task. For complex tasks that require information from the entire document, using recurrence or a transformer over BERT appears to be a promising solution.

References

  1. Raghavendra Pappagari, Piotr Zelasko, Jesus Villalba, Yishay Carmiel, and Najim Dehak. 2019. Hierarchical transformers for long document classification. arXiv preprint arXiv:1910.10781.
