Objective and Contribution

We introduce Span-ConveRT, a model for dialog slot filling that frames the task as turn-based span extraction, allowing the model to leverage the representations of a pre-trained language model. We show that Span-ConveRT is useful in few-shot learning scenarios: for instance, it outperforms a span extractor whose representations are trained from scratch in the target domain. We also release RESTAURANTS-8K, a new dataset of actual conversations in the restaurant booking domain. The dataset covers 5 different slots: date, time, people, first name, and last name. This is illustrated below:


Several design rationales lie behind Span-ConveRT, which we discuss below before introducing the Span-ConveRT architecture.

Span extraction for dialog

Recent work has focused on tying intent detection to slot filling, whereby slots are only defined if they occur with certain intents. This alleviates the problem of modelling different categories, but it makes slot-filling performance dependent on the intent detector. Our framework does not depend on intent classification: we identify each slot as a single span of text, making our span extraction component independent of other system components.

Pre-trained representations

Large pre-trained models have improved performance on many NLP tasks, and they also mean that we need only a relatively small amount of domain-specific data to fine-tune a model. However, fine-tuning in its current form requires a separate model for every single slot or domain, which is impractical in real-life applications. Therefore, we propose not to fine-tune the pre-trained encoder, and instead use a single fixed encoder for all of our slot-filling tasks.


ConveRT is a sentence encoder that models the interaction between inputs and relevant responses. It was pre-trained on Reddit data, so by using its pre-trained representations we can take advantage of a large amount of conversational cues for few-shot span extraction.

The final model: Span-ConveRT

The figure below shows the model's architecture. ConveRT encodes the sentences. For sequence tagging, we train a CNN and a CRF on top of the fixed subword representations. We concatenate four binary features and one integer feature to each subword representation to capture additional information about the token:

  1. Whether the token is alphanumeric

  2. Whether the token is numeric

  3. Whether the token is the start of a new word

  4. Character length of the token

  5. Whether the slot is requested
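The feature list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `subword_features` and `augment` are hypothetical, the `starts_word` flag is assumed to come from whatever subword tokenizer is in use, and the embedding is represented as a plain list of floats rather than a tensor.

```python
from typing import List


def subword_features(token: str, starts_word: bool, slot_requested: bool) -> List[float]:
    """Return the five extra features for one subword token.

    `starts_word` and `slot_requested` are assumed to be supplied by the
    caller (tokenizer output and dialog context respectively).
    """
    return [
        float(token.isalnum()),   # 1. token is alphanumeric (binary)
        float(token.isdigit()),   # 2. token is numeric (binary)
        float(starts_word),       # 3. token starts a new word (binary)
        float(len(token)),        # 4. character length (integer)
        float(slot_requested),    # 5. slot is requested (binary)
    ]


def augment(embedding: List[float], token: str, starts_word: bool,
            slot_requested: bool) -> List[float]:
    """Concatenate the extra features to a fixed subword embedding."""
    return embedding + subword_features(token, starts_word, slot_requested)


features = subword_features("7pm", starts_word=True, slot_requested=True)
augmented = augment([0.1, 0.2], "12", starts_word=False, slot_requested=False)
```

Because the encoder stays fixed, features like these are one of the few places where slot-specific information can enter the tagger's input.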

Each span is represented as a sequence of tags indicating which subword tokens are in the span. We tag each token with a before, begin, inside, or after tag, as shown below.
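To make the tagging scheme concrete, here is a minimal sketch that converts a span into a tag sequence and back. The tag strings, the half-open `(start, end)` span convention, and the choice to tag every token as before when an utterance has no span are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # half-open token index range [start, end)


def span_to_tags(n_tokens: int, span: Optional[Span]) -> List[str]:
    """Encode a single span over n_tokens subword tokens as a tag sequence."""
    if span is None:
        # Assumption: with no span present, every token counts as "before".
        return ["BEFORE"] * n_tokens
    start, end = span
    tags = []
    for i in range(n_tokens):
        if i < start:
            tags.append("BEFORE")
        elif i == start:
            tags.append("BEGIN")
        elif i < end:
            tags.append("INSIDE")
        else:
            tags.append("AFTER")
    return tags


def tags_to_span(tags: List[str]) -> Optional[Span]:
    """Decode a tag sequence back into a span, or None if no span is tagged."""
    if "BEGIN" not in tags:
        return None
    start = tags.index("BEGIN")
    end = start + 1
    while end < len(tags) and tags[end] == "INSIDE":
        end += 1
    return (start, end)


tags = span_to_tags(5, (2, 4))
```

Separate before and after tags let a sequence model such as a CRF keep track of whether the span has already occurred, which helps enforce a single span per utterance.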

Experiments and Results

We used RESTAURANTS-8K and DSTC8 as the evaluation datasets. Summary statistics for RESTAURANTS-8K are shown below. The DSTC8 dataset contains span annotations for a subset of slots and covers four different domains: bus and coach booking, buying tickets for events, property viewing, and renting cars. Our evaluation metric is the F1 score for extracting the correct span per user utterance.
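As a rough illustration of the metric, the sketch below computes a span-level F1 over a list of utterances for one slot. It assumes exact-match scoring (a prediction counts only if it equals the gold span) and represents a missing span as `None`; the function name `span_f1` is hypothetical and the exact counting rules used in the evaluation may differ.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]


def span_f1(gold: List[Optional[Span]], pred: List[Optional[Span]]) -> float:
    """Exact-match F1 with one (possibly absent) span per utterance."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p is not None and p == g:
            tp += 1              # predicted exactly the gold span
        else:
            if p is not None:
                fp += 1          # predicted a span that is not the gold span
            if g is not None:
                fn += 1          # a gold span was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(0, 2), None, (1, 3), (4, 5)]
pred = [(0, 2), (1, 2), (1, 2), None]
score = span_f1(gold, pred)
```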

We have two baseline models: V-CNN-CRF and Span-BERT. V-CNN-CRF learns subword representations from scratch, whereas Span-BERT uses fixed BERT subword representations. For both baselines, we tune hyperparameters to optimise performance on the dev set of RESTAURANTS-8K.

Lastly, we want to investigate whether our models perform well in low-data regimes, so for both datasets we measure how performance changes as we progressively shrink the training set while keeping the test set fixed.
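A low-data experiment of this kind could be set up along the following lines. The fractions, the fixed seed, and the decision to make each smaller subset nested inside the larger ones are all assumptions for illustration; only the 1/16 setting is mentioned explicitly in this summary.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def training_subsets(examples: Sequence[T],
                     fractions: Sequence[float] = (1, 1 / 2, 1 / 4, 1 / 8, 1 / 16),
                     seed: int = 0) -> Dict[float, List[T]]:
    """Return progressively smaller training subsets, keyed by fraction.

    The data is shuffled once, so each smaller subset is a prefix of the
    larger ones (an illustrative design choice, not taken from the source).
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return {f: shuffled[: max(1, int(len(shuffled) * f))] for f in fractions}


subsets = training_subsets(range(160))
sizes = {f: len(s) for f, s in subsets.items()}
```

Each subset would then be used to train a model that is evaluated on the same, untouched test set.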


The results for both datasets are shown below. Span-ConveRT outperforms the baseline models in most of the evaluation scenarios, demonstrating the effectiveness of pre-trained and transferred representations. In addition, as we reduce the amount of training data, Span-ConveRT's performance decreases at a slower pace than that of V-CNN-CRF and Span-BERT, widening the performance gap. This again shows that the conversational knowledge encoded in pre-trained ConveRT can be used to improve dialog modelling in low-data settings.

We also observe that Span-BERT significantly underperforms both Span-ConveRT and V-CNN-CRF. This suggests that for conversational applications, pre-training on a conversational task (as in ConveRT) is more beneficial than pre-training for language modelling (as in BERT).

Error Analysis

We also performed an error analysis to better understand the performance of Span-ConveRT. We looked at four different types of error:

  1. Predicted no span when there was a span

  2. Predicted a span when there was no span

  3. Predicted a span which does not overlap the label span

  4. Predicted a span which overlaps the label span
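The four categories above can be expressed as a small classification function. This is a sketch under assumptions: spans are half-open `(start, end)` token ranges, a missing span is `None`, and a prediction identical to the label counts as correct rather than as a type 4 error. The name `error_type` is hypothetical.

```python
from typing import Optional, Tuple

Span = Tuple[int, int]  # half-open token index range [start, end)


def error_type(gold: Optional[Span], pred: Optional[Span]) -> Optional[int]:
    """Return the error category (1-4), or None if the prediction is correct."""
    if pred == gold:
        return None   # correct, including the case where both are absent
    if pred is None:
        return 1      # predicted no span when there was a span
    if gold is None:
        return 2      # predicted a span when there was no span
    overlaps = pred[0] < gold[1] and gold[0] < pred[1]
    return 4 if overlaps else 3   # 4: overlapping span, 3: disjoint span


examples = [
    ((2, 4), None),    # type 1
    (None, (1, 2)),    # type 2
    ((2, 4), (5, 6)),  # type 3
    ((2, 4), (3, 6)),  # type 4
]
categories = [error_type(g, p) for g, p in examples]
```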

We observed that when the models are trained in the high-data setting, the distribution of errors is similar for Span-ConveRT and V-CNN-CRF. However, when they are trained on 1/16 of the total training data, the difference is more obvious: Span-ConveRT produces a larger proportion of type 4 errors on every slot. This means that although Span-ConveRT might not be precisely correct, it can still yield a span from which the correct value could be parsed.

Conclusion and Future Work

Potential future work could involve investigating multi-domain span extraction architectures.


