Given the recent traction of GPT-3 and how language models have come to dominate many areas of NLP, I decided to learn in more detail how to build one. Today’s post covers my experience building a language model from scratch in four main steps: collecting data, training a BPE tokeniser, training the language model, and testing it.

Data collection

The first step to building our language model is to collect a large corpus of text data. The Hugging Face tutorial I was following uses Esperanto, so I collected a corpus of Esperanto text to feed into the model. As recommended, I downloaded the data from the OSCAR corpus as well as the Leipzig Corpora Collection, which covers different types of documents such as news, Wikipedia, and literature in Esperanto. I wrote a script to merge all these data files together. The script is as follows:

import pandas as pd
from pathlib import Path

paths = [str(x) for x in Path("./data/").glob("**/*.txt")]

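dataframes = []
for i in range(1, len(paths)):  # paths[0] is skipped
    # Each file is tab-separated; column 1 holds the text
    df = pd.read_csv(paths[i], sep="\t", header=None)
    dataframes.append(df)

# Stack all files and keep only the text column
overall_df = pd.concat(dataframes)
overall_df = overall_df[1]

overall_df.to_csv('new_data.txt', header=None, index=None, mode='a')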

data = data2 = ""
# Reading data from file1
with open('new_data.txt') as fp:
    data = fp.read()
# Reading data from file2
with open('./data/eo.txt') as fp:
    data2 = fp.read()
# Merging the 2 files: file2's data starts on a new line
data += "\n"
data += data2
with open('final_data.txt', 'w') as fp:
    fp.write(data)
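
Reading both files fully into memory works for a corpus of this size, but a larger one may not fit. As a minimal sketch, assuming plain UTF-8 text files, the same merge can be streamed line by line. The `merge_text_files` helper is hypothetical and not part of the script above.

```python
from pathlib import Path
from typing import Iterable


def merge_text_files(sources: Iterable[Path], destination: Path) -> int:
    """Concatenate text files line by line; return the number of lines written."""
    written = 0
    with destination.open("w", encoding="utf-8") as out:
        for source in sources:
            with source.open(encoding="utf-8") as src:
                for line in src:
                    # Always end with a newline so files never run together
                    out.write(line.rstrip("\n") + "\n")
                    written += 1
    return written
```

Calling it with the two intermediate files and `Path("final_data.txt")` as the destination would produce the same kind of merged corpus without holding either file in memory.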

Training BPE Tokeniser

Here, the goal is to build two files: vocab.json and merges.txt. vocab.json maps the top K tokens found in the text corpus built in the previous step to their respective token ids. Since we are building a byte pair encoding (BPE) tokeniser, merges.txt holds the learned merge rules that let us perform subword tokenisation on our input text. We use the tokeniser from the Hugging Face tokenizers library as shown below:

from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

data = './data/final_data.txt'

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
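tokenizer.train(files=data, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Save files to disk (the target directory must already exist)
Path("./esperberto").mkdir(exist_ok=True)
tokenizer.save_model("./esperberto")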
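
To make the role of merges.txt more concrete, here is a toy sketch of how BPE learns merge rules: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is a simplified character-level illustration, not the byte-level implementation inside the tokenizers library. `learn_bpe_merges` and the sample words are made up for the example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Represent each distinct word as a tuple of symbols with its frequency
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        merges.append(best)
        # Rewrite each word, replacing the chosen pair with one merged symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


merges = learn_bpe_merges(["la", "la", "la", "lando", "bela"], num_merges=2)
print(merges)
```

Each learned pair plays the role of one line in merges.txt, and every symbol that exists after merging would get its own id in vocab.json.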

Training our Language Model from scratch

Now that we have our trained tokeniser and vocabulary, we are ready to start training our language model. There are two steps: loading our dataset and setting our model and training configurations.

Load dependencies

import torch
from pathlib import Path
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

from transformers import RobertaConfig
from transformers import RobertaForMaskedLM
from transformers import RobertaTokenizerFast
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

from torch.utils.data import Dataset

Our Dataset Class

We have our own trained tokeniser and corpus of text. We create a new dataset class that encodes the corpus with our trained tokeniser so that it is ready to be used by our language model. The dataset class is shown below:

class EsperantoDataset(Dataset):
    def __init__(self, evaluate: bool = False):
        tokenizer = ByteLevelBPETokenizer(
            "./esperberto/vocab.json",
            "./esperberto/merges.txt",
        )
        tokenizer._tokenizer.post_processor = BertProcessing(
            ("</s>", tokenizer.token_to_id("</s>")),
            ("<s>", tokenizer.token_to_id("<s>")),
        )
        tokenizer.enable_truncation(max_length=512)
        # or use the RobertaTokenizer from `transformers` directly.

        self.examples = []

        src_files = Path("./data/").glob("*-eval.txt") if evaluate else Path("./data/").glob("final_data.txt")
        for src_file in src_files:
            print("🔥", src_file)
            lines = src_file.read_text(encoding="utf-8").splitlines()
            self.examples += [x.ids for x in tokenizer.encode_batch(lines)]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        # We’ll pad at the batch level.
        return torch.tensor(self.examples[i])
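
The dataset returns sequences of different lengths, which is why padding is left to the batch level: each batch only needs to be padded to its own longest sequence. Below is a minimal sketch of that idea in plain Python. `pad_batch`, the token ids, and `pad_id=1` are assumptions for illustration; during training the data collator is expected to take care of this.

```python
from typing import List, Tuple


def pad_batch(batch: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest one in the batch.

    Returns the padded ids and an attention mask (1 = real token, 0 = padding).
    """
    max_len = max(len(ids) for ids in batch)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return padded, mask


# Two sequences of different lengths; pad_id=1 is an assumed padding id
padded, mask = pad_batch([[0, 316, 2], [0, 871, 1160, 18, 2]], pad_id=1)
print(padded)
print(mask)
```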

Configuring our language model and training

We need to configure two things: our language model architecture (RoBERTa in this case) and the training settings. The configurations shown below are taken straight from the tutorial:

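# Values follow the Hugging Face tutorial
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

training_args = TrainingArguments(
    output_dir="./esperberto",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
)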

Initialising and Training our Language Model

Once we have the configurations, we can use them to initialise our RoBERTa language model and our trainer. We then train the language model using Trainer as shown below:

model = RobertaForMaskedLM(config=config)
tokenizer = RobertaTokenizerFast.from_pretrained("./esperberto", max_len=512)

dataset = EsperantoDataset()
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
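trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# Train, then save the model next to the tokeniser files
trainer.train()
trainer.save_model("./esperberto")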
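
The data collator is what turns plain token sequences into masked language modelling examples. As a rough sketch of what mlm_probability=0.15 means, the function below selects about 15% of the tokens as prediction targets and, following the BERT recipe, replaces 80% of those with the mask token, 10% with a random token, and leaves 10% unchanged. `mask_tokens`, the ids, and the seeded generator are illustrative assumptions; the real collator works on tensors and also avoids masking special tokens.

```python
import random

IGNORE_INDEX = -100  # label value the loss ignores for positions not selected


def mask_tokens(ids, mask_id, vocab_size, mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for masked language modelling on one sequence."""
    rng = rng or random.Random()
    inputs = list(ids)
    labels = [IGNORE_INDEX] * len(ids)
    for i, token in enumerate(ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected as a prediction target
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


# mask_id=4 and vocab_size=52_000 are assumed values for this example
inputs, labels = mask_tokens(list(range(10, 30)), mask_id=4,
                             vocab_size=52_000, rng=random.Random(0))
print(inputs)
print(labels)
```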


Testing our Language Model

Once we have trained and saved our language model, we can use it to predict a masked token in a sequence, as shown below. You can see the output of each query in the comments 🙂

from transformers import pipeline

fill_mask = pipeline("fill-mask", model = './esperberto', tokenizer = './esperberto')

fill_mask("La suno <mask>.")
# [{'score': 0.02119220793247223,
#   'sequence': 'La suno estas.',
#   'token': 316},
#  {'score': 0.012403824366629124,
#   'sequence': 'La suno situas.',
#   'token': 2340}]

fill_mask("Jen la komenco de bela <mask>.")
# [{'score': 0.01814725436270237,
#   'sequence': 'Jen la komenco de bela urbo.',
#   'token': 871},
#  {'score': 0.015888698399066925,
#   'sequence': 'Jen la komenco de bela vivo.',
#   'token': 1160}]
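
The score in each result is the probability the model assigns to that token at the masked position. As a small sketch, assuming we already had raw logits for a handful of candidate tokens (the numbers below are invented, not model output), the ranking amounts to a softmax followed by a top-k selection. `top_k_predictions` is a hypothetical helper, not part of the pipeline API.

```python
import math


def top_k_predictions(logits, k=2):
    """Softmax over a {token: logit} mapping; return the k most probable tokens."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Invented logits for three candidate tokens
print(top_k_predictions({"estas": 2.0, "situas": 1.5, "kaj": 0.1}, k=2))
```

In the real pipeline the softmax runs over the whole 52,000-token vocabulary, which is one reason the top scores above are only a few percent.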

