What is TorchText?

TorchText is a PyTorch package that provides data processing utilities as well as popular NLP datasets. According to the official PyTorch documentation, torchtext has four main components: data, datasets, vocab, and utils. The data module is mainly used to create custom dataset classes, batch samples, and so on. The datasets module contains various NLP datasets, from sentiment analysis to question answering. The vocab module covers vocabulary-related methods for processing text, and utils contains additional helper functions.

What are some of the things that TorchText can do?

  1. Read and tokenise data

  2. Map words to unique integers

  3. Numericalise words

  4. Load the data into any deep learning framework

  5. Do any further preprocessing such as padding

What’s the standard procedure for using TorchText to feed data into neural networks?

  1. Use torchtext.data.Dataset to read, process, and numericalise data

  2. Use torchtext.data.Iterator to batch and pad your data and move it to the GPU for training a neural network
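To make the batching and padding step concrete, here is a minimal pure-Python sketch. It does not use torchtext; `pad_batch`, `iterate_batches`, and `PAD_ID` are hypothetical names chosen for illustration, and the input is assumed to be already numericalised.

```python
from typing import Iterator, List

PAD_ID = 0  # assumption: index 0 is reserved for the padding token


def pad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]


def iterate_batches(data: List[List[int]], batch_size: int) -> Iterator[List[List[int]]]:
    """Yield padded batches containing at most batch_size sequences each."""
    for start in range(0, len(data), batch_size):
        yield pad_batch(data[start:start + batch_size])


# Four already-numericalised sentences of different lengths.
numericalised = [[4, 7, 2], [5], [9, 3, 8, 6], [2, 2]]
batches = list(iterate_batches(numericalised, batch_size=2))
```

Each batch is padded only up to its own longest sequence, so the two batches here end up with different widths.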

What’s a Field in torchtext.data?

A Field is basically a way to specify how you want a certain type of data to be processed. For example, you can create a Text field that needs to be tokenised, lowercased, and numericalised, and a Label field that’s already in numerical form and so doesn’t require the same level of processing.

How does Field work with torchtext.data.Dataset?

Once you have defined your fields (different ways to process different types of variables), you can load the dataset and assign each column to the most appropriate field so that your dataset is processed correctly. Once you have assigned each column to its respective field, you can pass these fields into torchtext.data.Dataset and create your train, val, and test splits.
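The idea of assigning columns to fields can be sketched in plain Python. This is not torchtext's implementation: `SimpleField`, `whitespace_tokenize`, and the column names `review` and `sentiment` are hypothetical, made up only to show how one processing recipe per column might be applied.

```python
from dataclasses import dataclass
from typing import Callable, List


def whitespace_tokenize(text: str) -> List[str]:
    """Simplest possible tokeniser: split on whitespace."""
    return text.split()


@dataclass
class SimpleField:
    """Hypothetical stand-in for a field: a recipe for processing one column."""
    sequential: bool = True   # True for text, False for values such as labels
    lower: bool = False       # lowercase the text before tokenising
    tokenize: Callable[[str], List[str]] = whitespace_tokenize

    def preprocess(self, value: str):
        if not self.sequential:
            return int(value)  # labels only need converting to a number
        if self.lower:
            value = value.lower()
        return self.tokenize(value)


TEXT = SimpleField(lower=True)
LABEL = SimpleField(sequential=False)

# Assign each column to the field that describes how to process it.
fields = {"review": TEXT, "sentiment": LABEL}

rows = [
    {"review": "A GREAT film", "sentiment": "1"},
    {"review": "Dull plot", "sentiment": "0"},
]
examples = [
    {column: fields[column].preprocess(value) for column, value in row.items()}
    for row in rows
]
```

The text column comes out tokenised and lowercased, while the label column is only converted to an integer.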

How do you build your vocabulary using torchtext?

Using the TEXT field that you have created, you can call the build_vocab() method and pass in the training data so that it learns the full range of words in that data. Torchtext’s vocab class has stoi (string to int) and itos (int to string) attributes. The vocab class can also build embedding matrices from pre-trained embeddings.
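To show what the stoi and itos mappings look like, here is a small pure-Python sketch of a vocabulary builder. It only imitates the idea; the `build_vocab` and `numericalise` functions below, the `min_freq` cutoff, and the special tokens `<unk>` and `<pad>` are assumptions for illustration, not torchtext's own code.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def build_vocab(
    tokenised_texts: Sequence[Sequence[str]],
    min_freq: int = 1,
    specials: Tuple[str, ...] = ("<unk>", "<pad>"),
) -> Tuple[Dict[str, int], List[str]]:
    """Return (stoi, itos) built from tokenised training data."""
    counts = Counter(tok for text in tokenised_texts for tok in text)
    # Most frequent words first; ties are broken alphabetically so the
    # resulting indices are deterministic.
    words = sorted(
        (w for w, c in counts.items() if c >= min_freq),
        key=lambda w: (-counts[w], w),
    )
    itos = list(specials) + words
    stoi = {word: index for index, word in enumerate(itos)}
    return stoi, itos


def numericalise(tokens: Sequence[str], stoi: Dict[str, int]) -> List[int]:
    """Map tokens to integers, sending unknown words to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]


train = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
stoi, itos = build_vocab(train, min_freq=2)
ids = numericalise(["the", "bird", "sat"], stoi)
```

Only the training split is passed in, so words that appear solely in validation or test data fall back to the unknown token, as "bird" does here because it is below the frequency cutoff.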

What’s the difference between DataLoaders and Iterator?

They are largely the same, except that Iterator has some convenient functionality that’s specific to NLP, while DataLoaders are used a lot within torchvision and PyTorch.

What is the BucketIterator?

It is one of the most effective features of torchtext. It automatically shuffles and groups input sequences of similar length. This is very useful because the amount of padding is determined by the longest sequence in the batch, so padding is most efficient when the sequences in a batch are of similar length.
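The effect of grouping by length can be checked with a few lines of plain Python. This is a rough sketch of the bucketing idea only, not the BucketIterator itself (it leaves out the shuffling); `padding_needed` and the example lengths are made up for illustration.

```python
from typing import List


def padding_needed(lengths: List[int], batch_size: int) -> int:
    """Count the pad tokens required when batching sequences in the given order."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        # Every sequence is padded up to the longest one in its batch.
        total += sum(longest - n for n in batch)
    return total


# Hypothetical sequence lengths, alternating short and long.
lengths = [3, 12, 4, 11, 2, 13, 5, 10]

unsorted_padding = padding_needed(lengths, batch_size=2)
bucketed_padding = padding_needed(sorted(lengths), batch_size=2)
```

With these lengths, batching in the original order pairs each short sequence with a long one, whereas sorting first puts sequences of similar length together and needs far fewer pad tokens.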

Ryan

Data Scientist