Introduction

The fastai library is built on top of PyTorch and encodes many state-of-the-art best practices. It also provides easy access to many datasets. In this lecture, we will be exploring the IMDB movie review dataset.

A good practice is to always work on a sample of your data before using the full dataset. This allows you to iterate more quickly and get your code working. In addition, a good first step for any data problem is to explore the dataset and get a good idea of what the data looks like! In this case, we have movie reviews, and each one is labeled either “positive” or “negative”.
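
As a rough sketch of that first look (assuming fastai v1's text module and the small IMDB_SAMPLE dataset it provides; the exact column names are an assumption):

from fastai.text import *   # fastai v1 text API
import pandas as pd

# Download the small IMDB sample and peek at the raw data
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')   # assumed columns: label, text, is_valid
print(df.head())
print(df['label'].value_counts())    # how many positive vs. negative reviews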

TextList object (from fastai library)

  • It is a basic ItemList for text data

  • It has train, valid, and vocab attributes

  • The vocab attribute has one list (itos) and one dictionary (stoi)
    • stoi – string to integer

    • itos – integer to string

    • The stoi dictionary can be longer than the itos list because the mapping is many-to-one: itos holds one entry per unique token kept in the vocabulary (indexed by integer), while stoi may also map rare words that were dropped from the vocabulary to the same integer (the unknown token). See the sketch after this list

    • The reason itos is a list (rather than a dictionary) is that the list index already serves as the key for each string, and a list uses less memory than a dictionary
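
A rough plain-Python sketch of this relationship (not the actual fastai Vocab implementation; the tokens below are made up):

# index -> token: one entry per unique token kept in the vocabulary
itos = ['xxunk', 'xxpad', 'the', 'movie', 'great']

# token -> index, built directly from itos
stoi = {s: i for i, s in enumerate(itos)}

# Rare words dropped from the vocabulary all map to the unknown token's index,
# so stoi can end up with more entries than itos (a many-to-one mapping)
stoi['obscureword'] = stoi['xxunk']
stoi['rareword'] = stoi['xxunk']

print(itos[3])               # 'movie'
print(stoi['movie'])         # 3
print(len(itos), len(stoi))  # 5 7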

Creating term-document matrix (manually using Scipy)

In the previous lecture, we used sklearn’s CountVectorizer to compute the term-document matrix; here, we will learn how to build it manually using Counters and sparse matrices.

Counters – Pass a list into a Counter and it returns a dictionary-like object mapping each unique element (key) to its number of occurrences in the list (value).
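
For example:

from collections import Counter

tokens = ['the', 'movie', 'was', 'the', 'best', 'movie']
counts = Counter(tokens)
print(counts)           # Counter({'the': 2, 'movie': 2, 'was': 1, 'best': 1})
print(counts['movie'])  # 2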

Sparse Matrices – A matrix with lots of zeros (opposite of a dense matrix). For sparse matrices, you can save a lot of memory by only storing the non-zero values.

  • Three most common sparse storage formats
    • Coordinate-wise (COO)
      • Stores three arrays: the row indices, column indices, and corresponding values of the non-zero entries

    • Compressed sparse row (CSR)
      • Stores column indices and values just like COO, but instead of storing a row index for every value, it stores a row-pointer array (rowptr), where each entry marks the position in the values/column arrays at which a new row starts (see the scipy sketch after this list)

      • With this storage format, the row information shrinks from one entry per stored value to one entry per row, which saves memory, and it’s great for accessing data by row

    • Compressed sparse column (CSC)
      • Same as CSR but column-wise
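
A short scipy sketch of the COO and CSR layouts (scipy calls the row-pointer array indptr; the example matrix is made up):

import numpy as np
import scipy.sparse

dense = np.array([[1, 0, 0, 2],
                  [0, 0, 3, 0],
                  [4, 5, 0, 0]])

# COO: one (row, col, value) triple per stored entry
coo = scipy.sparse.coo_matrix(dense)
print(coo.row)   # [0 0 1 2 2]
print(coo.col)   # [0 3 2 0 1]
print(coo.data)  # [1 2 3 4 5]

# CSR: same columns and values, but the row indices are compressed into
# a pointer array: row i spans data[indptr[i]:indptr[i+1]]
csr = coo.tocsr()
print(csr.indptr)   # [0 2 3 5]
print(csr.indices)  # [0 3 2 0 1]
print(csr.data)     # [1 2 3 4 5]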

Scipy is great at converting sparse matrices between these storage formats (the conversions are linear-time operations). Below is an implementation of the term-document matrix as a function:

from collections import Counter
import scipy.sparse


def get_term_doc_matrix(label_list, vocab_len):
    j_indices = []   # column index (token id) of each stored count
    indptr = [0]     # position in j_indices/values where each row starts
    values = []      # the counts themselves

    for doc in label_list:
        # Build a dictionary of token-id frequencies for each review using Counter
        feature_counter = Counter(doc.data)

        # Save the token ids in j_indices and their frequencies in values;
        # indptr records where the next row (review) starts
        j_indices.extend(feature_counter.keys())
        values.extend(feature_counter.values())
        indptr.append(len(j_indices))

    # Build a CSR matrix: one row per review, one column per vocabulary token
    return scipy.sparse.csr_matrix((values, j_indices, indptr),
                                   shape=(len(indptr) - 1, vocab_len),
                                   dtype=int)
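
A hypothetical usage example with toy numericalized documents (in the lecture these would come from the fastai TextList, where each doc.data holds the token ids and vocab_len would be the length of vocab.itos):

# Stand-in for a numericalized document: .data holds the token ids
class Doc:
    def __init__(self, ids):
        self.data = ids

docs = [Doc([0, 1, 1, 2]),   # counts: {0: 1, 1: 2, 2: 1}
        Doc([2, 2, 3])]      # counts: {2: 2, 3: 1}

mat = get_term_doc_matrix(docs, vocab_len=4)
print(mat.todense())
# [[1 2 1 0]
#  [0 0 2 1]]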