Introduction
The fastai library is built on top of PyTorch and encodes many state-of-the-art best practices. It also bundles many datasets. In this lecture, we will be exploring the IMDB dataset.
A good practice is to always work on a sample of your data before using the full dataset. This allows you to iterate faster and get your code working. In addition, a good first step for any data problem is to explore the dataset and get a solid idea of what the data looks like! In this case, we have movie reviews, and each is labeled either “positive” or “negative”.
TextList object (from fastai library)

It is a basic ItemList for text data

It has attributes train, valid, and vocab
 The vocab attribute holds one list (itos) and one dictionary (stoi)

stoi – string to integer

itos – integer to string

The stoi dictionary might be longer than the itos list because stoi is a many-to-one mapping: several different strings can map to the same integer (for example, rare words that all map to the unknown token's index). itos stores exactly one string per integer index, i.e. the unique words kept in the vocabulary, whereas stoi can accumulate extra strings that share an integer.

The reason itos is a list (rather than a dictionary) is that the list index already acts as the key for each string, and a list uses less memory than a dictionary.
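The itos/stoi relationship can be sketched with a toy vocabulary. This is a minimal illustration, not fastai's actual implementation; the vocabulary contents here are made up, and I assume the common convention that unknown words fall back to index 0:

```python
from collections import defaultdict

# Toy vocabulary (hypothetical; fastai builds the real one from the corpus).
itos = ['xxunk', 'the', 'movie', 'great']               # index -> string
# stoi maps string -> index; a defaultdict sends unseen strings to 0
# ('xxunk'), so after lookups stoi can hold more entries than itos.
stoi = defaultdict(int, {s: i for i, s in enumerate(itos)})

stoi['spectacular']        # unknown word -> 0 (and it is now stored in stoi)
```

After the lookup of 'spectacular', stoi contains five entries while itos still has four, which is the many-to-one effect described above.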

Creating a term-document matrix (manually using Scipy)
In the previous lecture, we used sklearn’s CountVectorizer to compute the term-document matrix, but here we will learn how to create a term-document matrix manually using Counters and sparse matrices.
Counters – Pass a list into a Counter and it outputs a dictionary mapping each unique key to its number of occurrences in the list (the values).
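A quick sketch of Counter in action, using a made-up token list:

```python
from collections import Counter

tokens = ['the', 'movie', 'was', 'the', 'best']
counts = Counter(tokens)
# Counter behaves like a dict of token -> number of occurrences
print(counts['the'])   # -> 2
print(counts['best'])  # -> 1
```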
Sparse Matrices – A matrix with lots of zeros (the opposite of a dense matrix). For sparse matrices, you can save a lot of memory by storing only the non-zero values.
 Three most common sparse storage formats
 Coordinate-wise (COO)

Stores three variables: column indexes, row indexes, and corresponding values

 Compressed sparse row (CSR)

Stores column indexes and values as in COO, but instead of storing one row index per value it stores rowptr (row pointers), where each entry marks the position at which a new row begins. This is shown in the figure below:

With this storage format, memory use is reduced (the row-pointer array has only one entry per row rather than one per non-zero value), and it’s great for accessing data by row
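The row-pointer idea can be inspected directly with scipy; this sketch builds a small made-up matrix in CSR form and prints its three arrays:

```python
import scipy.sparse

# A small matrix stored in CSR form:
# [[0, 2, 0],
#  [0, 0, 3],
#  [4, 0, 0]]
m = scipy.sparse.csr_matrix([[0, 2, 0], [0, 0, 3], [4, 0, 0]])
print(m.indices)   # column index of each non-zero value: [1 2 0]
print(m.data)      # the values themselves:               [2 3 4]
print(m.indptr)    # row i occupies data[indptr[i]:indptr[i+1]]: [0 1 2 3]
```

Note that scipy calls the row-pointer array indptr; row i's non-zero entries live in the slice data[indptr[i]:indptr[i+1]].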

 Compressed sparse column (CSC)

Same as CSR but column-wise

Scipy is great at converting sparse matrices between these storage formats (linear-time operations). Below is an implementation of the term-document matrix:
import scipy.sparse
from collections import Counter

def get_term_doc_matrix(label_list, vocab_len):
    j_indices = []
    indptr = []
    values = []
    indptr.append(0)
    for doc in label_list:
        # Build a dictionary of word frequencies for each review using Counter
        feature_counter = Counter(doc.data)
        # Save the words in j_indices and the frequencies in values;
        # indptr stores the index at which each new row (review) begins
        j_indices.extend(feature_counter.keys())
        values.extend(feature_counter.values())
        indptr.append(len(j_indices))
    # Build a CSR matrix directly from (values, column indexes, row pointers)
    return scipy.sparse.csr_matrix((values, j_indices, indptr),
                                   shape=(len(indptr) - 1, vocab_len),
                                   dtype=int)
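To see the function in action, here is a sketch with a hypothetical stand-in for fastai's processed documents (the Doc type and its token ids are made up; in fastai, each document's data attribute holds the integer token ids). The function definition is repeated so the snippet runs standalone:

```python
import scipy.sparse
from collections import Counter, namedtuple

def get_term_doc_matrix(label_list, vocab_len):
    # (same function as above, repeated so this snippet is self-contained)
    j_indices, values, indptr = [], [], [0]
    for doc in label_list:
        feature_counter = Counter(doc.data)
        j_indices.extend(feature_counter.keys())
        values.extend(feature_counter.values())
        indptr.append(len(j_indices))
    return scipy.sparse.csr_matrix((values, j_indices, indptr),
                                   shape=(len(indptr) - 1, vocab_len),
                                   dtype=int)

# Hypothetical stand-in for fastai's processed documents: each doc's
# .data attribute holds the token ids (integers into the vocab) of one review.
Doc = namedtuple('Doc', ['data'])
docs = [Doc([1, 2, 1]), Doc([0, 3])]   # two tiny "reviews"

mat = get_term_doc_matrix(docs, vocab_len=4)
print(mat.toarray())
# Row 0 counts the token ids of the first review, row 1 of the second:
# [[0 2 1 0]
#  [1 0 0 1]]
```

Each row of the result is one document and each column is one vocabulary id, which is exactly the term-document matrix CountVectorizer produced in the previous lecture.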