### Naïve Bayes

The log-count ratio r for each word f is:

$$r = \log\left(\frac{\text{ratio of feature } f \text{ in positive documents}}{\text{ratio of feature } f \text{ in negative documents}}\right)$$

where the ratio of feature f in positive documents is the number of positive documents that contain feature f divided by the total number of positive documents (and likewise for negative documents). Binarised Naïve Bayes explores the idea that, for classification, it may only matter whether a word is present in a review or not, rather than how often it occurs.
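
As a toy sketch of this (the tiny matrix, labels, and the +1 smoothing are my own illustration, not taken from the original notebook), the log-count ratio can be computed from binarised document-term counts like so:

```python
import numpy as np

# Toy binarised document-term matrix: 4 reviews x 3 words.
# Rows 0-1 are positive reviews, rows 2-3 are negative reviews.
x = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])  # 1 = positive, 0 = negative

# Ratio of each word in positive/negative documents,
# with +1 smoothing so no ratio is ever zero.
p = (x[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 1)
q = (x[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 1)

r = np.log(p / q)  # log-count ratio per word
```

A positive r marks a word that appears more often in positive reviews, a negative r the reverse, and r near zero a word that carries no sentiment signal.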

### Trigram with NB features

We use logistic regression with NB features to perform sentiment classification. For every document (review), we compute binarised features over unigrams, bigrams, and trigrams; each feature value is its log-count ratio r. The process of creating the n-gram training and validation datasets is shown below. I have added comments to the code so that it is easy to follow (or at least I hope it will make it easier to follow).

#### Defining Variables

```python
from collections import Counter
import pickle

import scipy.sparse

# n-gram range: unigrams up to trigrams
min_n = 1
max_n = 3

# Components for building a CSR sparse matrix
j_indices = []
indptr = []
values = []
indptr.append(0)

# vocab_len is the size of the unigram vocabulary (defined earlier);
# new n-gram ids are assigned starting from here
num_tokens = vocab_len
itongram = dict()  # integer id -> n-gram
ngramtoi = dict()  # n-gram (as string) -> integer id
```
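
These three lists are the raw ingredients of a CSR sparse matrix: `values` holds the counts, `j_indices` the column (feature) ids, and `indptr` marks where each document's entries start and end. A tiny standalone illustration (the numbers are made up):

```python
import scipy.sparse

# Two documents: doc 0 has feature 0 (count 2) and feature 3 (count 1);
# doc 1 has feature 1 (count 4).
values = [2, 1, 4]
j_indices = [0, 3, 1]
indptr = [0, 2, 3]  # doc 0 owns entries [0:2], doc 1 owns entries [2:3]

m = scipy.sparse.csr_matrix((values, j_indices, indptr), shape=(2, 4), dtype=int)
# m.toarray() -> [[2, 0, 0, 1],
#                 [0, 4, 0, 0]]
```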

#### Creating N-gram features for training matrix

```python
# Looping through the training reviews
for i, doc in enumerate(movie_reviews.train.x):
    # Saving unigram features
    feature_counter = Counter(doc.data)
    j_indices.extend(feature_counter.keys())
    values.extend(feature_counter.values())
    this_doc_ngrams = list()
    m = 0
    # Loop through the defined n-gram range to derive n-gram features
    for n in range(min_n, max_n + 1):
        # Slide a window of size n over this document's tokens
        for k in range(len(doc.data) - n + 1):
            ngram = doc.data[k: k + n]
            if str(ngram) not in ngramtoi:
                if len(ngram) == 1:
                    # Unigrams keep their existing vocabulary id
                    num = ngram[0]
                    ngramtoi[str(ngram)] = num
                    itongram[num] = ngram
                else:
                    # New bigrams/trigrams get the next free id
                    ngramtoi[str(ngram)] = num_tokens
                    itongram[num_tokens] = ngram
                    num_tokens += 1
            this_doc_ngrams.append(ngramtoi[str(ngram)])
            m += 1
    # Saving the n-gram features (bigram and trigram) for this document
    ngram_counter = Counter(this_doc_ngrams)
    j_indices.extend(ngram_counter.keys())
    values.extend(ngram_counter.values())
    indptr.append(len(j_indices))

train_ngram_doc_matrix = scipy.sparse.csr_matrix((values, j_indices, indptr),
                                                 shape=(len(indptr) - 1, len(ngramtoi)),
                                                 dtype=int)
```
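
To make the sliding-window loop above concrete, here is the same logic run on a tiny made-up token list (the token ids and vocabulary size are invented for illustration):

```python
from collections import Counter

doc_tokens = [5, 9, 5]  # a three-token "document"
vocab_len = 10          # pretend unigram vocabulary size
min_n, max_n = 1, 3
num_tokens = vocab_len
ngramtoi, itongram = {}, {}
this_doc_ngrams = []

for n in range(min_n, max_n + 1):
    for k in range(len(doc_tokens) - n + 1):
        ngram = doc_tokens[k: k + n]
        if str(ngram) not in ngramtoi:
            if len(ngram) == 1:
                ngramtoi[str(ngram)] = ngram[0]    # unigram keeps its vocab id
                itongram[ngram[0]] = ngram
            else:
                ngramtoi[str(ngram)] = num_tokens  # new id for bigram/trigram
                itongram[num_tokens] = ngram
                num_tokens += 1
        this_doc_ngrams.append(ngramtoi[str(ngram)])

counts = Counter(this_doc_ngrams)
# this_doc_ngrams -> [5, 9, 5, 10, 11, 12]: the unigrams [5], [9], [5],
# the bigrams [5, 9] and [9, 5], and the trigram [5, 9, 5]
```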

#### Creating N-gram features for validation matrix

```python
# Reset the CSR components; the n-gram dictionaries from training are reused
j_indices = []
indptr = []
values = []
indptr.append(0)

# Looping through the validation reviews
for i, doc in enumerate(movie_reviews.valid.x):
    # Saving unigram features
    feature_counter = Counter(doc.data)
    j_indices.extend(feature_counter.keys())
    values.extend(feature_counter.values())
    this_doc_ngrams = list()
    m = 0
    for n in range(min_n, max_n + 1):
        for k in range(len(doc.data) - n + 1):
            ngram = doc.data[k: k + n]
            # Only keep n-grams already seen during training
            if str(ngram) in ngramtoi:
                this_doc_ngrams.append(ngramtoi[str(ngram)])
                m += 1
    ngram_counter = Counter(this_doc_ngrams)
    j_indices.extend(ngram_counter.keys())
    values.extend(ngram_counter.values())
    indptr.append(len(j_indices))

valid_ngram_doc_matrix = scipy.sparse.csr_matrix((values, j_indices, indptr),
                                                 shape=(len(indptr) - 1, len(ngramtoi)),
                                                 dtype=int)
```
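
With both matrices built, the classifier described at the start of this section can be trained. A minimal sketch, assuming scikit-learn (which the original code does not show) and using a toy matrix and made-up log-count ratios in place of the real `train_ngram_doc_matrix` and r:

```python
import numpy as np
import scipy.sparse
from sklearn.linear_model import LogisticRegression

def nb_feature_matrix(doc_matrix, r):
    """Binarise the counts, then scale each column by its log-count ratio."""
    binarised = (doc_matrix > 0).astype(float)
    return binarised.multiply(r)  # elementwise scaling per feature

# Toy stand-ins (NOT the real movie_reviews matrices): 4 documents x 3
# features, first two documents positive, last two negative.
train_counts = scipy.sparse.csr_matrix(
    np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 3, 1],
              [0, 0, 2]]))
train_labels = np.array([1, 1, 0, 0])
r = np.array([1.1, 0.0, -1.1])  # pretend log-count ratios per feature

clf = LogisticRegression()
clf.fit(nb_feature_matrix(train_counts, r), train_labels)
train_preds = clf.predict(nb_feature_matrix(train_counts, r))
```

The same `nb_feature_matrix` transform would then be applied to the validation matrix before calling `clf.predict` on it.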

#### Using SciPy to save and load the matrices and pickle to save the dictionaries

```python
# Saving and loading the scipy matrices
scipy.sparse.save_npz("train_ngram_matrix.npz", train_ngram_doc_matrix)
train_ngram_doc_matrix = scipy.sparse.load_npz("train_ngram_matrix.npz")

# Saving the dictionaries
with open('itongram.pickle', 'wb') as handle:
    pickle.dump(itongram, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('ngramtoi.pickle', 'wb') as handle:
    pickle.dump(ngramtoi, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Loading a dictionary back
with open('itongram.pickle', 'rb') as handle:
    itongram = pickle.load(handle)
```

Using a pair of dictionaries to convert between indices and strings is a common and useful trick!
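
For instance, the two dictionaries built above invert each other (the bigram and the ids here are made up for illustration):

```python
ngramtoi = {}  # n-gram (as string) -> integer id
itongram = {}  # integer id -> n-gram

ngram = [42, 7]  # a bigram of token ids
idx = 60003      # pretend next free id after the unigram vocabulary
ngramtoi[str(ngram)] = idx
itongram[idx] = ngram

# Round trips in both directions
assert itongram[ngramtoi[str(ngram)]] == ngram
assert ngramtoi[str(itongram[idx])] == idx
```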