 ### Naïve Bayes

The log-count ratio r for each word f is:

r = log( (ratio of f in positive documents) / (ratio of f in negative documents) )

whereby the ratio of feature f in positive documents is the number of positive documents that contain feature f, divided by the total number of positive documents (and likewise for negative documents). Binarised Naïve Bayes explores the idea that, for classification, it may only matter whether a word is present in the review at all, rather than how often it occurs.
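As a minimal sketch of this ratio on a toy binarised document-term matrix (add-one smoothing is my own assumption here, to avoid log(0) for unseen features; `log_count_ratio` is a hypothetical helper name):

```python
import numpy as np

def log_count_ratio(X, y, alpha=1.0):
    """Log-count ratio r per feature of a binarised document-term matrix X.

    X: (n_docs, n_features) 0/1 array; y: 0/1 labels (1 = positive).
    alpha: additive smoothing so unseen features do not give log(0).
    """
    pos, neg = X[y == 1], X[y == 0]
    p_ratio = (pos.sum(axis=0) + alpha) / (pos.shape[0] + alpha)  # ratio of f in positive docs
    q_ratio = (neg.sum(axis=0) + alpha) / (neg.shape[0] + alpha)  # ratio of f in negative docs
    return np.log(p_ratio / q_ratio)

# Toy corpus: rows are documents, columns are words; 1 = word present.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]])
y = np.array([1, 1, 0])
r = log_count_ratio(X, y)
# Word 0 appears in both positive docs and neither negative doc, so r[0] > 0.
```

A positive r marks a word as evidence for the positive class, a negative r as evidence against it.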

### Trigram with NB features

Here we use logistic regression with Naïve Bayes features to do sentiment classification. For every document (review), we compute binarised features over unigrams, bigrams, and trigrams, and each feature is weighted by its log-count ratio r. The process of creating the n-gram training and validation datasets is shown below. I have added comments to the code so that it is easy to follow (or at least I hope it will make it easier to follow).

#### Defining Variables

```python
from collections import Counter

import scipy.sparse

# n-gram range: unigrams (1) up to trigrams (3)
min_n = 1
max_n = 3

# Components for building a CSR sparse matrix incrementally
j_indices = []      # column index of every stored count
indptr = [0]        # row boundaries: row i owns entries indptr[i]:indptr[i+1]
values = []         # the counts themselves

# New n-grams are assigned ids after the existing unigram vocabulary
num_tokens = vocab_len

itongram = dict()   # id -> n-gram
ngramtoi = dict()   # n-gram (as string) -> id
```

#### Creating N-gram features for training matrix

```python
# Looping through the training reviews
for i, doc in enumerate(movie_reviews.train.x):
    # Saving unigram features
    feature_counter = Counter(doc.data)
    j_indices.extend(feature_counter.keys())
    values.extend(feature_counter.values())
    this_doc_ngrams = list()

    # Loop through the defined minimum and maximum n-gram range,
    # sliding a window of size n over the document tokens
    for n in range(min_n, max_n + 1):
        for k in range(len(doc.data) - n + 1):
            ngram = doc.data[k: k + n]
            if str(ngram) not in ngramtoi:
                if len(ngram) == 1:
                    # Unigrams keep their existing vocabulary id
                    num = ngram[0]
                    ngramtoi[str(ngram)] = num
                    itongram[num] = ngram
                else:
                    # Bigrams and trigrams get fresh ids after the vocabulary
                    ngramtoi[str(ngram)] = num_tokens
                    itongram[num_tokens] = ngram
                    num_tokens += 1
            this_doc_ngrams.append(ngramtoi[str(ngram)])

    # Saving defined n-gram features (bigram and trigram) and
    # closing off this document's row
    ngram_counter = Counter(this_doc_ngrams)
    j_indices.extend(ngram_counter.keys())
    values.extend(ngram_counter.values())
    indptr.append(len(j_indices))

# Column count is num_tokens (unigram vocab + new n-gram ids); duplicate
# column entries within a row are summed by the CSR matrix
train_ngram_doc_matrix = scipy.sparse.csr_matrix((values, j_indices, indptr),
                                                 shape=(len(indptr) - 1, num_tokens),
                                                 dtype=int)
```
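The `(values, j_indices, indptr)` triple used above is the raw CSR layout: row i owns the slice `indptr[i]:indptr[i+1]` of both `j_indices` and `values`. A tiny standalone illustration:

```python
import scipy.sparse

# Two documents over a 4-column vocabulary:
# doc 0 contains feature 0 twice and feature 2 once;
# doc 1 contains feature 1 three times.
values    = [2, 1, 3]
j_indices = [0, 2, 1]
indptr    = [0, 2, 3]   # doc 0 -> entries 0:2, doc 1 -> entries 2:3

m = scipy.sparse.csr_matrix((values, j_indices, indptr), shape=(2, 4), dtype=int)
print(m.toarray())
# [[2 0 1 0]
#  [0 3 0 0]]
```

This is why `indptr.append(len(j_indices))` runs once per document: it records where that document's entries end.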

#### Creating the validation matrix

```python
# Reset the CSR components for the validation set
j_indices = []
indptr = [0]
values = []

for i, doc in enumerate(movie_reviews.valid.x):
    feature_counter = Counter(doc.data)
    j_indices.extend(feature_counter.keys())
    values.extend(feature_counter.values())
    this_doc_ngrams = list()

    for n in range(min_n, max_n + 1):
        for k in range(len(doc.data) - n + 1):
            ngram = doc.data[k: k + n]
            # Only keep n-grams already seen during training
            if str(ngram) in ngramtoi:
                this_doc_ngrams.append(ngramtoi[str(ngram)])

    ngram_counter = Counter(this_doc_ngrams)
    j_indices.extend(ngram_counter.keys())
    values.extend(ngram_counter.values())
    indptr.append(len(j_indices))

# Same number of columns as the training matrix so the features line up
valid_ngram_doc_matrix = scipy.sparse.csr_matrix((values, j_indices, indptr),
                                                 shape=(len(indptr) - 1, num_tokens),
                                                 dtype=int)
```
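With both count matrices built, the NB-feature step described at the top of this section is to binarise them and scale each column by its log-count ratio r before fitting logistic regression. A hypothetical sketch of that transform (`nb_features` is my own helper name, and the toy `counts`/`y` stand in for the real matrices and labels):

```python
import numpy as np
import scipy.sparse

def nb_features(counts, y, alpha=1.0):
    """Binarise a count matrix and weight each feature by its log-count ratio."""
    X = scipy.sparse.csr_matrix((counts > 0).astype(np.float64))  # presence only
    p = np.asarray(X[y == 1].sum(axis=0)).ravel() + alpha  # positive docs containing f
    q = np.asarray(X[y == 0].sum(axis=0)).ravel() + alpha  # negative docs containing f
    r = np.log((p / ((y == 1).sum() + alpha)) / (q / ((y == 0).sum() + alpha)))
    # Scale every present feature by its log-count ratio
    return scipy.sparse.csr_matrix(X.multiply(r)), r

# Toy stand-ins for train_ngram_doc_matrix and the review labels
counts = scipy.sparse.csr_matrix([[2, 0, 1],
                                  [1, 1, 0],
                                  [0, 3, 1]])
y = np.array([1, 1, 0])
X_nb, r = nb_features(counts, y)
```

`X_nb` (and the validation matrix transformed with the same training-set `r`) is then what the logistic regression is fitted on.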

#### Using scipy to save and load the matrices and pickle to save the dictionaries

```python
# Saving and loading scipy matrices
scipy.sparse.save_npz("train_ngram_matrix.npz", train_ngram_doc_matrix)
```