Naïve Bayes

The log-count ratio r for each word f is:

r = log( (ratio of feature f in positive documents) / (ratio of feature f in negative documents) )

where the ratio of feature f in positive documents is the number of positive documents containing f divided by the total number of positive documents (and likewise for negative documents). Binarised Naïve Bayes explores the idea that, for classification, it may only matter whether a word is present in a review at all, rather than how often it occurs.
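As a rough illustration, here is how r could be computed with NumPy. This is a minimal sketch rather than the code used below: the names x (a binarised document-term matrix) and y (a 0/1 label array) are my assumptions, and I add +1 smoothing so that features absent from one class don't produce log(0).

import numpy as np

def log_count_ratio(x, y):
    # x: binarised document-term matrix (n_docs, n_features); y: 1 = positive, 0 = negative
    p = (x[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 1)  # ratio of each feature in positive docs
    q = (x[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 1)  # ratio of each feature in negative docs
    return np.log(p / q)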

Trigram with NB features

We use logistic regression with Naïve Bayes features to do sentiment classification. For every document (review), we compute binarised features for unigrams, bigrams, and trigrams, where each feature is weighted by its log-count ratio r. The process of creating the n-gram training and validation datasets is shown below. I have added comments to the code so that it is easy to follow (or at least I hope they make it easy to follow).

Defining Variables

import pickle
import scipy.sparse
from collections import Counter

# n-gram range: unigrams (1) up to trigrams (3)
min_n = 1
max_n = 3

# Building blocks of the CSR sparse matrix
j_indices = []   # column index of every stored count
indptr = [0]     # row boundaries: row i spans j_indices[indptr[i]:indptr[i+1]]
values = []      # the stored counts themselves

# vocab_len is the size of the unigram vocabulary (defined during tokenisation);
# new bigrams/trigrams will be assigned indices after it
num_tokens = vocab_len

# Mappings between n-grams and their feature indices
itongram = dict()   # index -> n-gram
ngramtoi = dict()   # n-gram (as a string) -> index

Creating N-gram features for training matrix

# Looping through the training reviews
for i, doc in enumerate(movie_reviews.train.x):
    # Saving unigram features: the token ids themselves are the column indices
    feature_counter = Counter(doc.data)
    j_indices.extend(feature_counter.keys())
    values.extend(feature_counter.values())
    this_doc_ngrams = list()

    # Slide a window over the tokens to derive the bigram and trigram features
    # (unigrams were already counted above, so the range starts at 2)
    for n in range(max(min_n, 2), max_n + 1):
        for k in range(len(doc.data) - n + 1):
            ngram = doc.data[k: k + n]
            if str(ngram) not in ngramtoi:
                # New n-gram: assign it the next free index after the unigram vocabulary
                ngramtoi[str(ngram)] = num_tokens
                itongram[num_tokens] = ngram
                num_tokens += 1
            this_doc_ngrams.append(ngramtoi[str(ngram)])

    # Saving the n-gram features (bigram and trigram) for this document
    ngram_counter = Counter(this_doc_ngrams)
    j_indices.extend(ngram_counter.keys())
    values.extend(ngram_counter.values())
    indptr.append(len(j_indices))

train_ngram_doc_matrix = scipy.sparse.csr_matrix((values, j_indices, indptr),
                                   shape=(len(indptr) - 1, num_tokens),
                                   dtype=int)
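
A quick sanity check, using only the names defined above, confirms the matrix shape and that the two dictionaries invert each other:

print(train_ngram_doc_matrix.shape)     # (number of training reviews, num_tokens)
example = itongram[num_tokens - 1]      # the last n-gram that was added
print(example, ngramtoi[str(example)])  # round-trips back to its index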

Creating N-gram features for validation matrix

j_indices = []
indptr = []
values = []
indptr.append(0)

for i, doc in enumerate(movie_reviews.valid.x):
    # Saving unigram features
    feature_counter = Counter(doc.data)
    j_indices.extend(feature_counter.keys())
    values.extend(feature_counter.values())
    this_doc_ngrams = list()

    # Only n-grams already seen during training get a column; unseen ones are skipped
    for n in range(max(min_n, 2), max_n + 1):
        for k in range(len(doc.data) - n + 1):
            ngram = doc.data[k: k + n]
            if str(ngram) in ngramtoi:
                this_doc_ngrams.append(ngramtoi[str(ngram)])

    ngram_counter = Counter(this_doc_ngrams)
    j_indices.extend(ngram_counter.keys())
    values.extend(ngram_counter.values())
    indptr.append(len(j_indices))


valid_ngram_doc_matrix = scipy.sparse.csr_matrix((values, j_indices, indptr),
                                   shape=(len(indptr) - 1, num_tokens),
                                   dtype=int)
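
With both matrices built, we can fit the classifier this section is named after. The snippet below is a minimal sketch under stated assumptions rather than a definitive implementation: it assumes scikit-learn is available, that 0/1 label arrays train_y and valid_y exist (they are not constructed above), and it uses +1 smoothing in the log-count ratio. Scaling the binarised features by r before fitting is the NB-SVM trick of Wang and Manning (2012), with logistic regression in place of the SVM.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Binarise: keep only the presence/absence of each n-gram
x_train = (train_ngram_doc_matrix > 0).astype(np.float64)
x_valid = (valid_ngram_doc_matrix > 0).astype(np.float64)

# Log-count ratio per feature, with +1 smoothing (train_y assumed to be a 0/1 numpy array)
p = np.asarray(x_train[train_y == 1].sum(axis=0)) + 1
q = np.asarray(x_train[train_y == 0].sum(axis=0)) + 1
r = np.log((p / ((train_y == 1).sum() + 1)) / (q / ((train_y == 0).sum() + 1)))

# Scale the binarised features by r and fit the logistic regression
clf = LogisticRegression(max_iter=1000)
clf.fit(x_train.multiply(r), train_y)
print("validation accuracy:", clf.score(x_valid.multiply(r), valid_y))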

Use SciPy to save and load the matrices, and pickle to save the dictionaries

# Saving and loading scipy matrices
scipy.sparse.save_npz("train_ngram_matrix.npz", train_ngram_doc_matrix)
train_ngram_doc_matrix = scipy.sparse.load_npz("train_ngram_matrix.npz")

# Saving the dictionaries (the filename for ngramtoi is my choice)
with open('itongram.pickle', 'wb') as handle:
    pickle.dump(itongram, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('ngramtoi.pickle', 'wb') as handle:
    pickle.dump(ngramtoi, handle, protocol=pickle.HIGHEST_PROTOCOL)
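
Loading them back is the mirror image, with standard pickle usage:

with open('itongram.pickle', 'rb') as handle:
    itongram = pickle.load(handle)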

Using dictionaries to convert between indices and strings is a common & useful hack!

Ryan

Data Scientist