Naïve Bayes in NLP

The “naïve” part of NB is that we assume the probabilities of different words to be independent.

Derivation of Naïve Bayes

The normal calculation of Naïve Bayes (above) isn’t very scalable as the size of vocabulary increases. The probability of each word condition on whether it’s spam or not spam is between 0 and 1. As the vocabulary increases, you will start to multiply more of those probabilities together which makes the overall product starts to approach zero (numerical stability). When the number gets extremely small, it might not be able to be represented in a double or long double type variable.

Log-probabilities for Naïve Bayes

This is why we often represent probabilities as log-probabilities. This transformation helps as follows:

  1. Log-probability is always between -infinity to 0 (a much wider range than 0 and 1)

  2. If a < b, then so is ln(a) < ln(b) and so the comparison between probabilities and log-probabilities are the same

  3. The logarithm of a product is the sum of the logarithms as shown above and the resulting sums are manageable numbers

Numerical Stability

The problem of numerical stability is demonstrated above. This problem exist because math is continuous and infinite whereas computers are discrete and finite. This is connected to how computer store numbers (as shown below). Floats and doubles don’t necessarily cover all the real representations of number. As real number gets bigger, floats and doubles will start to use the closes floating point approximation in the system From your perspective, the number might be the same but in reality, it’s not and the difference is minimal. This could cause problem if you are looping through the same operations as that minimal difference error will propagate.

Cool computer science knowledge – speed of different types of memory

  • L1 cache reference 0.5 ns
  • L2 cache reference 7 ns
  • Main memory reference/RAM 100 ns
  • Send 2K bytes over 1 Gbps network 20,000 ns
  • Read 1 MB sequentially from memory 250,000 ns
  • Round trip within same datacenter 500,000 ns
  • Disk seek 10,000,000 ns
Ryan

Ryan

Data Scientist

Leave a Reply