Entropy estimators and predictability
In previous posts I discussed the local uncertainty and the block entropy
. We also saw the rapid decrease in
uncertainty -- this is due to sampling errors. With larger n our empirical probability estimate
gets worse because it would require more samples to "fill up the histogram", i.e. there's missing ngrams and the seen ngrams have a bad probability estimate.
There's a vast number of papers and techniques on reducing the bias and variance on entropy estimates and I decided to write a few posts about this, with the aim to find the best entropy estimators for our (local) uncertainty measure. With a suitable entropy estimator we will be able to analyse local predictabilities conditioned on larger number of previous symbols with higher significance.
The estimator we used so far is called "plug-in" or maximum likelihood estimator and is defined as
where , so the number of occurrences of the word x in the whole space. It is well known that the MLE estimator is negatively biased. What does that mean?
Bias is the difference between the expected value of the estimator and the true value, i.e.
Thus it is a measure of how accurate the estimator is. The negative bias of MLE means that it underestimates the true entropy of a system. Let us check if we can find this behaviour in data -- theory is nice but data is what counts.
We need to set up a system where we can easily check MLE with the true entropy, so it's easiest if we just sample from a uniform distribution with, e.g. 256 different observations, thus the true probability is for each observation; it easily follows that the entropy is then just
.
The function biassim below calculates the bias through steps numbers of samples with replacement from a pool of 100'000 uniformly distributed integers between 0 and 256; it then calculates the entropy of the sample. Also notice here that the number of samples and number of possible words are equal, which is obviously not the case in general. Finally, we subtract the true entropy value trueEV from the estimates.
biassim <- function(n, m, steps) {
trueEV <- log2(m)
uniformM<-round(runif(100000, min=0, max=m))
f <- function(i) {
sampleN <- sample(uniformM, n, replace = TRUE)
probN <- sampleN / sum(sampleN)
entropy(probN[probN > 0])
}
accu <- sapply(1:steps, f)
return(accu - trueEV)
}
We apply the function above as follows:
> r <- biassim(256,256,2000) > hist(r,breaks=100,xlim=0:-1,main="Bias and variance of MLE")
The histogram should like similar to the plot below.

It shows the negative bias very nicely, with the red vertical line being the true value to be estimated.
If our naive MLE estimator has this systematic underestimation then surely we can easily fix this issue by "shifting" the estimates to the positive side? Yes and no. The simplest bias correction is known under the Miller-Maddow correction and it suggests adding to the estimate. The only problem is that generally m, the number of possible words, is unknown, so it would have to be estimated as well.
In the next post we will apply the Miller-Maddow correction with a few other known estimators and analyse which one is most suitable for our purposes.