oneminusp.com Computational Finance, Markets, Programming & co

13Feb/100

Entropy estimators and predictability

In previous posts I discussed the local uncertainty h_n^{(1)} and the block entropy H_n. We also saw the rapid decrease in H_n uncertainty -- this is due to sampling errors. With larger n our empirical probability estimate n_i / n gets worse because it would require more samples to "fill up the histogram", i.e. there's missing ngrams and the seen ngrams have a bad probability estimate.

There's a vast number of papers and techniques on reducing the bias and variance on entropy estimates and I decided to write a few posts about this, with the aim to find the best entropy estimators for our (local) uncertainty measure. With a suitable entropy estimator we will be able to analyse local predictabilities conditioned on larger number of previous symbols with higher significance.

The estimator we used so far is called "plug-in" or maximum likelihood estimator and is defined as

\hat{H}(X) = - \sum_X \hat{P}(x) log \hat{P}(x)

where \hat{P}(x) = n_x / n, so the number of occurrences of the word x in the whole space. It is well known that the MLE estimator is negatively biased. What does that mean?

Bias is the difference between the expected value of the estimator and the true value, i.e.

Bias[\hat{H}] = E[\hat{H}] - H

Thus it is a measure of how accurate the estimator is. The negative bias of MLE means that it underestimates the true entropy of a system. Let us check  if we can find this behaviour in data -- theory is nice but data is what counts.

We need to set up a system where we can easily check MLE with the true entropy, so it's easiest if we just sample from a uniform distribution with, e.g. 256 different observations, thus the true probability is 1/256 for each observation; it easily follows that the entropy is then just log(256).

The function biassim below calculates the bias through steps numbers of samples with replacement from a pool of 100'000 uniformly distributed integers between 0 and 256; it then calculates the entropy of the sample. Also notice here that the number of samples and number of possible words are equal, which is obviously  not the case in general. Finally, we subtract the true entropy value trueEV from the estimates.

biassim <- function(n, m, steps) {
	trueEV <- log2(m)
	uniformM<-round(runif(100000, min=0, max=m))
	f <- function(i) {
		sampleN <- sample(uniformM, n, replace = TRUE)
		probN <- sampleN / sum(sampleN)
                entropy(probN[probN > 0])
	}
	accu <- sapply(1:steps, f)
	return(accu - trueEV)
}

We apply the function above as follows:

> r <- biassim(256,256,2000)
> hist(r,breaks=100,xlim=0:-1,main="Bias and variance of MLE")

The histogram should like similar to the plot below.

Screen shot 2010-02-12 at 19.37.36

It shows the negative bias very nicely, with the red vertical line being the true value to be estimated.

If our naive MLE estimator has this systematic underestimation then surely we can easily fix this issue by "shifting" the estimates to the positive side? Yes and no. The simplest bias correction is known under the Miller-Maddow correction and it suggests adding (m - 1)/(2n) to the estimate. The only problem is that generally m, the number of possible words, is unknown, so it would have to be estimated as well.

In the next post we will apply the Miller-Maddow correction with a few other known estimators and analyse which one is most suitable for our purposes.

Comments (0) Trackbacks (0)

No comments yet.


Leave a comment


No trackbacks yet.