Entropy estimators and predictability
In previous posts I discussed the local uncertainty and the block entropy. We also saw a rapid decrease in uncertainty; this is due to sampling errors. With larger n our empirical probability estimate
gets worse because more samples are needed to "fill up the histogram", i.e. some n-grams are missing entirely and the n-grams that were seen have poor probability estimates.
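To make the "fill up the histogram" problem concrete, here is a minimal sketch. It assumes a hypothetical random binary sequence of 1000 symbols (not the data from the earlier posts) and counts how many of the possible n-grams actually show up as n grows. The helper name `ngram_counts` is made up for this illustration.

```python
import random
from collections import Counter


def ngram_counts(seq, n):
    """Count overlapping n-grams (as tuples) in a sequence."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


random.seed(0)
# Assumption: an i.i.d. fair-coin sequence stands in for real data.
seq = [random.randint(0, 1) for _ in range(1000)]

for n in (1, 2, 4, 8, 12, 16):
    counts = ngram_counts(seq, n)
    possible = 2 ** n  # number of distinct binary n-grams
    print(f"n={n:2d}: seen {len(counts):4d} of {possible:6d} possible n-grams "
          f"({len(counts) / possible:.1%})")
```

The sequence contains fewer than 1000 n-grams in total, so once the number of possible n-grams exceeds that, most histogram bins must stay empty and the occupied ones typically hold only one or two counts.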
There's a vast number of papers and techniques on reducing the bias and variance of entropy estimates, and I decided to write a few posts about this, with the aim of finding the best entropy estimators for our (local) uncertainty measure. With a suitable entropy estimator we will be able to analyse local predictabilities conditioned on a larger number of previous symbols with higher significance.
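The estimator we have used so far is called the "plug-in" or maximum likelihood estimator (MLE) and is defined as

$$\hat{H}_{\mathrm{MLE}} = -\sum_{x} \hat{p}(x) \log \hat{p}(x)$$

where $\hat{p}(x) = n_x / N$, so $n_x$ is the number of occurrences of the word $x$ in the whole sample and $N$ is the total number of words observed. It is well known that the MLE is negatively biased. What does that mean?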
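As a rough numerical illustration of that claim, the sketch below computes the plug-in estimate on many small samples and compares their average with the true entropy. The setup is an assumption chosen for illustration: a uniform source over 16 symbols (true entropy 4 bits) and only 20 samples per trial. The function name `plugin_entropy` is hypothetical.

```python
import math
import random
from collections import Counter


def plugin_entropy(samples):
    """Plug-in (maximum likelihood) entropy estimate, in bits."""
    total = len(samples)
    counts = Counter(samples)
    # Relative frequencies c / total are plugged into the entropy formula.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


random.seed(1)
alphabet_size = 16                       # assumed uniform source
true_entropy = math.log2(alphabet_size)  # 4 bits
sample_size = 20                         # deliberately small sample
trials = 2000

# Average the estimate over many independent small samples.
mean_estimate = sum(
    plugin_entropy([random.randrange(alphabet_size) for _ in range(sample_size)])
    for _ in range(trials)
) / trials

print(f"true entropy:          {true_entropy:.3f} bits")
print(f"mean plug-in estimate: {mean_estimate:.3f} bits")
```

In this setup the averaged estimate comes out below the true value. Individual estimates scatter, but their mean is too low, which is what a negative bias refers to.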