Empirical probabilities Entropy function
Just added this simple but often useful entropy function which can be run on any data set with multiple occurrences of symbols. Say your data is the vector
c(1,1,1,2,2,2,3,3,3)
then
entropy.count
calculates the Shannon entropy using empirical/maximum likelihood probabilities for all unique symbols 1,2,3.
entropy.count <- function(entry) {
  # Count how often each unique symbol occurs
  counts <- lapply(split(entry, as.factor(entry)), length)
  counts <- unlist(counts)
  # Empirical (maximum likelihood) probabilities
  ps <- counts / sum(counts)
  entropy(ps)
}
The code is now also available through the "Papers / Code" tab above.
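As a quick check, here is a minimal, self-contained sketch of the same idea. It swaps in base R's table() for the split/lapply counting, and the name entropy.table is made up for this example; the entropy function is the vectorized one from the entropy post.

```r
# Plug-in Shannon entropy in bits; zero probabilities contribute 0
entropy <- function(ps) {
  -sum(ifelse(ps > 0, ps * log2(ps), 0))
}

# Hypothetical variant of entropy.count that counts symbols with table()
entropy.table <- function(entry) {
  counts <- table(entry)
  ps <- as.vector(counts) / sum(counts)
  entropy(ps)
}

# Three symbols, each occurring three times: entropy is log2(3), about 1.585 bits
h_uniform <- entropy.table(c(1, 1, 1, 2, 2, 2, 3, 3, 3))
print(h_uniform)

# A skewed sample over the same symbols has lower entropy
h_skewed <- entropy.table(c(1, 1, 1, 1, 1, 1, 2, 3, 3))
print(h_skewed)
```

The uniform case is the maximum for three symbols, so any other sample over 1,2,3 should come out below it.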
Local order and predictability: Significance testing
The two previous posts described an implementation of a paper about finding local order (return patterns with higher than average predictability of the next symbol) in financial time series.
One important question left unanswered so far is the significance of the local uncertainties. Does a deviation from almost no order (local uncertainty > 0.99) really mean something, or is it due to imprecise, undersampled empirical probabilities? As the original paper notes, the larger the value we choose for n, i.e. the more previous trading days we consider to predict the next one, the more n-grams are possible, and therefore the more samples we need to approximate the probabilities with reasonable accuracy.
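To make the sampling problem concrete, here is a rough sketch. It assumes an alphabet of 3 symbols (as in the implementation post) and a purely hypothetical series of 2500 trading days; neither number is taken from the paper.

```r
alphabet_size <- 3      # assumption: symbols 0, 1, 2
series_length <- 2500   # hypothetical number of trading days

n <- 1:8
# Number of distinct n-grams that can occur for each n
possible_ngrams <- alphabet_size^n
# Average number of observed windows per possible n-gram
avg_samples <- (series_length - n + 1) / possible_ngrams

print(data.frame(n, possible_ngrams, avg_samples = round(avg_samples, 1)))
```

The number of possible n-grams grows exponentially in n while the number of windows stays roughly constant, so the average number of samples per n-gram shrinks quickly.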
There are two ways to go:
- As in the original paper, use empirical probabilities and the basic plug-in entropy estimator, and restrict n to at most 5, as their significance level K dictates (more on that below)
- Experiment with larger n, using more sophisticated probability and entropy estimators
We will do both. For now, though, I'll concentrate on the significance level K as introduced in the paper. A so-called surrogate sequence of length n is generated from the partitioned time series. These surrogates have the same mean and standard deviation as the original sequence; you can think of a surrogate as a random shuffling of the sequence with some further rules. The local uncertainties are computed from the surrogates as well. The significance level K is then calculated as:
Local order and predictability – Implementation
Part 1 discussed a paper on local order and predictability of time series. I will now describe the implementation of the described functions in R.
First we assume that we already have our real returns data partitioned into symbols, with the number of distinct symbols being 3. Thus our time series is just a vector of the values 0, 1 and 2.
Next, all our functions will consider trajectories of that original vector. I will implement this as a sliding window of length n. So if our sequence is 012020120 the function slide will create the array 012, 120, 202, 020, 201, 012, 120 out of it.
slide <- function(seq, windowsize) {
  # Number of complete windows that fit into the sequence
  steps <- length(seq) - windowsize + 1
  accu <- array(0, dim = c(steps, windowsize))
  for (i in seq_len(steps)) {
    # Row i holds the window starting at position i
    accu[i, ] <- seq[i:(i + windowsize - 1)]
  }
  return(accu)
}
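For comparison, the same windows can be obtained without an explicit loop. The sketch below uses base R's embed(), which is a substitute for the loop-based function and not the implementation used in these posts; the variable names are illustrative only.

```r
x <- c(0, 1, 2, 0, 2, 0, 1, 2, 0)
n <- 3

# embed() returns each window with its columns in reverse time order,
# so the columns are flipped back with n:1
windows <- embed(x, n)[, n:1, drop = FALSE]

# Collapse each row into an n-gram string and count occurrences
ngrams <- apply(windows, 1, paste, collapse = "")
counts <- table(ngrams)

print(ngrams)
print(counts)
```

For the example sequence this yields the seven windows 012, 120, 202, 020, 201, 012, 120, and the table of counts is the raw material for the empirical n-gram probabilities.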
Calculating Entropy the Functional Way
Previously, I wrote a short article on how to implement fold left in R. It was fairly obvious that there must be a built-in function for it. At the time, I just assumed it would be called "reduce" or not exist at all; the proper function name, however, is "Reduce" with a capital R. (As a side note, I do not really understand the naming scheme of functions in the R base library.)
So here's the fairly obvious way to calculate Shannon's entropy in R using Reduce:
> fentropy <- function(x,y) { x + (-y * log2(y)) }
> Reduce(fentropy, c(0.5,0.5), 0)
[1] 1
> Reduce(fentropy, c(0.25,0.25,0.25,0.25), 0)
[1] 2
The first call is the binary case, with answer 1; the second is four uniformly distributed values, with answer 2.
Last but not least, we could also write an entropy function the "R way", using R's nice functions that work over whole vectors:
entropy <- function(ps) {
  H <- -sum(ifelse(ps > 0, ps * log2(ps), 0))
  return(H)
}
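One reason to prefer the vectorized version is how it treats zero probabilities. The following sketch reuses the two definitions from this post; the variable names via_reduce and via_vector are only for illustration.

```r
fentropy <- function(x, y) { x + (-y * log2(y)) }

entropy <- function(ps) {
  -sum(ifelse(ps > 0, ps * log2(ps), 0))
}

ps <- c(0.5, 0.5, 0)

# log2(0) is -Inf and 0 * -Inf is NaN, so the fold ends up as NaN
via_reduce <- Reduce(fentropy, ps, 0)
print(via_reduce)

# ifelse() replaces the zero-probability term by 0, giving 1 bit
via_vector <- entropy(ps)
print(via_vector)
```

The ifelse guard implements the usual convention that a symbol with probability 0 contributes nothing to the entropy.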
fold left in R
Left and right folds are frequently used higher-order functions in functional programming.
A left fold [foldleft f accu l] applies the function f to the accumulator variable accu and the head of the list l. The result is the new accumulator, which is used in the next recursive call together with the tail of the list.
A left fold is easily implemented in R as follows (a right fold works analogously):
foldleft <- function(f, accu, l) {
  if (length(l) == 0) {
    accu
  } else {
    head <- l[1]
    tail <- l[-1]
    foldleft(f, f(accu, head), tail)
  }
}
To see how it works, we could apply it to calculate the variance of a fair die. Remember the variance is just $\sum_i p_i (x_i - \mu)^2$, where $\mu$ is the mean and every $p_i$ is 1/6 for a fair die. This is implemented in the following function f:
mu <- sum(1:6) / 6  # mean of a fair die (named mu to avoid masking base::mean)
f <- function(accu, i) {
  accu + (1/6 * (i - mu)^2)
}
foldleft(f, 0, c(1, 2, 3, 4, 5, 6))
where our last call to foldleft evaluates to 2.916667, i.e. 35/12.
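The result can be cross-checked with base R. This is a small sketch using the built-in Reduce and a vectorized sum; the names faces, mu, via_reduce and via_vector are chosen for the example. Note that Reduce takes the list before the initial accumulator, unlike foldleft above.

```r
faces <- 1:6
mu <- mean(faces)

f <- function(accu, i) accu + (1/6) * (i - mu)^2

# Built-in left fold: Reduce(f, x, init)
via_reduce <- Reduce(f, faces, 0)

# Vectorized population variance
via_vector <- sum((faces - mu)^2) / 6

print(via_reduce)
print(via_vector)

# var() computes the sample variance (divides by n - 1), so it has to be
# rescaled by (n - 1) / n to match
print(var(faces) * 5 / 6)
```

All three values should agree at 35/12.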