oneminusp.com Computational Finance, Markets, Programming & co

29 Jan 2010

Local order and predictability: Significance testing

The two previous posts described an implementation of a paper about finding local order (return patterns with higher than average predictability of the next symbol) in financial time series.

One important question left unanswered so far concerns the significance of the local uncertainties h_n(A_1 \dots A_n). Does a deviation from almost no order (values > 0.99) really mean something, or is it due to imprecise, undersampled empirical probabilities? As the original paper notes, the larger we choose n, i.e. the more previous trading days we consider to predict the next one, the more n-grams are possible, and therefore the more samples we need to approximate the probabilities p^{(n)} reasonably accurately.
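To get a feeling for how quickly the sample requirement grows, here is a small sketch. The alphabet size of 3 is the one used in this series; the number of observed windows (n_samples) is a made-up figure chosen only for illustration.

```r
lambda    <- 3      # alphabet size, as used in this series
n         <- 1:8    # number of previous days in the pattern
n_samples <- 5000   # hypothetical number of observed windows (an assumption)

# estimating P(next symbol | n-gram) means counting words of length n + 1
contexts <- lambda^n
words    <- lambda^(n + 1)

growth <- data.frame(n = n,
                     possible_ngrams  = contexts,
                     samples_per_word = round(n_samples / words, 1))
print(growth)
```

With these assumed numbers the average count per word drops below 10 at n = 5, which gives a rough idea of why the empirical probabilities become unreliable for long patterns.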

There are two ways to go:

  • As in the original paper, use empirical probabilities and the basic plug-in entropy estimator, and restrict n to at most 5, as their significance level K dictates (more on that below)
  • Experiment with larger n, using more sophisticated probability and entropy estimators

We will do both. But for now I'll concentrate on the significance level K as introduced in the paper. A so-called surrogate sequence of length n is generated out of the partitioned time series. These surrogates have the same mean and standard deviation as the original sequence; you can think of them as a random shuffling of the sequence with some further rules. The local uncertainties from the surrogates are called h_n^S(A_1 \dots A_n). The significance level K is then calculated as:

K_n(A_1 \dots A_n) = \vert \frac{h_n(A_1 \dots A_n) - \langle h_n^S(A_1 \dots A_n) \rangle}{\sigma_{h_n^S}}\vert

where \langle h_n^S \rangle is the mean and \sigma_{h_n^S} is the standard deviation of the local uncertainties obtained from the surrogates. Intuitively, this is the difference between the actual uncertainty and the mean randomised one, in units of the standard deviation of the surrogates. If the surrogate values are roughly normally distributed, a K value larger than 2 corresponds to a confidence level greater than 95%. Because the confidence level approaches 100% very quickly as K grows, we'll just say: the bigger K, the better. The test is taken from Nature 379, 618, "Characterisation of Low-Dimensional Dynamics in the Crayfish Caudal Photoreceptor".
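To make the definition concrete, here is a minimal, self-contained sketch on toy data. It makes several assumptions that are not from the paper: the surrogates are plain random shuffles via sample() (which keep mean and standard deviation, but ignore any "further rules"), the statistic is a global plug-in conditional entropy of the next symbol given the previous one rather than the per-pattern h_n, and the helper cond_entropy and the toy sequence are made up for this example.

```r
set.seed(42)  # reproducible toy example

# Plug-in conditional entropy H(next | previous) in units of log base lambda,
# so a value of 1 means "no order" for an alphabet 0..(lambda - 1).
cond_entropy <- function(s, lambda = 3) {
  lev   <- 0:(lambda - 1)
  joint <- unclass(table(factor(head(s, -1), levels = lev),
                         factor(tail(s, -1), levels = lev)))
  p_joint <- joint / sum(joint)
  p_prev  <- rowSums(p_joint)
  p_cond  <- p_joint / p_prev   # row i holds P(next | previous = i)
  keep    <- p_joint > 0
  -sum(p_joint[keep] * log(p_cond[keep], base = lambda))
}

# toy sequence with built-in order: each symbol tends to repeat the previous one
n <- 500
s <- integer(n)
s[1] <- 0L
for (t in 2:n) {
  s[t] <- if (runif(1) < 0.7) s[t - 1] else sample(0:2, 1)
}

h_obs <- cond_entropy(s)                           # uncertainty of the real data
h_sur <- replicate(100, cond_entropy(sample(s)))   # uncertainties of shuffled surrogates

# deviation from the surrogate mean in units of the surrogate standard deviation
K <- abs((h_obs - mean(h_sur)) / sd(h_sur))

# two-sided confidence level under a normal assumption; K = 2 gives about 95.4%
confidence <- 2 * pnorm(K) - 1
print(c(h_obs = h_obs, mean_sur = mean(h_sur), K = K, confidence = confidence))
```

Because shuffling destroys the temporal order but keeps the symbol frequencies, the surrogate uncertainties stay close to 1 while the toy sequence scores much lower, so K comes out far above 2.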

It is not completely clear how this should be implemented. For now, I decided to create 5 surrogate sequences of length n and build the definition above from them.

K <- function(seq, slideseq, ngram) {
	# local uncertainty of ngram in each of 5 surrogate sequences generated from seq
	# (the alphabet size is fixed to 3)
	hs <- replicate(5, h_n_cond(surrogate(seq, length(ngram)), ngram, 3))

	s <- sd(hs)
	m <- mean(hs)
	# local uncertainty of ngram in the original data
	p <- h_n_cond(slideseq, ngram, 3)
	# absolute deviation from the surrogate mean, in units of the surrogate standard deviation
	return( abs( (p - m) / s) )
}

The alphabet size is fixed to 3 in the code, but you can change that easily.

Now I'd like to slide a window of 4-grams over the last 200 DJI closes and calculate the local uncertainties and their corresponding K values. The following function does just that:

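localallinfo <- function(seq, slideseq, lambda, limit=0:0) {
  len <- dim(slideseq)[1]
  # by default process every row; a limit with more than one element selects a subset of rows
  range <- 1:len
  if(length(limit) > 1) {
    range <- limit
  }
  # one row per processed n-gram: 2 columns for the h_n_cond value and K,
  # plus dim(slideseq)[2] columns for the n-gram itself
  v <- array(dim=c(length(range), 2 + dim(slideseq)[2]))
  for(j in seq_along(range)) {
    i <- range[j]
    # index the result by position j, so a limit that does not start at 1 stays in bounds
    v[j,] <- c(h_n_cond(slideseq, slideseq[i,], lambda),
               K(seq, slideseq, slideseq[i,]), slideseq[i,])
  }
  return(v)
}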
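The slideseq argument is indexed as a matrix with one n-gram per row. How it is actually built belongs to the earlier posts; as one possible way, assuming a plain vector of symbols, base R's embed() produces such a sliding-window matrix. The toy vector symbols is made up for this example.

```r
# toy symbol sequence over the alphabet {0, 1, 2} (made up for illustration)
symbols <- c(0, 2, 1, 1, 1, 0)

# embed() returns each window with the newest value first,
# so reverse the columns to get the n-grams in time order
n <- 4
windows <- embed(symbols, n)[, n:1, drop = FALSE]
print(windows)
```

Each row of windows is one 4-gram, and there are length(symbols) - n + 1 of them, here 3.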

localallinfo returns a (2+n)-column array with the values (local uncertainty, K value, n-gram). With its return value stored in the variable x, we can query all the results with high predictability:

> x[x[,1] < 0.932,]
          [,1]      [,2] [,3] [,4] [,5] [,6]
[1,] 0.9308985 30.705164    0    2    1    1
[2,] 0.9308985 13.576514    0    2    1    1
[3,] 0.9246361  6.738199    1    1    2    1
[4,] 0.9308985  7.542447    0    2    1    1
[5,] 0.9246361 13.721803    1    1    2    1
[6,] 0.9245277  9.663806    1    1    1    1

Very satisfying to see this develop so nicely. What we have here are the patterns, their local uncertainties and their K values, all quite a bit bigger than 2. We can read entry 1 as follows:

The ngram pattern 0 2 1 1 has a local uncertainty of 93% with a significance level K of 30.7
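Since the same pattern shows up several times with different K values, it can help to filter on both columns and summarise per pattern. The following is a small sketch that rebuilds the six rows printed above by hand; the variable names are my own, and the threshold of 2 for K is the rule of thumb mentioned earlier.

```r
# the six rows from the query above: (local uncertainty, K, 4-gram)
x <- rbind(
  c(0.9308985, 30.705164, 0, 2, 1, 1),
  c(0.9308985, 13.576514, 0, 2, 1, 1),
  c(0.9246361,  6.738199, 1, 1, 2, 1),
  c(0.9308985,  7.542447, 0, 2, 1, 1),
  c(0.9246361, 13.721803, 1, 1, 2, 1),
  c(0.9245277,  9.663806, 1, 1, 1, 1)
)

# keep rows that are both predictable and significant
sel <- x[x[, 1] < 0.932 & x[, 2] > 2, , drop = FALSE]

# collapse the n-gram columns into a label and summarise per pattern
pattern <- apply(sel[, 3:6, drop = FALSE], 1, paste, collapse = " ")
counts  <- table(pattern)                 # how often each pattern occurs
min_K   <- tapply(sel[, 2], pattern, min) # most conservative K per pattern
print(counts)
print(min_K)
```

Looking at the smallest K per pattern is a cautious choice: the surrogates are random, so the K value of one and the same pattern varies from call to call.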

Thanks to the R package ggplot2, we generate the graph below with the commands:

> z<-data.frame(A=x[,1],B=1:200,K=x[,2])
> d<-qplot(B,A,data=z,colour=K,xlab="Time",ylab="Local Uncertainty")
> d + scale_colour_gradient(limits=c(1, 15), low="yellow",high="red")

[Figure: local uncertainty over time, color-coded by significance level K]

This is the same data as the plot in the previous post, now color-coded with the significance level. We can see the trend that lower uncertainties have higher K values, just as described in the original paper.

In the next post I'll try to play a bit more with graphical representations, K levels, and then will move on to different entropy estimators and applying our code to more time series.
