Cointegration and Total-Least-Squares Regression
I just stumbled over a very nice article authored by Paul Teetor on the use of total least squares regression instead of ordinary least squares regression for cointegration tests. This blog post explains the same topic.
Let’s quickly recall what we do when trying to find a working pairs trading strategy. First, we use one stock price time series to estimate another stock price time series. The linear regression model is A = intercept + beta * B, where A and B are both time series containing stock prices. From the regression estimation we retrieve the hedge ratio (also called beta or the regression slope) and an intercept. We can then compute the spreads S as S = A - (intercept + beta * B). Finally, we use an augmented Dickey-Fuller (ADF) test to find out whether the spreads are stationary. If they are, this implies that both stocks are in fact cointegrated: gaps between the two stock prices will not persist indefinitely, but close after some time.
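The steps above can be sketched in a few lines of R. This is only a minimal illustration on simulated prices, not real market data: the series, the seed and the coefficients 10 and 1.5 are made up, and the ADF test assumes that the tseries package is installed.

```r
# Simulated prices (made up for illustration): B is a random walk,
# A tracks B plus stationary noise.
set.seed(42)
n <- 500
B <- 50 + cumsum(rnorm(n))
A <- 10 + 1.5 * B + rnorm(n)

# OLS regression of A on B
fit <- lm(A ~ B)
intercept <- unname(coef(fit)[1])
beta <- unname(coef(fit)[2])  # hedge ratio

# Spread: the part of A that is not explained by B
S <- A - (intercept + beta * B)

# ADF test on the spread (assumption: tseries is available)
if (requireNamespace("tseries", quietly = TRUE)) {
  print(tseries::adf.test(S))
}
```

Because the model includes an intercept, S is simply the vector of regression residuals here.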
In R you can use the lm function to create a linear regression estimation: lm(StockA ~ StockB), or lm(StockA ~ StockB + 0) if you want to explicitly set the intercept to 0. The lm function, however, relies on the ordinary least squares method, and both articles linked above do a nice job explaining why this is somewhat problematic. In short, we would like one hedge ratio to be the reciprocal of the other when we switch the dependent and the independent variable. However, this is not the case with ordinary least squares regression. This is from Teetor’s article:
“OLS” of course means “ordinary least squares regression”, whereas “TLS” means “total least squares regression”. He goes on to write:
The OLS hedge ratios are inconsistent because 1 / -0.658 = -1.520, not -1.035, which is a substantial difference. The TLS ratios, however, are consistent because 1 / -0.761 = -1.314.
The reason becomes immediately clear when you look at the pictures provided in the linked articles. In TLS regression the residuals are measured orthogonally to the regression line, whereas in OLS regression they are measured along the axis of the dependent variable: parallel to the stock A axis if A is the dependent and B the independent variable, or parallel to the stock B axis if B is the dependent and A the independent variable. With OLS regression, “long A and short B” is not the opposite of “long B and short A”. This only becomes the case if we use TLS regression.
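To see the asymmetry for yourself, you can run both OLS regressions and multiply the two slopes. The following is a small sketch on simulated data (the series and all numbers are invented for illustration). For consistent hedge ratios the product would be 1, but with OLS it equals the squared correlation of the two series, which is below 1 unless the fit is perfect.

```r
# Simulated prices (made up for illustration)
set.seed(1)
n <- 500
B <- 50 + cumsum(rnorm(n))
A <- 10 + 1.5 * B + rnorm(n, sd = 3)

beta_ab <- unname(coef(lm(A ~ B))[2])  # hedge ratio with A as dependent variable
beta_ba <- unname(coef(lm(B ~ A))[2])  # hedge ratio with B as dependent variable

# For consistent hedge ratios this product would be 1.
# With OLS it equals the squared correlation of A and B.
print(beta_ab * beta_ba)
print(cor(A, B)^2)
```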
But how then can we calculate a total least squares regression in R? Unfortunately, the lm function does not provide such functionality. Instead, Teetor suggests using R’s principal component analysis function princomp.
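Here is a minimal sketch of how this might look. It is my own illustration on simulated data, not Teetor’s original code: the helper name tls_fit and the simulated series are hypothetical. The idea is that the TLS line runs through the means of both series in the direction of the first principal component, so the hedge ratio is the ratio of the two loadings on that component.

```r
# Simulated prices (made up for illustration)
set.seed(1)
n <- 500
B <- 50 + cumsum(rnorm(n))
A <- 10 + 1.5 * B + rnorm(n, sd = 3)

# Hypothetical helper: TLS fit of y = intercept + beta * x,
# taken from the first principal component of (y, x).
tls_fit <- function(y, x) {
  pca <- princomp(cbind(y, x))
  v <- unclass(pca$loadings)[, 1]  # direction of the first component
  beta <- unname(v[1] / v[2])      # slope dy/dx along that direction
  intercept <- unname(pca$center[1] - beta * pca$center[2])
  list(beta = beta, intercept = intercept)
}

ab <- tls_fit(A, B)  # A as the "dependent" series
ba <- tls_fit(B, A)  # roles switched

# The two hedge ratios are reciprocals of each other,
# so the product is 1 up to floating-point error.
print(ab$beta * ba$beta)

# Spread based on the TLS hedge ratio
S <- A - (ab$intercept + ab$beta * B)
```

The sign of the loadings returned by princomp is arbitrary, but it cancels in the ratio, so the hedge ratio is not affected.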
The resulting spreads may or may not differ much from those calculated through an OLS regression model. There are a few more details in Teetor’s paper, and I really recommend reading through it. In case you want to understand how principal component analysis actually works, there is another good article written by Lindsay Smith.