<h1 id="macbook-pro-for-llms-buyers-guide-in-january-2024">Macbook Pro for LLMs - Buyer's Guide in January 2024</h1>
<p>2024-01-16 · <a href="http://fabian-kostadinov.github.io/2024/01/16/macbook-pro-for-llms-buyers-guide-in-january-2024">http://fabian-kostadinov.github.io/2024/01/16/macbook-pro-for-llms-buyers-guide-in-january-2024</a></p>
<p>Given my old Macbook Pro with an Intel i7 is ageing (and quite well, I might say), I was contemplating buying a new Macbook Pro with an M2 or M3 processor. Generally speaking, my needs are:</p>
<ol>
<li>software development (e.g. Docker),</li>
<li>video recording (e.g. OBS Studio) and light video editing,</li>
<li>playing around with LLMs running locally, mainly to learn and have fun; for more serious work I’d switch to e.g. Google Colab or Huggingface and rent dedicated GPUs.</li>
</ol>
<p>I am not much worried about #1 and #2, but running LLMs? I tend to spend a lot of money on a Macbook Pro, but then keep it for quite a few years before making the next purchase. So, which model should I focus on?
<span class="more"></span></p>
<p>The main questions for me are:</p>
<ol>
<li>Is buying a Macbook Pro even a good idea for playing with LLMs?</li>
<li>Which processor?</li>
<li>How much memory?</li>
<li>How much is it worth spending?</li>
</ol>
<p>(Note: disk space is not the main concern for me, but I don’t want less than 1 TB these days.)</p>
<p>Here’s what I learned from doing a lot of research.</p>
<h2 id="is-buying-a-macbook-pro-even-a-good-idea-for-playing--inferencing-with-llms">Is buying a Macbook Pro even a good idea for playing (= inferencing) with LLMs?</h2>
<p>If you are only occasionally running the LLM, then yes: you may consider buying a Macbook Pro. With <a href="https://github.com/ggerganov/llama.cpp">Llama.cpp</a> and/or <a href="https://lmstudio.ai">LM Studio</a> the model can make use of the power of the MX processors. Make sure you understand quantization of LLMs, though. Quantization refers to the process of using fewer bits per model parameter. This both decreases the model size and leads to some quality loss; how much exactly is a matter of ongoing research. Quantization is necessary to make larger language models accessible on consumer-grade GPUs; without it you would require substantially more hardware resources (particularly memory, but also more GPU cores).</p>
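<p>As an illustration, converting a 16-bit model file to 4-bit quantization with Llama.cpp looks roughly like this (a hedged sketch following the llama.cpp README at the time of writing; the file paths are made up):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># build llama.cpp first, then shrink the f16 GGUF file to a Q4_0 variant
./quantize ./models/llama-2-7b/ggml-model-f16.gguf ./models/llama-2-7b/ggml-model-q4_0.gguf q4_0
</code></pre></div></div>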
<p>If you intend to do serious work or development, or run the LLM as a server in the background more continuously, then no: buying a Macbook Pro (or any laptop, for that matter) is not the best idea. LLMs drain your battery and heat up the processor, which is not optimal for a laptop in the long run. In this case, you probably want to consider at least an Apple Mac Studio instead, or buy a Windows PC with a modern Nvidia GPU. The Nvidia GPU will use a lot more electricity per hour, but you get significantly more inferencing power in return.</p>
<p>And, just to be clear: Training a serious LLM from scratch is not possible on consumer-grade hardware. You need a very powerful GPU or GPU cluster. You can get those through Google Colab or Huggingface as paid services. We are talking about inferencing here only, or at best fine-tuning of a small model.
If you want to experiment with actual training from scratch, there are some toy examples like <a href="https://github.com/zozoheir/tinyllm">TinyLLM</a> or <a href="https://github.com/kuleshov/minillm">MiniLLM</a> just for learning how everything works.</p>
<h2 id="which-processor">Which processor?</h2>
<p>Obviously, bigger and newer is better, but which processor represents the most value for the money spent? There is a fantastic overview page that really breaks down the differences. Generally speaking, skip all Intel-based Macbook Pros as well as the base M1, M2 and M3. Rather, aim for an M1 Pro/Max/Ultra, M2 Pro/Max/Ultra or M3 Pro/Max version. (At the time of writing there exists no M3 Ultra yet, but it is expected to become available later in the year in the Mac Studio.) This site has more or less all the info you need: <a href="https://github.com/ggerganov/llama.cpp/discussions/4167">https://github.com/ggerganov/llama.cpp/discussions/4167</a></p>
<p>As long as we’re only talking about inferencing (and not training), there are mainly two distinct speeds to measure: input prompt processing (PP) and output token generation (TG). The two may diverge, and which one matters more depends of course on your use case. But frequently, for regular chatting with your LLM, the input prompts tend to be short and the output longer. So, in case of doubt, I’d rather optimize for TG than for PP.</p>
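<p>The benchmark numbers below come from llama.cpp’s <code>llama-bench</code> tool. If you want to reproduce such measurements yourself, an invocation looks roughly like this (a hedged sketch; the model path is made up and flags may change between versions):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># -p 512: measure prompt processing over a 512-token prompt (PP)
# -n 128: measure token generation over 128 tokens (TG)
./llama-bench -m ./models/llama-2-7b/ggml-model-q4_0.gguf -p 512 -n 128
</code></pre></div></div>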
<p>Here’s the crucial point: <em>For TG the most important ingredient is the processor’s memory bandwidth, and for PP it is the number of GPU cores - not the clock speed!</em> Take a moment to study the tables below; they are taken from the page linked to above. Here’s a small excerpt.</p>
<ul>
<li>BW = memory bandwidth in GB/s, higher is better</li>
<li>GPU Cores = the number of GPU cores of the MX processor, higher is better</li>
<li>F16, Q8_0, Q4_0 = the unquantized 16-bit model and two quantized variants of it. Q4_0 uses the fewest bits per model parameter and thus should be the fastest in terms of tokens per second (t/s), but it also yields the lowest quality.</li>
<li>PP = Prompt processing with number of tokens per second, higher is better</li>
<li>TG = Token generation with number of tokens per second, higher is better</li>
</ul>
<p>Effect of number of GPU cores (plus clock speeds):</p>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: left">BW [GB/s]</th>
<th style="text-align: left">CPU Cores</th>
<th style="text-align: left">F16 PP [t/s]</th>
<th style="text-align: left">F16 TG [t/s]</th>
<th style="text-align: left">Q8_0 PP [t/s]</th>
<th style="text-align: left">Q8_0 TG [t/s]</th>
<th style="text-align: left">Q4_0 PP [t/s]</th>
<th style="text-align: left">Q4_0 TG [t/s]</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">M1 Max</td>
<td style="text-align: left">400</td>
<td style="text-align: left">32</td>
<td style="text-align: left">599.53</td>
<td style="text-align: left">23.03</td>
<td style="text-align: left">537.37</td>
<td style="text-align: left">40.2</td>
<td style="text-align: left">530.06</td>
<td style="text-align: left">61.19</td>
</tr>
<tr>
<td style="text-align: left">M2 Max</td>
<td style="text-align: left">400</td>
<td style="text-align: left">38</td>
<td style="text-align: left">755.67</td>
<td style="text-align: left">24.65</td>
<td style="text-align: left">677.91</td>
<td style="text-align: left">41.83</td>
<td style="text-align: left">671.31</td>
<td style="text-align: left">65.95</td>
</tr>
<tr>
<td style="text-align: left">M3 Max</td>
<td style="text-align: left">400</td>
<td style="text-align: left">40</td>
<td style="text-align: left">779.17</td>
<td style="text-align: left">25.09</td>
<td style="text-align: left">757.64</td>
<td style="text-align: left">42.75</td>
<td style="text-align: left">759.76</td>
<td style="text-align: left">66.31</td>
</tr>
</tbody>
</table>
<p>The main point here is: the M2 Max and the M3 Max in particular are astonishingly close to each other - yet their purchase price could easily be 1000$ apart at the time of writing. Do your own math and ask yourself whether the gain in PP and TG is significant enough to justify spending that much more money.</p>
<p>What impact does bandwidth have? Let’s compare the lower spec’d M2 Max and M3 Max, each with only 30 GPU cores (the M1 Max with 32 cores is included for reference).</p>
<p>Effect of bandwidth:</p>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: left">BW [GP/s]</th>
<th style="text-align: left">CPU Cores</th>
<th style="text-align: left">F16 PP [t/s]</th>
<th style="text-align: left">F16 TG [t/s]</th>
<th style="text-align: left">Q8_0 PP [t/s]</th>
<th style="text-align: left">Q8_0 TG [t/s]</th>
<th style="text-align: left">Q4_0 PP [t/s]</th>
<th style="text-align: left">Q4_0 TG [t/s]</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">M1 Max</td>
<td style="text-align: left">400</td>
<td style="text-align: left">32</td>
<td style="text-align: left">599.53</td>
<td style="text-align: left">23.03</td>
<td style="text-align: left">537.37</td>
<td style="text-align: left">40.2</td>
<td style="text-align: left">530.06</td>
<td style="text-align: left">61.19</td>
</tr>
<tr>
<td style="text-align: left">M2 Max</td>
<td style="text-align: left">400</td>
<td style="text-align: left">30</td>
<td style="text-align: left">600.46</td>
<td style="text-align: left">24.16</td>
<td style="text-align: left">540.15</td>
<td style="text-align: left">39.97</td>
<td style="text-align: left">537.6</td>
<td style="text-align: left">60.99</td>
</tr>
<tr>
<td style="text-align: left">M3 Max</td>
<td style="text-align: left">300</td>
<td style="text-align: left">30</td>
<td style="text-align: left">589.41</td>
<td style="text-align: left">19.54</td>
<td style="text-align: left">566.43</td>
<td style="text-align: left">34.3</td>
<td style="text-align: left">567.59</td>
<td style="text-align: left">56.58</td>
</tr>
</tbody>
</table>
<p>See the difference between M2 Max and M3 Max? The lower spec’d M3 Max with 300 GB/s bandwidth is actually slightly <em>slower</em>, not faster, than the lower spec’d M2 Max with 400 GB/s, especially in token generation - yet the price difference for purchasing the more modern M3 Max Macbook Pro is substantial. Between the lower spec’d M2 Max and M3 Max you will end up paying a lot more for the latter without any clear gain. Why not save your money and rather rent a cloud GPU in case you ever need it? You get the point.</p>
<p>But that’s not the entire story. Take another look and compare the M1 Max with 32 cores and 400 GB/s of bandwidth to the M2 Max with 30 cores and the same bandwidth, and to the M3 Max with 30 cores and 300 GB/s. This M1 Max is really not far from the M3 Max either, and you can get it at a significant discount online if you search a little!</p>
<p>Is that all there is to the comparison? Should you save your money and aim for the cheapest model?</p>
<p>Well, consider this: If all you want is to <em>play occasionally</em> with an LLM, then the upgrade to an M3 Max might not seem worth it. However, you will of course be doing many other things with your Macbook Pro. Many other applications will run single-threaded on the CPU, and not make a lot of use of the GPU. So, the M3 family might still be worth it to power those use cases that might be more common and relevant for your <em>daily</em> work in comparison to <em>occasional</em> playing with LLMs. So, choose wisely.</p>
<h2 id="how-much-memory-ie-vram">How much memory, i.e. (V)RAM?</h2>
<p>What is the effect of having more (V)RAM on your use cases? The answer can also be found <a href="https://github.com/ggerganov/llama.cpp/discussions/4167#discussioncomment-7664186">on the same page</a>. Have a look at the table below. The machine is always a Macbook Pro 16’’ with an M2 Max (8+4 CPU cores, 38 GPU cores).</p>
<table>
<thead>
<tr>
<th style="text-align: left">(V)RAM</th>
<th style="text-align: left">model</th>
<th style="text-align: left">size</th>
<th style="text-align: left">params</th>
<th style="text-align: left">backend</th>
<th style="text-align: left">ngl</th>
<th style="text-align: left">test</th>
<th style="text-align: left">t/s</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">96 GB</td>
<td style="text-align: left">llama 7B mostly Q8_0</td>
<td style="text-align: left">6.67 GiB</td>
<td style="text-align: left">6.74 B</td>
<td style="text-align: left">Metal</td>
<td style="text-align: left">99</td>
<td style="text-align: left">tp 512</td>
<td style="text-align: left">674.50 ± 0.58</td>
</tr>
<tr>
<td style="text-align: left">96 GB</td>
<td style="text-align: left">llama 7B mostly Q8_0</td>
<td style="text-align: left">6.67 GiB</td>
<td style="text-align: left">6.74 B</td>
<td style="text-align: left">Metal</td>
<td style="text-align: left">99</td>
<td style="text-align: left">tg 128</td>
<td style="text-align: left">41.79 ± 0.04</td>
</tr>
<tr>
<td style="text-align: left">32 GB</td>
<td style="text-align: left">llama 7B mostly Q8_0</td>
<td style="text-align: left">6.67 GiB</td>
<td style="text-align: left">6.74 B</td>
<td style="text-align: left">Metal</td>
<td style="text-align: left">99</td>
<td style="text-align: left">tp 512</td>
<td style="text-align: left">674.37 ± 0.63</td>
</tr>
<tr>
<td style="text-align: left">32 GB</td>
<td style="text-align: left">llama 7B mostly Q8_0</td>
<td style="text-align: left">6.67 GiB</td>
<td style="text-align: left">6.74 B</td>
<td style="text-align: left">Metal</td>
<td style="text-align: left">99</td>
<td style="text-align: left">tg 128</td>
<td style="text-align: left">40.67 ± 0.05</td>
</tr>
</tbody>
</table>
<p><em>There is no difference in inferencing speeds when running an 8-bit quantized Llama 7B model on the same Macbook Pro model with 96 GB vs 32 GB!</em></p>
<p><a href="https://github.com/ggerganov/llama.cpp/discussions/4167#discussioncomment-7670917">Here’s what Mr Sparc writes</a>:</p>
<blockquote>
<p>Yes, it is expected that the same cpu/gpu spec will have similar performance values for same models to be compared regardless of RAM, as long as the size of the model to be used can be loaded into memory. The amount of RAM is a limiting factor in the size of the model that can be loaded, as only 75% (by default) of the unified memory can be used as VRAM on the GPU.</p>
</blockquote>
<p>In short: The main purpose of having enough memory is to allow loading the entire quantized model into memory. Once this is done it has no further substantial effect on the inferencing speed.</p>
<p>This means: <em>Buying larger amounts of memory (e.g. 96 GB or even 128 GB) does not speed up your inferencing with LLMs, but it allows you to run a bigger LLM.</em></p>
<p>How much memory do the different models need? <a href="https://www.reddit.com/r/LocalLLaMA/comments/157d89h/comment/jt43x4d/?utm_source=reddit&utm_medium=web2x&context=3">This Reddit thread</a> lists the following file sizes (on disk) for the 16-bit unquantized models. Also, according to <a href="https://news.ycombinator.com/item?id=37067933">this thread</a>, running a 70B unquantized Llama2 model would require ca. 160 GB of memory, while a 64 GB Macbook Pro should be able to run a 70B quantized Llama2 model.</p>
<table>
<thead>
<tr>
<th style="text-align: left">File size on disk</th>
<th style="text-align: left">Model</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">129 GB</td>
<td style="text-align: left">llama-2-70b</td>
</tr>
<tr>
<td style="text-align: left">129 GB</td>
<td style="text-align: left">llama-2-70b-chat</td>
</tr>
<tr>
<td style="text-align: left">25 GB</td>
<td style="text-align: left">llama-2-13b</td>
</tr>
<tr>
<td style="text-align: left">25 GB</td>
<td style="text-align: left">llama-2-13b-chat</td>
</tr>
<tr>
<td style="text-align: left">13 GB</td>
<td style="text-align: left">llama-2-7b</td>
</tr>
<tr>
<td style="text-align: left">13 GB</td>
<td style="text-align: left">llama-2-7b-chat</td>
</tr>
</tbody>
</table>
<p>As we can see, the disk files are relatively large, but quantization decreases those figures considerably.</p>
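<p>For a back-of-the-envelope estimate (my own rule of thumb, not an exact formula - the overhead factor for the KV cache and runtime is a rough guess), the memory footprint is roughly the parameter count times the bits per parameter:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def estimate_memory_gb(params_billion: float, bits_per_param: float, overhead: float = 1.2) -> float:
    """Rough LLM memory footprint: parameters * bits per parameter, plus some runtime overhead."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9 * overhead

print(estimate_memory_gb(70, 16))  # ~168 GB - close to the ca. 160 GB quoted above for unquantized 70B
print(estimate_memory_gb(70, 4))   # ~42 GB - why a 64 GB machine can run a 4-bit quantized 70B model
</code></pre></div></div>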
<p>There are some commands you can run to re-allocate the VRAM assigned to the GPU on your MX processor, see e.g. <a href="https://www.hardware-corner.net/increase-mac-vram/">here</a>. But keep in mind that quantization can have a significant impact on the required disk/memory size of the LLM. Once more, ask yourself whether it’s worth all that money just for playing around. You can run the largest LLMs on a rented GPU for some time; that’s probably cheaper than maxing out the memory on your laptop.</p>
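<p>For reference, on recent macOS versions the knob behind those tutorials is reportedly the <code>iogpu.wired_limit_mb</code> sysctl (use at your own risk; the value is in megabytes and resets on reboot):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># e.g. allow the GPU to wire up to 28 GB on a 32 GB machine (resets on reboot)
sudo sysctl iogpu.wired_limit_mb=28672
</code></pre></div></div>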
<p>Somewhere else I read that the operating system should get at least 8 GB of memory. Hence, if you go with 16 GB of memory on your Macbook Pro you will unfortunately only be able to run the smallest LLMs currently available. But already with 32 GB you should be able to do something decent, and if you are willing to pay for 48 GB or 64 GB of memory then mid-sized LLMs become accessible to you as well. See this discussion for some further hints: <a href="https://github.com/ggerganov/llama.cpp/issues/13">https://github.com/ggerganov/llama.cpp/issues/13</a></p>
<h2 id="how-much-is-it-worth-spending">How much is it worth spending?</h2>
<p>Well, as usual, that’s hard to tell. I assume that if you are looking into running LLMs locally, you are someone who enjoys working with tech and is willing to spend a little more on hardware. Personally, for playing with LLMs I would probably aim for an M2 Max or an M3 Max with e.g. 32 GB of memory. The lower spec’d Macbook Pro 16’’ with an M3 Max (30 cores), 36 GB memory and 1 TB of SSD currently costs 3499$ in the Apple shop. That’s of course quite a bit of money, but if you aim instead at an M2 Max you won’t sacrifice too much GPU speed and you get a substantially cheaper laptop in comparison. But then again, the single-core speed might be more of a concern for you. Or, if you are someone doing heavy video editing, then maybe none of what I said above is even very relevant to you. But then you should also ask yourself whether you should not aim for a Mac Studio instead of a laptop. Sure, the Macbook Pro can take such loads, but the constant heavy lifting may shorten the life expectancy of your battery unnecessarily, and your fans might kick in frequently, making considerable noise.</p>
<p>It’s tempting to spend lots of money on a luxury Macbook Pro that you probably don’t really need. Be clever and don’t let Apple drain your wallet.</p>
<h1 id="generative-ai-augmented-program-execution">Generative AI-Augmented Program Execution</h1>
<p>2023-11-09 · <a href="http://fabian-kostadinov.github.io/2023/11/09/genai-augmented-program-execution">http://fabian-kostadinov.github.io/2023/11/09/genai-augmented-program-execution</a></p>
<p>The use of Generative AI allows a completely new programming paradigm. I call this paradigm <strong>Generative AI-Augmented Program Execution (GAPE)</strong>. In very brief, GAPE is a programming approach that mixes regular software code with output from GenAI models. The idea is best illustrated with an example. Imagine you have a list of animals: [“horse”, “cat”, “whale”, “goldfish”, “bacteria”, “dove”]. You do not have any further information beyond those animal names. Now you are asked to sort this list according to the animals’ expected or average weight in descending order. By applying common sense reasoning a human returns this ordered list: [“whale”, “horse”, “cat”, “dove”, “goldfish”, “bacteria”]. Assume we implement a (e.g. Python) function <code>sort(list_to_sort: List[str], sort_criterion: str) -> List[str]</code> such that the sorting happens entirely automatically behind the scenes by calling a Large Language Model. Input is a list of strings and a sort criterion description as a string parameter. Both are sent as a prompt to the LLM, which computes the result. The output is again the same list, just ordered according to said criterion. Wild, isn’t it? Welcome to <em>Generative-AI Augmented Program Execution</em>!
<span class="more"></span></p>
<h1 id="a-simple-example-sorting-with-gape">A simple example: Sorting with GAPE</h1>
<p>Let’s have a look at how such a sort function could be implemented.</p>
<p>First, we define a class <em>SortPrompt</em>. As its name indicates it represents a prompt to an LLM. The constructor takes arguments such as a list of items to sort, a sort order (ascending or descending) of type enum, a textual description of which criterion to apply, and it defines the LLM’s system role (“You are an expert in…”). After having created an instance of the class, we can call the <code>to_str()</code> function to obtain a string representation of the prompt.
If we wanted to get a little fancier here, we could also have applied an OOD factory pattern.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span><span class="p">,</span> <span class="n">Callable</span>
<span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span>
<span class="kn">from</span> <span class="nn">enum</span> <span class="kn">import</span> <span class="n">Enum</span>
<span class="kn">import</span> <span class="nn">textwrap</span>
<span class="k">class</span> <span class="nc">SortOrder</span><span class="p">(</span><span class="n">Enum</span><span class="p">):</span>
<span class="n">ascending</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">descending</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>
<span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">SortPrompt</span><span class="p">:</span>
<span class="n">system_role</span><span class="p">:</span> <span class="nb">str</span>
<span class="n">items</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>
<span class="n">order</span><span class="p">:</span> <span class="n">SortOrder</span>
<span class="n">criterion</span><span class="p">:</span> <span class="nb">str</span>
<span class="k">def</span> <span class="nf">to_str</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="n">items</span> <span class="o">=</span> <span class="s">", "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">items</span><span class="p">)</span>
<span class="n">prompt</span> <span class="o">=</span> <span class="s">"""
### Instructions
You are {system_role}.
Your task is to sort below list of items in {order} order according to {criterion}.
Once you have sorted them, provide a reasoning for the sort order you have selected.
Return the output as a JSON object in the format:
{{
"result": ["item 1", "item 2", ..., "item n"],
"sort_order": "EITHER ascending OR descending",
"sort_criterion": "the criterion applied to sort",
"reason": "description why you sorted all items accordigly"
}}
### Example
List of items to sort: ["house", "racing horse", "bicycle", "pizza"]
Sort order: ascending
Sort criterion: purchasing price
Expected output:
{{
"result": ["pizza", "bicycle", "racing horse", "house"],
"sort_order": "ascending",
"sort_criterion": "purchasing price",
"reason": "A house is more expensive than a racing horse. A racing horse is more expensive than a bicycle. A bicycle is more expensive than a pizza."
}}
### Input
{items}"""</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">system_role</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">system_role</span><span class="p">,</span> <span class="n">order</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">order</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="n">criterion</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">criterion</span><span class="p">,</span> <span class="n">items</span><span class="o">=</span><span class="n">items</span><span class="p">)</span>
<span class="k">return</span> <span class="n">textwrap</span><span class="p">.</span><span class="n">dedent</span><span class="p">(</span><span class="n">prompt</span><span class="p">).</span><span class="n">strip</span><span class="p">()</span>
</code></pre></div></div>
<p>As you can see, the <em>SortPrompt</em> class is a simple container for data required to create the entire prompt. I am using the <code>@dataclass</code> decorator to save myself from creating an <code>__init__()</code> constructor function as well as getters and setters. The <code>to_str()</code> function assembles the entire prompt and returns it, so we can feed it into our sort function later on.</p>
<p>The prompt uses various means to (hopefully) improve the LLM’s response quality.</p>
<ol>
<li>It defines the expected output of the LLM as a JSON object.</li>
<li>It provides a one-shot (or even few-shot) example.</li>
<li>It asks the LLM to provide a reasoning for its response.</li>
</ol>
<p>Maybe there are even more tricks we could apply here, but we are right now not interested in <a href="https://en.wikipedia.org/wiki/Prompt_engineering">prompt design best practices</a>.</p>
<p>Next, we feed our prompt as a string into a simple sort function.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">sort</span><span class="p">(</span><span class="n">prompt</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">completion</span><span class="p">:</span> <span class="n">Callable</span><span class="p">,</span> <span class="n">print_prompt</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="o">-></span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
<span class="k">if</span> <span class="n">print_prompt</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">prompt</span><span class="p">)</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">openai_response</span> <span class="o">=</span> <span class="n">completion</span><span class="p">(</span><span class="n">prompt</span><span class="p">)</span>
<span class="n">json_str</span> <span class="o">=</span> <span class="n">openai_response</span><span class="p">[</span><span class="s">"choices"</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s">"message"</span><span class="p">][</span><span class="s">"content"</span><span class="p">]</span> <span class="c1"># type: str
</span> <span class="n">json_response</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">json_str</span><span class="p">)</span> <span class="c1"># type: dict
</span> <span class="n">response</span> <span class="o">=</span> <span class="n">json_response</span><span class="p">[</span><span class="s">"result"</span><span class="p">]</span> <span class="c1"># type: list
</span> <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"sort: Error: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">return</span> <span class="n">response</span>
</code></pre></div></div>
<p>For simplicity I have designed the sort function to take the prompt as a string input. If a developer does not want to use the suggested <em>SortPrompt</em> class for any reason, she is free to write her own prompt entirely from scratch. In this way we are not technically enforcing usage of <em>SortPrompt</em>; it’s just a suggestion.
The sort function also takes a Callable as a second parameter that we call <em>completion</em>. The completion function is the function which makes the actual call to the LLM.
The return type is a list of strings, i.e. simply the ordered version of the list that we provided as part of the prompt string.</p>
<p>Here’s a simple implementation for the completion function using the OpenAI API:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">openai</span>
<span class="c1"># Don't forget to set openai.api_key = "...your key..." before calling:
# openai.api_key = os.getenv("OPENAI_API_KEY")
</span>
<span class="k">def</span> <span class="nf">openai_completion</span><span class="p">(</span><span class="n">prompt</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">any</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">openai_response</span> <span class="o">=</span> <span class="n">openai</span><span class="p">.</span><span class="n">ChatCompletion</span><span class="p">.</span><span class="n">create</span><span class="p">(</span>
<span class="n">model</span> <span class="o">=</span> <span class="s">"gpt-3.5-turbo-1106"</span><span class="p">,</span>
<span class="n">messages</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">prompt</span><span class="p">},</span>
<span class="p">],</span>
<span class="n">temperature</span> <span class="o">=</span> <span class="mf">0.00000001</span>
<span class="p">)</span>
<span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"openai_completion: Exception: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">return</span> <span class="n">openai_response</span>
</code></pre></div></div>
<p>I’ve set the temperature to a very low value (or possibly 0.0) to make the LLM’s output as deterministic as possible. In my experience, though, at least OpenAI’s gpt-3.5-turbo is not fully deterministic even when the temperature is set to 0.0.</p>
<p>Okay, let’s put everything together:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sort_params</span> <span class="o">=</span> <span class="n">SortPrompt</span><span class="p">(</span>
<span class="n">system_role</span> <span class="o">=</span> <span class="s">"a helpful assistant"</span><span class="p">,</span>
<span class="n">items</span> <span class="o">=</span> <span class="p">[</span><span class="s">"cat"</span><span class="p">,</span> <span class="s">"rat"</span><span class="p">,</span> <span class="s">"mouse"</span><span class="p">,</span> <span class="s">"elephant"</span><span class="p">,</span> <span class="s">"fly"</span><span class="p">,</span> <span class="s">"tiger"</span><span class="p">,</span> <span class="s">"bacteria"</span><span class="p">,</span> <span class="s">"goldfish"</span><span class="p">],</span>
<span class="n">order</span> <span class="o">=</span> <span class="n">SortOrder</span><span class="p">.</span><span class="n">descending</span><span class="p">,</span>
<span class="n">criterion</span> <span class="o">=</span> <span class="s">"their physical weight"</span>
<span class="p">)</span>
<span class="n">prompt</span> <span class="o">=</span> <span class="n">sort_params</span><span class="p">.</span><span class="n">to_str</span><span class="p">()</span>
<span class="n">sorted_list</span> <span class="o">=</span> <span class="n">sort</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">openai_completion</span><span class="p">,</span> <span class="n">print_prompt</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">sorted_list</span><span class="p">)</span>
</code></pre></div></div>
<p>Expected output: <code>["elephant", "tiger", "cat", "rat", "mouse", "goldfish", "fly", "bacteria"]</code>.</p>
<h1 id="a-plethora-of-possibilities">A plethora of possibilities</h1>
<p>Generative AI-augmented programming offers an almost endless number of such possibilities; sorting is just one of them. Let’s look at some examples to illustrate.</p>
<h2 id="list-functions">List functions</h2>
<ul>
<li>A <code>filter</code> function that takes a list of strings and filters out those that don’t fulfill a given criterion. Example: Take a list of animals but return only those that are carnivores. We do not define upfront what the concept of “carnivore” actually means; instead we leave this to the LLM to figure out. (A minimal sketch of this function follows after this list.)</li>
<li>A <code>map</code> function that takes a list of strings and applies some other function (described in the prompt) to each element in the list. Example: Take a list of animals and always return the male name of the animal if available (<em>stallion</em> instead of <em>horse</em>, <em>tomcat</em> instead of <em>cat</em>, and so on).</li>
<li>A <code>find</code> function that takes a list of strings and a matching criterion and returns the first element encountered where the matching criterion is given or <code>None</code> if no matching element exists. As an extension, it might also be run in reverse starting with the last element. Example: Please find the first element that is a city in France: [“Washington”, “Rome”, “Lyon”, “Stockholm”, “Singapore”, “Moscow”, “Canberra”, “Marseille”].</li>
<li>A <code>find_all</code> function that does same as the <em>find</em> function but matches all elements and returns a list of strings.</li>
<li>A <code>contains</code> function that takes a list of strings and a match criterion (for example an element circumscribed with different words) and returns <em>True</em> if the list contains such an element, or <em>False</em> otherwise. Example: Does the following list contain any mammals: [“jellyfish”, “spider”, “trout”, “toad”, “hare”, “eagle”]? Note that we don’t provide any clear explanation what constitutes a “mammal” here, that’s the LLM’s task to find out.</li>
</ul>
<p>And many more.</p>
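<p>To make the first of these concrete, here is a minimal sketch of such a <code>filter</code> function, reusing the <code>openai_completion</code> helper from above. It is deliberately simplified (no examples in the prompt, no JSON-repair logic), so treat it as an illustration rather than production code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
from typing import Callable, List

def filter_list(items: List[str], criterion: str, completion: Callable) -> List[str]:
    # Ask the LLM to keep only the items fulfilling the criterion; the concept
    # behind the criterion (e.g. "carnivore") is left entirely to the LLM.
    prompt = (
        'Return only a JSON object of the form {"result": ["item 1", ...]} containing '
        f'those items of the following list that fulfill the criterion "{criterion}".\n'
        f'List of items: {", ".join(items)}'
    )
    response = completion(prompt)
    return json.loads(response["choices"][0]["message"]["content"])["result"]

# Example: filter_list(["cat", "cow", "tiger", "goldfish"], "the animal is a carnivore", openai_completion)
# Likely output: ["cat", "tiger"]
</code></pre></div></div>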
<h2 id="boolean-functions">Boolean functions</h2>
<ul>
<li>A <code>condition</code> function that takes a given statement and returns <code>True</code> or <code>False</code> depending on whether the statement is true or false. Example: The statement “Planet Earth is the largest planet in the solar system.” is clearly false. Hence, the condition function returns <code>False</code>.</li>
<li>An <code>equal</code> or perhaps <code>synonym</code> function that takes two elements plus an equality comparator criterion. It then compares both elements according to the criterion given and returns <em>True</em> if both are equal, and <em>False</em> otherwise. Example: Are “ability” and “capability” equal?</li>
<li>An <code>is_part_of</code> function that takes two items and returns <em>True</em> if the first item is a part of the second, or <em>False</em> otherwise. Example: Is “wheel” part of “car”?</li>
</ul>
<p>And many more.</p>
<h2 id="general-knowledge-functions">General knowledge functions</h2>
<ul>
<li>A <code>more_details</code> function that takes a concept and returns more details on it. If the LLM is allowed to use a plugin (e.g. web search or search of a specific database) then it can extend the knowledge base it draws from to the internet. Example: Input is “Hotels in Amsterdam”, and the LLM is allowed to query a third-party website with an overview of hotels. The LLM triggers the search and ultimately returns the results.</li>
</ul>
<p>And many more.</p>
<h2 id="user-input-callback-functions">User input callback functions</h2>
<p>We could also reverse the entire approach and ask a user for feedback. Imagine that we don’t fully trust the output of an LLM. We want to build in a user check. So, our functions above are enriched with a callback option. For example, once the LLM returned a response the program flow makes a call to a user interface and asks the user to validate the LLM’s response. However, unless the user is expected to sit in front of the computer we could also add a callback function such that the user’s response is captured asynchronously at a later point in time. (I’m sure this can probably be engineered in a better way, but I just wanted to make the point nevertheless.)</p>
<h1 id="gape-as-a-paradigm">GAPE as a paradigm</h1>
<p>I would bet that many readers find the entire approach both fascinating and completely weird at the same time. Or simply irrelevant. Why on earth should anyone use an LLM for such things, given the imprecision, vagueness and straight-out hallucinations that LLMs are plagued with these days?
Another objection might be that in many situations there does not even exist a proper concept to rely on in the first place. For example: Is a whale heavier than a horse? Well, that might depend on the type of whale we are referring to! A very small (or very young) whale might well be lighter than a big horse. So, sorting a list of animals according to their weight would result in a bunch of nonsense, no?</p>
<p>Both arguments are of course true. And yet, both arguments totally miss the point. Because: We are talking about <em>common sense reasoning</em> here. If we wanted to have mathematical precision we could simply have used traditional programming in the first place! It is exactly because the LLMs are vague, imprecise and so on that they are of interest to us. The novelty is the “productive lack of precision” that is the game changer here. LLMs are as dirty in their knowledge as our own! If I ask a friend whether a whale is heavier than a horse, she might say something along the lines of: “Well, generally speaking yes, but you could refer to a really small whale, and then it would not be the case anymore.” And that is exactly the beauty of it! <strong>GAPE is not about achieving 100% correctness or accuracy. Instead, it is about coming to the most reasonable conclusion in a given situation.</strong> GAPE is not programming for correctness and flawless execution logic like traditional programming paradigms. Instead it more or less mimics the reasoning that humans do: full of imperfection, full of contradictions, but frequently just reasonable.</p>
<p>I hear some readers saying: Oh, I get that, but what the heck should this be useful for?</p>
<p>Well, that is a question I’ll still have to figure out myself. I have some preliminary ideas, but nothing to share here for the moment. For now, just imagine that you are sitting in front of a very weird 15-year-old. This youth has an enormously broad knowledge, but makes many mistakes and sometimes gets even very basic facts about the world completely wrong. The person never gets hungry or thirsty and never needs sleep. How would you make best use of such a person around you? Certainly, you would not expect the person to solve all mysteries of life. Also, you would not expect the person to know anything about a set of documents you have stored on your harddrive to which nobody in the world except yourself has access. Instead, the person would be very useful for offloading and automating relatively simple workflows that require common sense world knowledge. You would not have to teach the person that rain falls from the sky (rather than in the reverse direction); the person would tacitly know such stuff. What if you gave that person something more complicated to do, like reading a legal contract of a few dozen pages? Well, surely the person would not get everything right. Understanding contracts is complicated. But then again, it probably would get most simple questions right. And it would never be tired, no matter how many questions you throw at it. If you explained a few extra points to the person (e.g. by extending your prompt) then the person might even learn enough about how to interpret a contract that from that point onwards the responses would improve.</p>
<p>Compare that to what you have today: another person (i.e. your computer) who always gets everything 100% right - but only in an extremely narrowly defined area, and only under ideal circumstances. If you forget to tell it a single bit of information it just stares at you with a blank expression.</p>
<p>Generative AI-augmented program execution is about combining both types of expertise in an intelligent manner: the precision and flawlessness of formal logic and programming, but also its narrowness and limitedness. And the imprecision and opaqueness of Generative AI / LLMs, but also their breadth of knowledge and capabilities.</p>
<p>And that allows a breathtaking new way of how to create applications of a kind we have never seen before.</p>
<h1 id="advanced-prompt-architectures-querying-and-processing-denormalized-structured-data">Advanced Prompt Architectures - Querying and processing denormalized structured data</h1>
<p>2023-10-30 · <a href="http://fabian-kostadinov.github.io/2023/10/30/prompt-architectures-querying-exploded-data">http://fabian-kostadinov.github.io/2023/10/30/prompt-architectures-querying-exploded-data</a></p>
<p>In the past I focused a lot on how large language models (LLMs) allow new ways to interact with text data. Recently, while talking to a colleague, it suddenly dawned on me that large language models actually also enable new ways to “query” or “process” structured tabular data. In this article I will compare four different approaches:</p>
<ol>
<li>Querying structured data via APIs.</li>
<li>Querying structured data with text-to-SQL or similar.</li>
<li>Processing structured data that was denormalized (“exploded”) before feeding it into an LLM.</li>
<li>Querying the vector embeddings of structured data that was denormalized.
<span class="more"></span></li>
</ol>
<p>All approaches ultimately serve a similar need. Imagine you have tabular data such as a CSV file or a database table. Using the LLM you would like to “query” or “process” your dataset and get responses back. I am putting “querying” and “processing” here in double quotes because these terms can be interpreted in multiple ways.</p>
<h1 id="four-approaches-to-querying-or-processing-structured-data">Four approaches to querying or processing structured data</h1>
<h2 id="1-llm-plus-apis-querying-your-data-through-an-api">1. LLM plus APIs: Querying your data through an API</h2>
<p>This approach has already been popularized with <a href="https://openai.com/blog/chatgpt-plugins">ChatGPT plugins</a>. In brief, the LLM interacts with an API that hides the querying logic. The LLM is just responsible for providing the right querying parameters, whereas the creation of the query and its execution is hidden behind the API. <a href="https://python.langchain.com/docs/use_cases/apis">Langchain supports this approach</a> as well.</p>
<h2 id="2-text-to-sql">2. Text-to-SQl</h2>
<p>This is a very powerful approach. The LLM creates SQL (or any other formal query language like SPARQL) from a query description in natural language. An agent then runs the query against a database and returns the results to the LLM. The LLM reports the results in natural language again. This approach is described in more detail in, for example, <a href="https://python.langchain.com/docs/use_cases/qa_structured/sql?ref=blog.langchain.dev">Langchain’s documentation on question answering on structured data</a>.</p>
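<p>Langchain’s API changes quickly, but at the time of writing a minimal text-to-SQL setup looked roughly like the sketch below (class names and import paths may have moved between versions, and the database URI and question are made up):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from langchain.llms import OpenAI
from langchain.utilities import SQLDatabase
from langchain_experimental.sql import SQLDatabaseChain

db = SQLDatabase.from_uri("sqlite:///customers.db")  # hypothetical local database
llm = OpenAI(temperature=0)
chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)

# The chain lets the LLM write the SQL, runs it, and reports the result in natural language
chain.run("How many customers signed up in 2022?")
</code></pre></div></div>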
<h2 id="3-processing-denormalized-structured-data-by-feeding-it-into-an-llm">3. Processing denormalized structured data by feeding it into an LLM</h2>
<p>While the previous two approaches are powerful for returning highly accurate and formally correct results (as long as the query is correct), there are situations where correctness of the result is not the main goal. We could perhaps state that the two approaches described previously aim at maximizing precision, whereas this third approach aims at maximizing recall. (This is not entirely accurate, but hopefully illustrates the point I am about to make.)</p>
<p>Structured data are usually stored in a normalized format in a database. If you are working with LLMs, <strong>it might make a lot of sense to denormalize your data first whenever you intend to feed it directly into your LLM.</strong> The denormalized data is much easier for the LLM to interpret, and hence you can expect a much higher quality of response from the LLM.</p>
<p>Here is an example. Imagine that you have a dataset that contains a column with a <em>gender</em> attribute. Data is stored as single characters: an ‘F’ for women (females), an ‘M’ for men (males), an ‘O’ for others, and a ‘U’ for unknown or unspecified. As long as you are querying these data with a formal query language such as SQL you can perfectly make sense of this specification. The interpretation of what the single characters mean is encoded somewhere in your application logic.
However, from the perspective of an LLM those letters do not carry a lot of meaning. For the LLM, understanding the context of single characters is much harder than understanding entire words. Chances are the LLM does not fully grasp the meaning of the letters ‘F’, ‘M’, ‘O’, and ‘U’. However, the moment you <em>denormalize</em> your data and replace the single characters with actual words - ‘female’, ‘male’, ‘other’, ‘unknown’ - the LLM suddenly has a lot more information to work with.</p>
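<p>A minimal sketch of this decoding step with pandas (the column names and the mapping are made up for illustration):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

df = pd.DataFrame({
    "age": [46, 21, 7, 62],
    "gender": ["M", "F", "M", "U"],
})

# Replace the coded categories with full words so the LLM has real context to work with
gender_words = {"F": "female", "M": "male", "O": "other", "U": "unknown"}
df["gender"] = df["gender"].map(gender_words)

prompt_csv = df.to_csv(index=False)  # this denormalized text goes into the prompt
</code></pre></div></div>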
<p>But why would you even want the LLM with all its imprecision to query your data rather than something as precise as a SQL query? Simply because the LLM can actually <em>understand and interpret</em> your data in a way the SQL query cannot. For example, the LLM might understand that it is unlikely for children to have an income of 100’000 EUR per year, or that only women can give birth to a child. Consider this example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I will give you now an excerpt from a CSV file. The columns are:
1. age in years
2. income in USD
3. level of education
4. gender
Your task is to find rows which have a combination that seems to be unreasonable. Output should be the line followed by an arrow -> and a very brief explanation.
Here is the CSV:
age,income,level_of_education,gender
46,100000,college,male
21,30000,high school,female
7,140000,university,male
62,65000,high school,male
</code></pre></div></div>
<p>Note the row of a 7-year old kid earning 140k USD with a university degree. Here is the actual answer using ChatGPT v3.5:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Certainly, here are the rows with combinations that seem to be unreasonable, followed by an arrow (->) and a brief explanation:
1. 7,140000,university,male -> It is highly unusual for a 7-year-old to have an income of $140,000 and a university-level education. This combination is likely incorrect or an outlier in the data.
</code></pre></div></div>
<p>Note how vaguely my prompt is formulated: all I am instructing it to do is to find “unreasonable combinations”, without giving any further hints.</p>
<p>Let’s try a harder example.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I will give you now an excerpt from a CSV file. The columns are:
1. year of birth; note that we are in the year 2023 currently
2. income in USD
3. level of education, where: 0 means "no education", 1 means "basic education", 2 means "high school", 3 means "college", 4 means "higher education"
4. gender, where: 'M' means male, 'F' means female, 'O' means other, 'U' means unknown
Your task is to find rows which have a combination that seems to be unreasonable. Output should be the line followed by an arrow -> and a very brief explanation.
Here is the CSV:
age,income,level_of_education,gender
1977,100000,3,M
2002,30000,2,F
2016,140000,4,M
1961,65000,2,M
</code></pre></div></div>
<p>I have replaced the age with the birth year and the degree of education with a categorical value. Here’s the answer:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thank you for providing the CSV data. Here are the rows with combinations that seem to be unreasonable, followed by an arrow (->) and a brief explanation:
1. 1977,100000,3,M -> While the year of birth (1977) and income ($100,000) are reasonable, it's unusual for a male with a college-level education to have been born in 1977 and still be in the workforce, given that we are in the year 2023. This suggests the person might not have retired, which is less common.
2. 2016,140000,4,M -> An individual born in 2016 with a higher education level (4) and a high income of $140,000 is highly unlikely, as they would be very young and typically not have completed higher education yet. This combination seems unrealistic.
Please note that the explanations are based on the data provided, and they highlight inconsistencies or unusual aspects in the dataset.
</code></pre></div></div>
<p>As you can see, now the LLM starts to hallucinate. The second answer is correct, but the first one does not make any sense. A person born in the year 1977 would be 46 in the year 2023 and hence not be retired yet.</p>
<p>Takeaway message:</p>
<blockquote>
<p><strong>Denormalizing the data improves the likelihood for the LLM to be able to properly interpret the data.</strong></p>
</blockquote>
<h2 id="4-querying-denormalized-structured-data-as-an-embedding-vector">4. Querying denormalized structured data as an embedding vector</h2>
<p>The fourth approach is to first denormalize the structured data as described previously, then create an embedding vector for each row, index all vectors, and finally use Retrieval Augmented Generation (RAG) to query them. Obviously, this will not yield results as precise as queries on structured data. Here, the intention is a different one. Imagine that your structured data is enriched with a little bit of unstructured data. Or the opposite: you have some unstructured data that you additionally enrich with some structured data such as metadata.</p>
<p><img src="/public/img/2023-10-30-querying-denormalized-structured-data-as-an-embedding-vector.png" alt=""Querying denormalized structured data as an embedding vector"" /></p>
<p>Querying those embedding vectors should now yield “semantically similar rows” compared to your original structured data. A formal query (e.g. SQL) on structured data is not built for identifying semantically similar rows in your data. Hence, this could constitute a way to decrease precision and potentially increase recall in your information retrieval system.</p>
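<p>A hedged sketch of the idea using the (pre-1.0) openai library and a plain cosine similarity, without any vector database (the row texts and the query are made up; in practice you would index the vectors, e.g. in Elasticsearch):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import openai

def embed(texts):
    # Embeddings via the pre-1.0 openai library; requires openai.api_key to be set
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [np.array(d["embedding"]) for d in resp["data"]]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rows = [
    "age: 46, income: 100000 USD, level of education: college, gender: male",
    "age: 21, income: 30000 USD, level of education: high school, gender: female",
]
row_vectors = embed(rows)
query_vector = embed(["well-earning middle-aged man"])[0]

# Rank the denormalized rows by semantic similarity to the query
ranked = sorted(zip(rows, row_vectors), key=lambda rv: cosine(query_vector, rv[1]), reverse=True)
print(ranked[0][0])
</code></pre></div></div>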
<h1 id="summary-of-all-four-techniques">Summary of all four techniques</h1>
<p>Here is a brief overview on all four approaches.</p>
<table>
<thead>
<tr>
<th> </th>
<th>1. Querying data through API</th>
<th>2. Text-to-SQL</th>
<th>3. Processing denormalized data</th>
<th>4. Querying denormalized data as embedding vector</th>
</tr>
</thead>
<tbody>
<tr>
<td><em>Optimized for</em></td>
<td>Precision</td>
<td>Precision</td>
<td>Recall</td>
<td>Recall</td>
</tr>
<tr>
<td><em>Suited if</em></td>
<td>You need the precision of a structured query</td>
<td>You need the precision of a structured query</td>
<td>You want the LLM’s power to “understand” and interpret your data</td>
<td>You want to enable semantic searching capabilities on your dataset</td>
</tr>
<tr>
<td><em>Advantage</em></td>
<td>Non-technical people can query structured data</td>
<td>Non-technical people can query structured data</td>
<td>LLM can interpret results</td>
<td>Allows semantic searching</td>
</tr>
<tr>
<td><em>Disadvantage</em></td>
<td>Complexity of interoperability between LLM and API; no integration with own APIs possible for commercially available products (e.g. OpenAI)</td>
<td>LLM may formulate wrong queries; danger of data deletion?</td>
<td>Requires data denormalization; LLM may hallucinate; restricted maximum prompt size</td>
<td>Results may be insufficiently precise</td>
</tr>
</tbody>
</table>
<h1 id="how-i-created-a-semantic-qa-bot-for-13k-recorded-dreams">How I created a semantic Q&A bot for >13k recorded dreams using OpenAI, Langchain and Elasticsearch</h1>
<p>2023-06-12 · <a href="http://fabian-kostadinov.github.io/2023/06/12/how-i-created-a-semantic-qa-bot-with-openai-embeddings-api-elasticsearch-and-langchain">http://fabian-kostadinov.github.io/2023/06/12/how-i-created-a-semantic-qa-bot-with-openai-embeddings-api-elasticsearch-and-langchain</a></p>
<p>I have been fascinated with dream interpretation for a long time. So, over the weekend I decided to create a database of >13k recorded dreams I scraped from the web. The database allows me to perform semantic searches rather than just plain-text searches, and beyond that it even allows GPT-enabled question answering. For example, I can ask the Q&A bot questions like: “Please search all dreams containing animals, and give me a list of their activities.” Not only does the bot have an understanding of what constitutes an animal (deer, elephant, snake, spider…), but it also provides meaningful, ChatGPT-style answers. Here’s how I did it.<span class="more"></span></p>
<p>An honourable mention goes to <a href="https://dylancastillo.co/semantic-search-elasticsearch-openai-langchain/">Dylan Castillo, whose tutorial</a> inspired me to do so.</p>
<p>These were the steps I had to take:</p>
<ol>
<li>Create a Selenium-based web scraper to scrape >13k dream reports</li>
<li>Install Elasticsearch on my laptop to store the embeddings vectors</li>
<li>Create vectors for all dreams using the OpenAI embeddings API and Langchain, then index all vectors in my local Elasticsearch instance</li>
<li>Create a semantic search engine and a semantic Q&A bot</li>
</ol>
<p>Here’s the tech stack I was using:</p>
<ul>
<li>Python v3.11 (although 3.10 would also have been sufficient, I guess)</li>
<li>Elasticsearch v8.8.1 (minimum is v8.0)</li>
<li>Langchain v0.0.196 (one day later Langchain’s version is already at v0.0.198…)</li>
<li>OpenAI APIs</li>
<li>My MacBook Pro 2015 with 16 GB memory and Intel i7 2.8 GHz Quad-Core CPU</li>
</ul>
<p>Let’s look into these steps in more detail.</p>
<h1 id="step-1-create-a-selenium-based-web-scraper">Step 1: Create a Selenium-based web scraper</h1>
<p>There are various online forums where users can post their dreams publicly and engage in discussions on their meanings. I will not go into too much details on how to scrape them as you can easily find many tutorials out there, but automation with Selenium definitely is your friend.</p>
<p>The websites I scraped from were not always stable. Quite frequently, the sites did not respond and timed out for otherwise well-functioning URLs. Hence, I had to keep a list of already scraped sites. I only chose websites that used forum software with standardised HTML tags upon which I could rely to extract the actual text content of the dream. One assumption was that the first post in a discussion thread would typically contain the description of the dream itself. Obviously, this assumption might not always hold, but for my fun project this level of accuracy was more than enough.</p>
<p>During the scraping I made sure that I also stored some metadata on each reported dream, like the post’s date and time, the author’s nickname if publicly available, the exact URL, a running ID for each dream, and a few more details.</p>
<p>In total, I created a local dataset with >13k dreams with a sufficient quality to obtain some useful results.</p>
<h1 id="step-2-install-elasticsearch-and-kibana-v881-on-my-laptop">Step 2: Install Elasticsearch and Kibana v8.8.1 on my laptop</h1>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html">Starting from v8.0 Elasticsearch supports both exact and approximate k-nearest neighbours search for embedding vectors</a> using the <a href="https://arxiv.org/abs/1603.09320">hierarchical navigable small world graph (HNSW) algorithm</a>. Langchain in turn supports various vector databases such as <a href="https://docs.pinecone.io/docs/openai">OpenAI Pinecone</a>, <a href="https://www.trychroma.com">ChromaDB</a> or Elasticsearch. The advantage of Elasticsearch over e.g. Pinecone should be obvious: It’s an enterprise-ready, battle-tested search engine, and it does not cost you a lot of extra money if you host it on your own servers.</p>
<p>Initially, I tried to download and install both Elasticsearch and Kibana v8.8.1 as containers and then run them on Docker on my MacBook Pro. How should I say: I failed. Not sure what was wrong. Initially I thought Kibana or Elasticsearch was running out of memory, but after making some changes it seemed Kibana somehow could not properly communicate with Elasticsearch. At that point I gave up and decided to install them directly on my laptop, so I would not have to fiddle around with container settings too. That worked much better, and after some time I had a running Elasticsearch and Kibana instance.</p>
<p>One thing I struggled with quite a bit: Elasticsearch by default enables SSL (https) connections. Unencrypted connections (http) are blocked by default. Great for security, not so great for local development. So, even for a connection from my Python code running on localhost to Elasticsearch running on localhost I had to fiddle around with security certificates.</p>
<p>First, I had to convert the existing SSL security certificates provided by Elasticsearch into a different format. Python required a .pem file, whereas Elasticsearch provided an http.p12 file. However, for the conversion process I first needed an obscure “XPack import password” (or whatever it was) used behind the scenes by Elasticsearch. Note that this password had not been disclosed to me during the installation process itself. With some Google Fu I finally found the right command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>elasticsearch/bin/elasticsearch-keystore show xpack.security.http.ssl.keystore.secure_password
</code></pre></div></div>
<p>Equipped with the import password I could then follow <a href="https://cdax.ch/2022/02/20/elasticsearch-python-workshop-1-the-basics/">this tutorial</a>. The key command to convert from .p12 to .pem file is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openssl pkcs12 -in elasticsearch/certs/http.p12 -cacerts -out elasticsearch/certs/python_es_client.pem
</code></pre></div></div>
<p>After that I could copy the python_es_client.pem file to my local project. As we’ll see later, I had to use this certificate file in my Langchain code to access Elasticsearch.</p>
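<p>To verify the connection before involving Langchain at all, a quick smoke test with the plain elasticsearch Python client can help. A sketch; replace the placeholders with your own Elasticsearch username and password:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from elasticsearch import Elasticsearch

# Connect via https and verify the server certificate against the
# converted .pem file from above.
es = Elasticsearch(
    "https://localhost:9200",
    basic_auth=("<ES_USER>", "<ES_PASSWORD>"),
    ca_certs="./src/certs/python_es_client.pem")
print(es.info())
</code></pre></div></div>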
<h1 id="step-3-create-embeddings-vectors-using-openai-embeddings-api-and-langchain">Step 3: Create embeddings vectors using OpenAI embeddings API and Langchain</h1>
<p>Now I was ready to create embedding vectors for my recorded dreams.</p>
<p><strong>Note that I was using Langchain v0.0.196. At the time of writing Langchain is under heavy development and changes on a daily basis. It is not guaranteed that interfaces remain the same over the next few weeks or months. Furthermore, due to the heavy development it is not infrequent that some APIs are somewhat inconsistent and don’t follow exactly the syntax you’d expect from other examples. If something does not work, check the docs, or even better: check the code itself including the unit tests.</strong></p>
<p>First, I wanted to know the average and maximum number of words per document, because that would ultimately determine the costs for using the <a href="https://platform.openai.com/docs/guides/embeddings">OpenAI Embeddings API</a>. I created a small script to tokenize my recorded dreams using the spaCy library. Below is a chart with the results. As we can easily spot, the vast majority of recorded dreams contained fewer than 2000 words, and many actually contain fewer than 500 words. Ultimately, I ended up paying just ca. 2.70 USD for indexing all >13k dreams using the <em>text-embedding-ada-002 model</em>, so really affordable.</p>
<p><img src="/public/img/2023-06-13-distribution-of-num-words-in-recorded-dreams.png" alt=""Distribution of number of words per recorded dreams"" /></p>
<p>Here is the code to create the embeddings and immediately index them with Elasticsearch:</p>
<p>File <em>openai_api_key.py</em>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">apikey</span> <span class="o">=</span> <span class="s">'<OpenAI API Key>'</span>
</code></pre></div></div>
<p>This just contains the (paid) OpenAI API key I generated through the OpenAI website. It might be wise to set an upper limit on your monthly payments in case you’re dealing with a somewhat larger number of documents.</p>
<p>File <em>dream-indexer.py</em>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain.embeddings.openai</span> <span class="kn">import</span> <span class="n">OpenAIEmbeddings</span>
<span class="kn">from</span> <span class="nn">langchain.text_splitter</span> <span class="kn">import</span> <span class="n">RecursiveCharacterTextSplitter</span>
<span class="kn">from</span> <span class="nn">langchain.vectorstores</span> <span class="kn">import</span> <span class="n">ElasticVectorSearch</span>
<span class="kn">from</span> <span class="nn">langchain.document_loaders.csv_loader</span> <span class="kn">import</span> <span class="n">CSVLoader</span>
<span class="kn">from</span> <span class="nn">open_api_key</span> <span class="kn">import</span> <span class="n">apikey</span>
<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
<span class="c1"># First line contains the headers
</span> <span class="n">loader</span> <span class="o">=</span> <span class="n">CSVLoader</span><span class="p">(</span><span class="n">file_path</span><span class="o">=</span><span class="s">'./data/dreams-collection-all.csv'</span><span class="p">,</span> <span class="n">csv_args</span><span class="o">=</span><span class="p">{</span>
<span class="s">'delimiter'</span><span class="p">:</span> <span class="s">','</span><span class="p">,</span>
<span class="s">'quotechar'</span><span class="p">:</span> <span class="s">'"'</span><span class="p">,</span>
<span class="s">'fieldnames'</span><span class="p">:</span> <span class="p">[</span><span class="s">'Page'</span><span class="p">,</span> <span class="s">'Dataset Index'</span><span class="p">,</span> <span class="s">'URL'</span><span class="p">,</span> <span class="s">'Post ID'</span><span class="p">,</span> <span class="s">'Post Author'</span><span class="p">,</span> <span class="s">'Post Date'</span><span class="p">,</span> <span class="s">'Post Title'</span><span class="p">,</span> <span class="s">'Post Content'</span><span class="p">]</span>
<span class="p">})</span>
<span class="c1"># Use Langchain CSV loader to load data to memory
</span> <span class="n">data</span> <span class="o">=</span> <span class="n">loader</span><span class="p">.</span><span class="n">load</span><span class="p">()</span>
<span class="c1"># Split documents into meaningful units to index, i.e. first into paragraphs,
</span> <span class="c1"># then into sentences, then into words etc.
</span> <span class="n">text_splitter</span> <span class="o">=</span> <span class="n">RecursiveCharacterTextSplitter</span><span class="p">.</span><span class="n">from_tiktoken_encoder</span><span class="p">(</span>
<span class="c1"># I did not set any specific values, but according to docs default values are:
</span> <span class="c1"># chunk_size=1000, chunk_overlap=0
</span> <span class="p">)</span>
<span class="n">documents</span> <span class="o">=</span> <span class="n">text_splitter</span><span class="p">.</span><span class="n">split_documents</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># Create the actual embedding vectors using the OpenAI embedding API
</span> <span class="n">embeddings</span> <span class="o">=</span> <span class="n">OpenAIEmbeddings</span><span class="p">(</span><span class="n">openai_api_key</span><span class="o">=</span><span class="n">apikey</span><span class="p">)</span>
<span class="c1">#print(embeddings)
</span>
<span class="c1"># Wire everything together and index the embedding vectors into the local Elasticsearch
</span> <span class="n">db</span> <span class="o">=</span> <span class="n">ElasticVectorSearch</span><span class="p">.</span><span class="n">from_documents</span><span class="p">(</span>
<span class="n">documents</span><span class="p">,</span>
<span class="n">embeddings</span><span class="p">,</span>
<span class="n">elasticsearch_url</span><span class="o">=</span><span class="s">"https://<ES_USER>:<ES_PASSWORD>@localhost:9200"</span><span class="p">,</span>
<span class="n">index_name</span><span class="o">=</span><span class="s">"dream-db-index"</span><span class="p">,</span>
<span class="n">ssl_verify</span><span class="o">=</span><span class="p">{</span>
<span class="s">"verify_certs"</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
<span class="s">"basic_auth"</span><span class="p">:</span> <span class="p">(</span><span class="s">"<ES_USER>"</span><span class="p">,</span> <span class="s">"<ES_PASSWORD>"</span><span class="p">),</span>
<span class="s">"ca_certs"</span><span class="p">:</span> <span class="s">"./src/certs/python_es_client.pem"</span><span class="p">,</span>
<span class="p">}</span>
<span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">db</span><span class="p">.</span><span class="n">client</span><span class="p">.</span><span class="n">info</span><span class="p">())</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
<span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
<p>The approach works as follows:</p>
<ol>
<li>Use Langchain’s <code>CSVLoader</code> to load the CSV file containing forum posts from disk. Every post contains several metadata fields like ID, URL etc. The actual content of the post is in the last field ‘Post Content’. Note that I did not drop any metadata but indexed it as well. Later on, this was helpful for e.g. obtaining the post’s original URL while doing Q&As. On my laptop everything fit nicely into memory, so no issues there.</li>
<li>Then I used a <a href="https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html">RecursiveCharacterTextSplitter</a> to split every post to be indexed into sections, sections into sentences, sentences into words, and words potentially into tokens. With a RecursiveCharacterTextSplitter a section typically equates to a paragraph (delimited by a double line break “\n\n”), but there is a lot more to it. There are two default parameters which I did not modify either: <code>chunk_size</code> is the maximum number of characters allowed per chunk. If a section (or paragraph) is longer than that, the splitter recursively breaks it into smaller chunks. The default value is 1000 characters. <code>chunk_overlap</code> is the number of characters by which adjacent chunks are allowed to overlap. This can be important to set if one must be very sure not to lose any text while embedding or indexing. The default value is 0 characters of overlap. In fact, tokenization is a very complicated problem in natural language processing. I did not bother about the details of the implementation, but you could use Spacy or NLTK and similar libraries for your purposes. You can use the <em>from_tiktoken_encoder()</em> factory function to create a text splitter instance that measures chunk length in tiktoken tokens rather than raw characters.</li>
<li>Next, we create a wrapper object for the <a href="https://python.langchain.com/en/latest/modules/models/text_embedding/examples/openai.html?highlight=openaiembeddings#openai">OpenAI embeddings API</a>, handing it the API key.</li>
<li>Finally, we assemble everything using the <a href="https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/elasticsearch.html?highlight=ElasticVectorSearch#elasticvectorsearch-class">ElasticVectorSearch class</a>. By calling the <em>from_documents()</em> function we actually initiate both the embedding calls and the subsequent indexing to Elasticsearch. Replace <em><ES_USER></em> and <em><ES_PASSWORD></em> with the Elasticsearch username and the user’s Elasticsearch password. Also, notice the use of the <code>ssl_verify</code> parameter plus the <code>https</code> URL we are pointing to. It took me a very long time to figure out the <a href="https://github.com/hwchase17/langchain/blob/master/tests/integration_tests/vectorstores/test_elasticsearch.py">exact usage of the ssl_verify parameter</a>, and ultimately I realised that the underlying implementation had apparently changed in the meantime. Notice that we are handing it a reference to the .pem SSL certificate that we created previously. In case of doubt about the exact syntax of the ssl_verify parameter, make sure you check the <a href="https://github.com/hwchase17/langchain/blob/master/tests/integration_tests/vectorstores/test_elasticsearch.py">corresponding unit test case</a>. Some authors out there recommend entirely disabling SSL verification, or, even more extreme, disabling secure connections altogether. Obviously, in a secure environment like your laptop that’s not an issue, but there’s a danger that this gets forgotten and the code ends up in production somehow. Better to do it right from the beginning.</li>
<li>At the end, we print a few details about the Elasticsearch index we just created, just for our own information.</li>
</ol>
<p>The unzipped CSV file containing all dreams was roughly 19 MB on my disk. Using the OpenAI text-embedding-ada-002 model resulted in an Elasticsearch index with >13k vectors of 1536 dimensions and a total index size of roughly 300-400 MB of disk space. So, easily doable on a modern laptop.</p>
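<p>A quick back-of-the-envelope calculation makes this size plausible: the raw vector payload alone is roughly 13,000 × 1536 dimensions × 4 bytes ≈ 80 MB, and the stored source text, metadata and Elasticsearch’s internal index structures account for the rest.</p>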
<p>Awesome! What’s next?</p>
<h1 id="step-5-create-a-semantic-search-engine-and-qa-bot">Step 5: Create a semantic search engine and Q&A bot</h1>
<p>Finally, we will build a simple REST API around our bot through which we can run queries. The API will offer two HTTP POST endpoints: one for a more traditional semantic search that only returns the documents themselves, and a second one with a more advanced Q&A bot that understands ChatGPT-like complex user inquiries. The general approach is:</p>
<ol>
<li>Every query we send first needs to be translated into a vector with the help of the OpenAI embeddings API.</li>
<li>When we have the vector, the ElasticVectorSearch or the RetrievalQAWithSourcesChain can search through the vector database to find the most similar documents and provide an answer.</li>
</ol>
<p>File <em>server-app.py</em>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Start process from a terminal with uvicorn:
# uvicorn server-app:app --reload
</span>
<span class="kn">from</span> <span class="nn">fastapi</span> <span class="kn">import</span> <span class="n">FastAPI</span>
<span class="kn">from</span> <span class="nn">langchain.chains</span> <span class="kn">import</span> <span class="n">RetrievalQAWithSourcesChain</span>
<span class="kn">from</span> <span class="nn">langchain.chat_models</span> <span class="kn">import</span> <span class="n">ChatOpenAI</span>
<span class="kn">from</span> <span class="nn">langchain.embeddings.openai</span> <span class="kn">import</span> <span class="n">OpenAIEmbeddings</span>
<span class="kn">from</span> <span class="nn">langchain.vectorstores</span> <span class="kn">import</span> <span class="n">ElasticVectorSearch</span>
<span class="kn">from</span> <span class="nn">openai_api_key</span> <span class="kn">import</span> <span class="n">apikey</span>
<span class="c1"># OpenAI embeddings API wrapper
</span><span class="n">embedding</span> <span class="o">=</span> <span class="n">OpenAIEmbeddings</span><span class="p">(</span><span class="n">openai_api_key</span><span class="o">=</span><span class="n">apikey</span><span class="p">)</span>
<span class="c1"># Reference to the Elasticsearch vector database to query
</span><span class="n">db</span> <span class="o">=</span> <span class="n">ElasticVectorSearch</span><span class="p">(</span>
<span class="n">elasticsearch_url</span><span class="o">=</span><span class="s">"https://<ES_USER>:<ES_PASSWORD>@localhost:9200"</span><span class="p">,</span>
<span class="n">index_name</span><span class="o">=</span><span class="s">"dream-db-index"</span><span class="p">,</span>
<span class="n">embedding</span><span class="o">=</span><span class="n">embedding</span><span class="p">,</span>
<span class="n">ssl_verify</span><span class="o">=</span><span class="p">{</span>
<span class="s">"verify_certs"</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
<span class="s">"basic_auth"</span><span class="p">:</span> <span class="p">(</span><span class="s">"<ES_USER>"</span><span class="p">,</span> <span class="s">"<ES_PASSWORD>"</span><span class="p">),</span>
<span class="s">"ca_certs"</span><span class="p">:</span> <span class="s">"./src/certs/python_es_client.pem"</span><span class="p">,</span>
<span class="p">}</span>
<span class="p">)</span>
<span class="c1"># Instance of a Q&A bot.
</span><span class="n">chain</span> <span class="o">=</span> <span class="n">RetrievalQAWithSourcesChain</span><span class="p">.</span><span class="n">from_chain_type</span><span class="p">(</span>
<span class="n">llm</span><span class="o">=</span><span class="n">ChatOpenAI</span><span class="p">(</span><span class="n">openai_api_key</span><span class="o">=</span><span class="n">apikey</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">),</span>
<span class="n">chain_type</span><span class="o">=</span><span class="s">"stuff"</span><span class="p">,</span>
<span class="n">retriever</span><span class="o">=</span><span class="n">db</span><span class="p">.</span><span class="n">as_retriever</span><span class="p">(),</span>
<span class="n">return_source_documents</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Create a FastAPI REST server
</span><span class="n">app</span> <span class="o">=</span> <span class="n">FastAPI</span><span class="p">()</span>
<span class="c1"># REST API that does not do a lot, but helpful to see if everything's
# running fine.
</span><span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"/"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">index</span><span class="p">():</span>
<span class="k">return</span> <span class="p">{</span>
<span class="s">"message"</span><span class="p">:</span> <span class="s">"Hello World!"</span>
<span class="p">}</span>
<span class="c1"># REST API for semantic search. This will return documents that are "similar"
# to the given search terms but not understand ChatGPT-like queries.
</span><span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">'/similarity'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">ask</span><span class="p">(</span><span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="n">similarity_search</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
<span class="k">return</span> <span class="n">result</span>
<span class="c1"># REST API for questions and answers. This will be able to interpret ChatGPT
# like conversational complex queries.
</span><span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">'/qa'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">ask</span><span class="p">(</span><span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">chain</span><span class="p">({</span><span class="s">'question'</span><span class="p">:</span> <span class="n">query</span><span class="p">},</span> <span class="n">return_only_outputs</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">return</span> <span class="n">result</span>
</code></pre></div></div>
<p>Let’s look at the details.</p>
<ol>
<li>First, we create again a wrapper object around the OpenAI API using the <code>OpenAIEmbeddings</code> object and handing it the api key as a parameter.</li>
<li>Next, we create again an <code>ElasticVectorSearch</code> object which serves as an interface to our local Elasticsearch instance.</li>
<li>
<p>Here comes the really cool part. We are calling <code>RetrievalQAWithSourcesChain.from_chain_type</code> to create an actual Q&A bot which can answer complex queries about our data. There also exists a simpler class called <a href="https://python.langchain.com/en/latest/reference/modules/chains.html?highlight=retrievalqa#langchain.chains.RetrievalQA">RetrievalQA</a> which does <em>not</em> return references to the documents found. In many situations, however, you do want references to the underlying documents, so that in case of doubt you can check yourself what the bot is telling you! What was very confusing to me: in Langchain v0.0.196 there apparently exist multiple ways to execute the actual Q&A query, and depending on the path you follow the syntax looks significantly different. In particular, there is an important difference in syntax between using RetrievalQA and RetrievalQAWithSourcesChain. Compare the official docs <a href="https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html?highlight=retrievalqa">here</a> and <a href="https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/weaviate.html?highlight=retrievalqawithsourceschain#question-answering-with-sources">here</a>. One expects you to call <code>qa.run('my query')</code> whereas the other one expects you to call <code>chain({'question': 'my query'}, return_only_outputs=False)</code>. Very different syntax indeed.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1"># RetrievalQA according to docs here: https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html?highlight=retrievalqa
</span> <span class="n">query</span> <span class="o">=</span> <span class="s">"Who let the dogs out?"</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">qa</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
<span class="c1"># RetrievalQAWithSourcesChain.from_chain_type according to docs here: https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/weaviate.html?highlight=retrievalqawithsourceschain#question-answering-with-sources
</span> <span class="n">query</span> <span class="o">=</span> <span class="s">"Who let the dogs out?"</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">chain</span><span class="p">(</span>
<span class="p">{</span><span class="s">"question"</span><span class="p">:</span> <span class="s">"Who let the dogs out?"</span><span class="p">},</span>
<span class="n">return_only_outputs</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div> </div>
<p>Hopefully, this will be improved in a future version of the library.</p>
</li>
</ol>
<p>Alright, we can now run our application with uvicorn v0.21.1. The interactive API docs will then be available at http://localhost:8000/docs. From there, we can run calls to our APIs. Obviously, not yet something for end users, but great for quick testing of our two semantic search and Q&A services.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uvicorn server-app:app --reload
</code></pre></div></div>
<p>Here, <code>server-app</code> refers to the name of the server-app.py file, whereas <code>app</code> refers to the FastAPI server instance variable defined within the file. The server is now available at http://localhost:8000; calling it returns:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"message"</span><span class="p">:</span><span class="s2">"Hello World!"</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Things become a little more interesting if you navigate to http://localhost:8000/docs. This will open an OpenAPI/Swagger-like webpage through which you can call the REST API endpoints from a UI.</p>
<p>Here’s what you should see:</p>
<p><img src="/public/img/2023-06-12-fastapi-rest-endpoints.png" alt=""OpenAPI UI with FastAPI REST endpoints"" /></p>
<p>Let’s use the Q&A bot to post this query:</p>
<blockquote>
<p>Find me all the dreams containing an animal, then provide me a list with their activities.</p>
</blockquote>
<p>Here’s the result:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"question"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Find me all the dreams containing an animal, then provide me a list with their activities."</span><span class="p">,</span><span class="w">
</span><span class="nl">"answer"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Dream 1: Animals running past the dreamer on a meadow. No specific activity mentioned.</span><span class="se">\n</span><span class="s2">Dream 2: Dead cats on the ground in a forest, vultures on a tree, a white duck in the dreamer's house. Dreamer tries to catch the duck but is afraid of it. Later, dreamer is in a classroom with her dogs and mother, and the girls in the class talk to her about her pregnancy. No specific activity mentioned for the animals.</span><span class="se">\n</span><span class="s2">Dream 3: Dreamer frequently dreams of animals, sometimes with a threatening aspect. In one dream, the dreamer is in an apartment with a cage of guinea pigs, and a dog comes in. Dreamer is afraid of the dog but picks it up and it turns into a cat. In another dream, dreamer is surrounded by lions, leopards, elephants, crocodiles, and possibly snakes, and they want to eat her. Dreamer tells them she wants to live. No specific activity mentioned for the animals in the first dream, but in the second dream, the animals want to eat the dreamer. </span><span class="se">\n</span><span class="s2">"</span><span class="p">,</span><span class="w">
</span><span class="nl">"sources"</span><span class="p">:</span><span class="w"> </span><span class="s2">"./data/dreams-collection-all.csv"</span><span class="p">,</span><span class="w">
</span><span class="nl">"source_documents"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"document"</span><span class="p">:</span><span class="w"> </span><span class="s2">"...document content 1..."</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"document"</span><span class="p">:</span><span class="w"> </span><span class="s2">"...document content 2..."</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"document"</span><span class="p">:</span><span class="w"> </span><span class="s2">"...document content 3..."</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"Document"</span><span class="p">:</span><span class="w"> </span><span class="s2">"...document content 4..."</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Imagine how difficult it would be to achieve this with a traditional text search engine. Fantastic!</p>
<p>Note that the number of returned documents is limited to four in our example. This means the returned answer and the returned documents are neither exhaustive nor necessarily complete. This is important to understand: in a semantic search engine there is no ultimate measure of correctness, so as users we have to specify how many results we want to inspect. Of course, we end up paying for every query due to the usage of the OpenAI service. For this reason, several chain classes have <code>max_tokens_limit</code> and <code>reduce_k_below_max_tokens</code> fields which can be used to increase or decrease the number of document tokens handed to the LLM. If your budget is bigger you can afford to increase the limits a bit, but you’ll end up paying more.</p>
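<p>As a minimal sketch of what this could look like for the chain defined above (the parameter values are purely illustrative, not recommendations):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Same chain as before, but capping how many document tokens are stuffed
# into the prompt. reduce_k_below_max_tokens tells the chain to drop
# retrieved documents until the total stays below max_tokens_limit.
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOpenAI(openai_api_key=apikey, temperature=0),
    chain_type="stuff",
    retriever=db.as_retriever(),
    return_source_documents=True,
    reduce_k_below_max_tokens=True,
    max_tokens_limit=2000)  # example value
</code></pre></div></div>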
<p>In case you need to measure the quality of the Q&A system more thoroughly, there is also a way to <a href="https://python.langchain.com/en/latest/use_cases/evaluation/data_augmented_question_answering.html?highlight=retrievalqa">evaluate the returned answers using QAEvalChain</a>.</p>
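<p>A rough sketch of such an evaluation, assuming you hand-craft a few gold-standard question/answer pairs yourself; the example data below is made up:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from langchain.chat_models import ChatOpenAI
from langchain.evaluation.qa import QAEvalChain

# Hand-crafted gold standard (hypothetical examples)
examples = [{"query": "Which animals appear in dreams about forests?",
             "answer": "Cats, vultures and a white duck."}]
# Predictions as returned by the Q&A chain
predictions = [{"result": "Dead cats, vultures on a tree and a white duck."}]

eval_chain = QAEvalChain.from_llm(ChatOpenAI(openai_api_key=apikey, temperature=0))
graded = eval_chain.evaluate(examples, predictions,
                             question_key="query",
                             answer_key="answer",
                             prediction_key="result")
print(graded)  # e.g. [{'text': 'CORRECT'}]
</code></pre></div></div>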
<h1 id="conclusion">Conclusion</h1>
<p>When I find some time I’ll add a simple Streamlit interface. Making calls from the FastAPI UI is not very user-friendly, and creating something better looking in Streamlit really does not take a lot of time. For some inspiration have <a href="https://www.youtube.com/watch?v=MlK6SIjcjE8">a look at this YouTube tutorial</a>.
Also, I want to learn more about the agent capabilities of Langchain. There are multiple videos on YouTube on this topic.</p>
<p>Another thing I believe will soon be needed is memory for my semantic Q&A bot. One issue I realised early on was that the bot has absolutely no memory of what was said a moment before. For example, I ran a pretty intelligent query and then wanted to build upon the results. Immediately after querying, the bot had already forgotten my previous query, hence I always needed to design the queries so that they constitute the “union” of my previous and my current intent. Such Q&A queries may become pretty long, with many conditions: “Find me all documents that contain XYZ or ABC but not QRS, then provide me a list where condition is that bla bla bla.” It would be a lot easier to first run some parts of this query, and then tell the bot to remember what it just returned and build on it. Obviously, statefulness could also introduce errors if the user forgets about it, so ideally there should be a way to enable or disable state.</p>
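<p>Langchain already offers building blocks in this direction. Below is a rough sketch of a conversational variant of our bot, reusing the <code>db</code> and <code>apikey</code> objects from above. I have not wired this into my setup yet, and given the pace of development the interfaces may well have changed:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory

# Keeps the chat history in memory so that follow-up questions can
# refer back to earlier answers.
memory = ConversationBufferMemory(memory_key="chat_history",
                                  return_messages=True)
chat_chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(openai_api_key=apikey, temperature=0),
    retriever=db.as_retriever(),
    memory=memory)

result = chat_chain({"question": "Find me dreams containing an animal."})
followup = chat_chain({"question": "Which of those animals were threatening?"})
</code></pre></div></div>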
<p>One thing I noticed during this work was how useful a semantic Q&A bot could be for hermeneutical-qualitative analysis of texts. In fact, I believe it could constitute a bridge between qualitative and quantitative text analysis. Imagine you have a document set of transcriptions of interviews you conducted. Going through all of them manually is a huge amount of work. A semantic Q&A bot allows you to get specific answers within seconds. Also, if the bot returns the document sources it’s easy to cross-check the veracity of the bot’s claims, although it might still have missed some important results. As we all know, LLMs do make mistakes sometimes.</p>
<hr />
<h1 id="references">References</h1>
<ul>
<li>Dylan Castillo’s blog post: <a href="https://dylancastillo.co/semantic-search-elasticsearch-openai-langchain/">https://dylancastillo.co/semantic-search-elasticsearch-openai-langchain/</a></li>
<li>Connecting from Python to Elasticsearch: <a href="https://cdax.ch/2022/02/20/elasticsearch-python-workshop-1-the-basics/">https://cdax.ch/2022/02/20/elasticsearch-python-workshop-1-the-basics/</a></li>
<li>OpenAI APIs: <a href="https://platform.openai.com/docs/guides/embeddings">https://platform.openai.com/docs/guides/embeddings</a></li>
<li>ssl_verify parameter in ElasticVectorSearch.from_documents function: <a href="https://github.com/hwchase17/langchain/blob/master/tests/integration_tests/vectorstores/test_elasticsearch.py">https://github.com/hwchase17/langchain/blob/master/tests/integration_tests/vectorstores/test_elasticsearch.py</a></li>
<li>Langchain Crashcourse - Youtube video by Nicholas Renotte: <a href="https://www.youtube.com/watch?v=MlK6SIjcjE8">https://www.youtube.com/watch?v=MlK6SIjcjE8</a></li>
</ul>
GPT Models as an Operating System / Communication Protocol for Autonomous Agents2023-05-18T00:00:00+02:00http://fabian-kostadinov.github.io/2023/05/18/gpt-models-as-an-operating-system-communication-prototcol-for-autonomous-agents<p>While GPT models have been around for a few years, it was only after the arrival of <a href="https://chat.openai.com/">OpenAI’s ChatGPT</a> that they caught the wider attention of the public. Frantically, managers and IT specialists alike are searching for ways how they can make best use of this new technology. As usual, the first ideas will turn out to be not necessarily the most mature, and it will take some time until the full potential – and the restrictions – of these GPT models are generally understood.<span class="more"></span></p>
<p>With the notable exception of <a href="https://github.com/Significant-Gravitas/Auto-GPT">Auto-GPT</a> most proposed use cases are focused in one way or another on classical Natural Language Understanding (NLU) cases: Answering questions about a corpus of documents, text summarisation, simplifying language, chatting with users, and so on. Probably every large corporation has already had the idea of creating their own HelpdeskGPT assistant, and project teams are trying out whether or not it is possible to train ChatGPT on the company’s internal data to enable it to answer questions on internal processes.</p>
<p>While these developments are exciting enough, I am arguing that treating GPT models as a sophisticated NLU engine is not yet tapping into the full potential this technology has. Two very recent developments are worth noting in this context.</p>
<p>The first one is OpenAI’s latest attempt to allow selected developers to create <a href="https://platform.openai.com/docs/plugins/">plugins</a>. A plugin is, in short, nothing but the ability of ChatGPT as a natural language model to make a call to a specific REST API. This approach will allow completely new ways for users to interact with ChatGPT. Imagine a user talking in natural language with a GPT model, telling it to book a restaurant table at a given place and time. Or think of a user browsing the current cinema movies on play and then deciding to book a show for two. Clearly, there are many details yet to be sorted out such as security concerns, financial transactions and so on, but the intention is clear: create a virtual assistant that helps a user navigate the complexity of the world by means of natural language.</p>
<p>The second latest development is the emergence of <a href="https://github.com/Significant-Gravitas/Auto-GPT">Auto-GPT</a>. Similar to ChatGPT’s plugins, <a href="https://github.com/Significant-Gravitas/Auto-GPT-Plugins">Auto-GPT’s plugins</a> allow a user to make calls to third-party APIs. However, many users are currently fascinated by its ability to develop “execution plans”, i.e. breaking down a goal-oriented task into multiple sub-steps and executing them individually. Yet, <a href="https://georgesung.github.io/ai/autogpt-arch/">Auto-GPT’s approach and its architecture</a> are distinct from ChatGPT’s. Auto-GPT runs locally on your own computer (e.g. in a VM). Its magic is to i) tell ChatGPT quite precisely upfront what the allowed responses are, ii) use ChatGPT to figure out the intent of the user, and iii) have it select the appropriate response in a JSON format, plugging in missing parameters. Auto-GPT parses the selected response and derives from it which API call to make. APIs in this context are represented as <a href="https://github.com/Significant-Gravitas/Auto-GPT-Plugins">Auto-GPT plugins</a> (which are technically different from ChatGPT plugins) running locally on the same computer as Auto-GPT itself. This allows a much greater degree of freedom, for example the manipulation of local files, calls from local to the internet, calling third-party services from local that are not allowed from within ChatGPT and so on. Obviously, security is a very big concern here.</p>
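<p>To make this concrete, the constrained responses Auto-GPT elicits from the model are JSON documents roughly of the following shape. Treat this as an illustration of the idea rather than the exact schema, which keeps changing:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "thoughts": {
    "text": "The user wants a table for two tonight.",
    "plan": "1. search restaurants 2. call the booking API",
    "speak": "I will look for a free table."
  },
  "command": {
    "name": "book_table",
    "args": { "restaurant": "Trattoria Roma", "persons": 2, "time": "19:30" }
  }
}
</code></pre></div></div>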
<p>What both examples demonstrate to a different degree is the possibility to use a GPT model as the glue code to navigate from user intents expressed in natural language to technical commands expressed in a formalized language (JSON). As stated above, it is this ability that will allow building virtual assistants of a yet unknown complexity and power.</p>
<p>But the same technology allows yet another type of use case that in my personal view will be the real game changer. The <a href="https://arxiv.org/abs/2302.04761">translation of intents expressed in natural language to formal API calls</a> can be used to connect multiple autonomous agents with each other. The agents serve as entry and exit points for third-party systems to communicate with each other, acting as translators of formal API calls to natural language and vice versa back and forth. In this way, the GPT model becomes a communication protocol in natural language for two autonomous agents to talk to each other. Alternatively, we could state that GPT serves as an operating system for autonomous agents to interact with each other. The crucial point here is that in the past all attempts to build autonomous interacting agents remained limited by the creation of a common formalized communication protocol or ontology. Each agent had to be taught exactly how to interact with each other agent it ever needed to communicate with, thus resulting in an exponential explosion of complexity due to the increasing number of communication paths and protocols between agents. All attempts to use formal logic (e.g. predicate logic and the like) remained relatively limited to very narrow domains. With GPT it should be possible to reduce the number of communication protocols to only one: natural language. Agents will be able to talk to each other like humans do. So, a virtual assistant would literally ask a restaurant agent in natural language – even overcoming language translation barriers: “Can I book a table for 4 persons next Thursday evening?” With the second agent answering: “Sure, do you prefer a table next to the bar or to the doorway? And what is your preferred time?” Obviously, this will result in lots of failed attempts to communicate between agents, misinterpretations, perhaps even Freudian slips and the like; in short, it will result in a style of communication among agents that is not too far away from how humans communicate. Occasionally, agents might even create confusion and chaos, for example ordering the wrong pair of shoes without the user noticing. But that is the cost to pay for a universal communication protocol.</p>
<p>Obviously, establishing a network of autonomous agents communicating with natural language among themselves will require a lot more engineering than is currently available. We will need ways to identify agents to each other, secured connections (e.g. through public/private key exchanges), secure financial transactions, security for sharing specific items of sensitive data and more. In the next few months and years we will see an explosion of added services running on top of GPT models as the underlying operating system or communication bus to enable agents to talk to each other.</p>
<p>The possibilities of autonomous agents talking to each other via natural language are beyond my comprehension at this point. As are the dangers. Imagine an ERP system noticing a shortage of certain screws. It sends an event to its own autonomous agent. The agent reaches out to a GPT model or a specific “registry agent” or “service search agent”, asking whether there is any other agent out there who could sell that specific type of screw. The GPT model provides two distinct sales agents that have this particular type of screw on stock. Together with this information it also provides the required API protocol, and pre-fills the data to be sent to each of those protocols. The ERP agent now makes calls to the APIs of both sales agents, and finally decides from which to buy the screws. Obviously, some common payment gateway must be used. A few days later the shipment is made to the company buying the screws.
What is genuinely different from the past in this example is that the ERP agent does not need prior knowledge of the exact protocols the other two sales agents are using. This is the part that GPT takes care of. GPT is the middleman here to bridge the gap between the natural language the ERP agent and the sales agents are using and the formal APIs they are offering. GPT must have enough understanding to be able to interpret the APIs of all those agents. But this is <a href="https://arxiv.org/abs/2302.04761">a problem that is already being actively researched</a>. The most promising approach so far I have seen is to teach the GPT model upfront about the correct (and incorrect) use of each API that needs to be added. With a few examples, it should be able to generalize the API use by itself, and thus be able to add yet another API to its list of offered API endpoints. Thus, GPT becomes the aforementioned communication bus for autonomous agents.</p>
<p>What Auto-GPT already offers at this point is an extension to this setup in the form of <a href="https://weaviate.io/blog/autogpt-and-weaviate">“memory”</a> (e.g. <a href="https://ernestodotnet.medium.com/step-by-step-guide-to-enhancing-autogpt-with-pinecone-vector-database-in-colab-unlocking-the-power-a2869b0b9753">through Pinecone</a>). Once two agents have succeeded in communicating with each other, a lot of the prior discovery and setup work for making API calls is no longer needed. The agent should have learned by now how to contact those two sales agents, and it can keep this knowledge in memory. Only when the interaction with either of the two sales agents unexpectedly stops working does the ERP agent have to contact the GPT communication bus again to figure out the right interaction model with these two or some new sales agents. How exactly this is implemented is beyond the scope of this article; the reader is referred to Auto-GPT once more.</p>
How to test Firestore security rules web v9 with Jest for your Angular app2022-12-09T00:00:00+01:00http://fabian-kostadinov.github.io/2022/12/09/how-to-test-firestore-security-rules-with-jest-in-angular<p>Previously, I wrote why <a href="/2022/12/08/do-not-use-karma-and-jasmine-to-write-unit-tests-for-firestore-security-rules-in-angular-14/">using Karma as a test runner to unit test your Firestore security rules for your Angular app is a bad idea</a>. Let’s see then how you can use Jest in combination with <a href="https://firebase.google.com/docs/rules/emulator-setup">Firebase Firestore emulators</a>. I’ll be using <a href="https://angular.io/">Angular v14</a>, <a href="https://firebase.google.com/">Firebase web v9</a>, the node package <a href="https://www.npmjs.com/package/@firebase/rules-unit-testing">@firebase/rules-unit-testing v2.0.5</a> and <a href="https://jestjs.io/">Jest v29.3</a>.<span class="more"></span></p>
<h1 id="the-setup">The setup</h1>
<p>Jest will be used as our unit testing framework and as a test runner. In our unit tests I will use the node package @firebase/rules-unit-testing to connect to a Firestore emulator process running on localhost:8080. The Firebase emulator suite contains various useful tools to develop and test your Angular app locally, i.e. without the need to integrate with a production database running in Google cloud. Not only do you avoid reading from and writing to your production database instance, thus messing around with your productive data, but you also save money by not having to run requests against a paid Google cloud service. Simply develop your security rules locally and then deploy them to your production environment running in the Google cloud. <br />
For this tutorial, I assume that you have already:</p>
<ol>
<li>created an Angular v14 app,</li>
<li>and installed the Firebase CLI tools including the Firebase emulators.
The <a href="https://firebase.google.com/docs/rules/manage-deploy#use_the">command to initialize your Firestore</a> to your Angular app is <code>firebase init firestore</code>, by the way.</li>
</ol>
<h1 id="step-1-setting-up-the-environment">Step 1: Setting up the environment</h1>
<p>Let’s first install Jest inside our Angular app folder. The good news is that you don’t even have to remove Karma/Jasmine; both can be installed in parallel without interfering with each other.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd my-angular-app
npm install --save-dev jest
</code></pre></div></div>
<p>Optionally, you might also want to install the @types/jest node package if you intend to use Jest types globally in your unit tests. (I don’t use them myself, so I don’t install them.)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>npm install --save-dev @types/jest
</code></pre></div></div>
<p>Second, we’ll need the @firebase/rules-unit-testing Node package. We won’t need this in production, hence I install it as a development dependency only.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>npm install --save-dev @firebase/rules-unit-testing
</code></pre></div></div>
<p>That was easy.</p>
<p>I’m not a big fan of making Jest available globally. Rather, I want to start it with a corresponding script specified in my package.json file. So, in package.json I add a new script:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="err">...</span><span class="w">
</span><span class="nl">"scripts"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="err">...</span><span class="p">,</span><span class="w">
</span><span class="nl">"test:firestore"</span><span class="p">:</span><span class="w"> </span><span class="s2">"jest --detectOpenHandles src/firebase-test-cases/fiestore.rules.test.js"</span><span class="p">,</span><span class="w">
</span><span class="err">...</span><span class="p">,</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="err">...</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>This allows me to switch to my Angular app folder and run the unit tests contained in <code>src/firebase-test-cases/firestore.rules.test.js</code> with the command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>npm run test:firestore
</code></pre></div></div>
<p>Neat, ay! Note the parameter <code>--detectOpenHandles</code>. This flag tells Jest to detect and report open handles, such as unresolved promises or unclosed database and network connections, that would keep the test process from exiting cleanly after our test cases have completed.</p>
<p>If you’re using Visual Studio Code and would like to run Jest in debug mode, then you can also create a new runtime environment configuration. Open the file <code>.vscode/launch.json</code> and add another configuration. This is how it looks in my setup using VS Code v1.74.0:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0.2.0"</span><span class="p">,</span><span class="w">
</span><span class="nl">"configuration"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="err">...</span><span class="p">,</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"node"</span><span class="p">,</span><span class="w">
</span><span class="nl">"request"</span><span class="p">:</span><span class="w"> </span><span class="s2">"launch"</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Jest debug current file"</span><span class="p">,</span><span class="w">
</span><span class="nl">"program"</span><span class="p">:</span><span class="w"> </span><span class="s2">"${workspaceFolder}/node_modules/.bin/jest"</span><span class="p">,</span><span class="w">
</span><span class="nl">"args"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"${fileBasenameNoExtttension}"</span><span class="p">,</span><span class="w"> </span><span class="s2">"--detectOpenHandles"</span><span class="p">],</span><span class="w">
</span><span class="nl">"console"</span><span class="p">:</span><span class="w"> </span><span class="s2">"integratedTerminal"</span><span class="p">,</span><span class="w">
</span><span class="nl">"disableOptimisticBPs"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"windows"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"program"</span><span class="p">:</span><span class="w"> </span><span class="s2">"${workspaceFolder}/node_modules/jest/bin/jest"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="err">...</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>This configuration allows me to run a Jest unit test from within a file I have opened in VS Code and then clicking the Run and Debug configuration called <em>Jest debug current file</em>.</p>
<h1 id="step-2-firing-up-the-firestore-local-emulator">Step 2: Firing up the Firestore local emulator</h1>
<p>The Firebase Firestore local emulator includes a very powerful Firestore database running locally on your dev machine. Its APIs are accessible by default through localhost:8080. It not only allows creating new collections and reading and writing documents, but it is also able to emulate the authentication mechanics plus the corresponding authorisation and validation rules via Firestore security rules. Hence, you can unit test all of those elements locally. Isn’t that cool? Additionally, it also has a web UI accessible by default at http://localhost:4000 that lets you have a look at the content of your emulated Firestore database, which is very convenient for development purposes. Note that the emulator requires an installation of Java, which should be available on most computers these days. <br />
The initialization of Firebase tools has created several files and folders in your Angular app’s root folder:</p>
<ul>
<li>firebase.json: Contains general Firebase hosting information</li>
<li>firestore.indexes.json: Information regarding Firestore DB indexes</li>
<li>firestore.rules: This is the default file containing Firestore security rules. This is what we are going to unit test.</li>
<li>(.firebase/): A hidden folder created by Firebase tools. Not meant to be manipulated by us.</li>
<li>(firestore-debug.log): Not surprisingly a log file.</li>
</ul>
<p>If you want to make use of the web UI and inspect your data, make sure that you specify a <code>--project=<project_id></code> parameter when starting up the emulators, like so (you will later need to specify the same projectId inside your test case):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>firebase emulators:start --project=test-my-angular-app
</code></pre></div></div>
<p>If you open the file firebase.json you can see a section towards the end specifying dependencies of firestore.indexes.json and firestore.rules. If you haven’t changed the defaults then the entry looks like so:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">...</span><span class="w">
</span><span class="nl">"firestore"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"rules"</span><span class="p">:</span><span class="w"> </span><span class="s2">"firestore.rules"</span><span class="p">,</span><span class="w">
</span><span class="nl">"indexes"</span><span class="p">:</span><span class="w"> </span><span class="s2">"firestore.indexes.json"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="err">...</span><span class="w">
</span></code></pre></div></div>
<p>The emulator will thus read the firebase.json file first, and then read the firestore.rules file. It will then start a Java process which contains the actual emulated services. To shut down the emulators, simply go to your console and kill the process you started, e.g. by pressing Ctrl+C, and also close the command window running the Java process.</p>
<h1 id="step-3-writing-the-skeleton-of-our-unit-test">Step 3: Writing the skeleton of our unit test</h1>
<p>As mentioned before I am writing my unit test in plain Javascript in a file called <code>src/firebase-test-cases/firestore.rules.test.js</code>. First, I import several packages that I’ll need in the unit test:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="p">{</span>
<span class="nx">assertFails</span><span class="p">,</span>
<span class="nx">assertSucceeds</span><span class="p">,</span>
<span class="nx">initializeTestEnvironment</span><span class="p">,</span>
<span class="p">}</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">@firebase/rules-unit-testing</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">const</span> <span class="p">{</span>
<span class="nx">getDoc</span><span class="p">,</span>
<span class="nx">setDoc</span><span class="p">,</span>
<span class="nx">addDoc</span><span class="p">,</span>
<span class="nx">deleteDoc</span><span class="p">,</span>
<span class="nx">collection</span>
<span class="p">}</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">@firebase/firestore</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">fs</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">fs</span><span class="dl">'</span><span class="p">);</span>
<span class="c1">// Our instance of RulesTestEnvironment</span>
<span class="nx">describe</span><span class="p">(</span><span class="dl">"</span><span class="s2">Test Firestore security rules</span><span class="dl">"</span><span class="p">,</span> <span class="p">()</span> <span class="o">=></span> <span class="p">{</span>
<span class="c1">// TODO</span>
<span class="p">});</span>
</code></pre></div></div>
<p>Have a look at the <code>initializeTestEnvironment</code> function. This function bootstraps our test environment object. It is asynchronous and returns a <code>Promise<RulesTestEnvironment></code> object. As an argument it takes a <a href="https://firebase.google.com/docs/reference/emulator-suite/rules-unit-testing/rules-unit-testing.md#initializetestenvironment">TestEnvironmentConfig object</a> that specifies the various parameters of the Firestore local emulator environment; the Firestore-specific settings (rules, host, port) are nested under a <em>firestore</em> key. Note the <em>projectId</em> parameter. This takes the same value that we specified when we started the firebase emulator with the command <code>firebase emulators:start --project=test-my-angular-app</code>. If you set a different value, you will end up wondering why your local Firestore emulator web UI never seems to show any data, yet your test case successfully writes to and reads from a Firestore database that you cannot seem to see. Let’s initialize the test environment object and add some general cleanup code to be executed in between tests and at the end of all unit tests.</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span> <span class="nx">testEnv</span><span class="p">;</span> <span class="c1">// Our instance of RulesTestEnvironment</span>
<span class="nx">beforeAll</span><span class="p">(</span><span class="k">async</span> <span class="p">()</span> <span class="o">=></span> <span class="p">{</span>
<span class="nx">testEnv</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">initializeTestEnvironment</span><span class="p">({</span>
<span class="na">projctId</span><span class="p">:</span> <span class="dl">"</span><span class="s2">test-my-angular-app</span><span class="dl">"</span><span class="p">,</span>
<span class="na">rules</span><span class="p">:</span> <span class="nx">fs</span><span class="p">.</span><span class="nx">readFileSync</span><span class="p">(</span><span class="dl">"</span><span class="s2">firestore.rules</span><span class="dl">"</span><span class="p">,</span> <span class="p">{</span><span class="dl">"</span><span class="s2">encoding</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">utf-8</span><span class="dl">"</span><span class="p">}),</span>
<span class="na">host</span><span class="p">:</span> <span class="dl">"</span><span class="s2">localhost</span><span class="dl">"</span><span class="p">,</span>
<span class="na">port</span><span class="p">:</span> <span class="dl">"</span><span class="s2">8080</span><span class="dl">"</span>
<span class="p">}).</span><span class="nx">then</span><span class="p">(</span><span class="nx">resultTestEnv</span> <span class="o">=></span> <span class="p">{</span>
<span class="c1">// The resultTestEnv variable now contains a RulesTestEnvironment object. We must</span>
<span class="c1">// return it from the then clause to resolve the promise and assign it to the</span>
<span class="c1">// testEnv variable.</span>
<span class="k">return</span> <span class="nx">resultTestEnv</span><span class="p">;</span>
<span class="p">}).</span><span class="k">catch</span><span class="p">(</span><span class="nx">error</span> <span class="o">=></span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">error</span><span class="p">(</span><span class="nx">error</span><span class="p">);</span>
<span class="p">});</span>
<span class="c1">// TODO: Initialize Firestore with some test data</span>
<span class="p">});</span>
<span class="nx">beforeEach</span><span class="p">(</span><span class="k">async</span> <span class="p">()</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">testEnv</span> <span class="o">!=</span> <span class="kc">undefined</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// To clean up emulator state and data in between tests call:</span>
<span class="c1">// await testEnv.cleanup();</span>
<span class="p">}</span>
<span class="p">});</span>
<span class="nx">afterAll</span><span class="p">(</span><span class="k">async</span> <span class="p">()</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">testEnv</span> <span class="o">!=</span> <span class="kc">undefined</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// Remove all data from our running emulated firestore instance</span>
<span class="c1">// await testEnv.clearFirestore();</span>
<span class="p">}</span>
<span class="p">});</span>
</code></pre></div></div>
<p>There are several points worth noting. See how we are using async/await in the <code>beforeAll</code>, <code>beforeEach</code> and <code>afterAll</code> functions? Each of those functions takes a callback as an argument. You must make sure to wait until those asynchronous callbacks have resolved, otherwise the test run simply continues immediately rather than waiting until the setup and cleanup actions have completed. According to the <a href="https://firebase.google.com/docs/reference/emulator-suite/rules-unit-testing/rules-unit-testing.rulestestenvironment">docs</a> <code>RulesTestEnvironment.clearFirestore()</code> and <code>RulesTestEnvironment.cleanup()</code> return a <code>Promise<void></code> object, so we must <code>await</code> their completion too. <br />
Another important point is how we read the firestore.rules file using <code>fs.readFileSync</code>. Instead of reading the file we could also simply provide the rules in a string variable. This is fine for simple security rules, but your rules can easily become more complicated once you start writing validation rules for your data, so keeping them in a separate file is definitely the better approach.</p>
<h1 id="step-4-initializing-some-data-to-our-emulated-firestore-database">Step 4: Initializing some data to our emulated Firestore database</h1>
<p>Great, we can now create a test environment to work with. But how do we add some initial data to our database, so that our unit tests have something to work with? The problem is that we are still in the setup phase, so we do not want to perform a potentially complicated authentication first. How do we solve that? The solution is yet another function: <code>RulesTestEnvironment.withSecurityRulesDisabled(callback)</code>. This function executes a callback function that returns a <code>Promise<void></code> and to which it provides a <em>context</em> argument. The context argument can be used to obtain a reference to the local Firestore object by calling <code>context.firestore()</code> - but with security rules entirely disabled. We effectively circumvent Firestore authentication in our emulated Firestore environment! Sounds complicated, so let’s look at an example. A very frequent need is a <em>users</em> collection whose documents contain information on our users such as full name, address, maybe a URL to a profile picture and other user info.</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Within the beforeAll function after having initalized testEnv variable:</span>
<span class="k">await</span> <span class="nx">testEnv</span><span class="p">.</span><span class="nx">withSecurityRulesDisabled</span><span class="p">(</span><span class="k">async</span> <span class="p">(</span><span class="nx">ctx</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="c1">// Create a new object to persist</span>
<span class="kd">const</span> <span class="nx">newDoc</span> <span class="o">=</span> <span class="p">{</span> <span class="na">userId</span><span class="p">:</span> <span class="dl">"</span><span class="s2">1234567</span><span class="dl">"</span><span class="p">,</span> <span class="na">username</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Bob Firefighter</span><span class="dl">"</span><span class="p">,</span> <span class="na">isAdmin</span><span class="p">:</span> <span class="kc">false</span> <span class="p">};</span>
<span class="c1">// Obtain the reference to the Firestore collection</span>
<span class="kd">let</span> <span class="nx">dbRef</span> <span class="o">=</span> <span class="nx">collection</span><span class="p">(</span><span class="nx">ctx</span><span class="p">.</span><span class="nx">firestore</span><span class="p">(),</span> <span class="dl">"</span><span class="s2">users</span><span class="dl">"</span><span class="p">);</span>
<span class="c1">// Write the new document to the database</span>
<span class="kd">let</span> <span class="nx">docRef</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">addDoc</span><span class="p">(</span><span class="nx">dbRef</span><span class="p">,</span> <span class="nx">newDoc</span><span class="p">)</span>
<span class="p">.</span><span class="nx">then</span><span class="p">(</span><span class="nx">resultDocRef</span> <span class="o">=></span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">resultDocRef</span><span class="p">.</span><span class="nx">id</span><span class="p">);</span>
<span class="k">return</span> <span class="nx">resultDocRef</span><span class="p">;</span>
<span class="p">})</span>
<span class="p">.</span><span class="k">catch</span><span class="p">(</span><span class="nx">error</span> <span class="o">=></span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">error</span><span class="p">(</span><span class="nx">error</span><span class="p">);</span>
<span class="p">});</span>
<span class="p">})</span>
<span class="p">.</span><span class="nx">then</span><span class="p">(()</span> <span class="o">=></span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="dl">"</span><span class="s2">Initialization of test data successfully completed</span><span class="dl">"</span><span class="p">);</span>
<span class="p">})</span>
<span class="p">.</span><span class="k">catch</span><span class="p">(</span><span class="nx">error</span> <span class="o">=></span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">error</span><span class="p">(</span><span class="dl">"</span><span class="s2">Initialization of test data failed with error: </span><span class="dl">"</span><span class="p">,</span> <span class="nx">error</span><span class="p">);</span>
<span class="p">});</span>
</code></pre></div></div>
<h1 id="step-5-observe-your-data-in-the-firestore-emulator-web-ui">Step 5: Observe your data in the Firestore emulator web UI</h1>
<p>Once you’ve started the Firebase emulators, you get access to a web UI running at http://localhost:4000. Just enter the URL into your browser and navigate to the Firestore tab. There you can inspect which data is currently loaded into your database. Of course, if you wipe your data in a <em>beforeEach</em> or <em>afterAll</em> function in your test case, you will never see any data appearing, as it is cleaned out as soon as your test case finishes.</p>
<h1 id="step-6-write-some-test-cases">Step 6: Write some test cases</h1>
<p>We are ready to create some actual test cases. We have two options here. We can either rely on <a href="https://jestjs.io/docs/asynchronous">Jest’s own asynchronous expect functions</a>, or we rely on <a href="https://firebase.google.com/docs/reference/emulator-suite/rules-unit-testing/rules-unit-testing">rules-unit-testing’s assertSucceeds or assertFails functions to test permission granted or denied situations</a>. We did not really go into how to write the security rules and validation rules so far, but let’s imagine there is one <a href="https://firebase.google.com/docs/reference/rules/rules.String#matches">validation rule deployed on the server applying a regular expression that checks whether a provided email is valid</a>. Typically, immediately after a user is registered in Firestore, a new request is sent to the server to create a new user object with some additional user information.</p>
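<p>To make this concrete, such a server-side validation rule might look roughly like the following sketch. The collection name and the regular expression are illustrative assumptions, not rules taken from a real project:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rules_version = '2';
service cloud.firestore {
  match /databases/{database}/documents {
    match /users/{userId} {
      // Only allow creating a user document whose email field is a string
      // that looks like a valid email address.
      allow create: if request.resource.data.email is string
                    && request.resource.data.email.matches('^[^@]+@[^@]+[.][^@]+$');
    }
  }
}
</code></pre></div></div>
<p>The following test case then asserts that a document violating this rule is rejected:</p>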
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">test</span><span class="p">(</span><span class="dl">"</span><span class="s2">username must be of type string</span><span class="dl">"</span><span class="p">,</span> <span class="k">async</span> <span class="p">()</span> <span class="o">=></span> <span class="p">{</span>
<span class="c1">// Here we attempt to send a new user doc to the server that has an invalid</span>
<span class="c1">// username type of number rather than string</span>
<span class="kd">const</span> <span class="nx">newDoc</span> <span class="o">=</span> <span class="p">{</span> <span class="na">userId</span><span class="p">:</span> <span class="dl">"</span><span class="s2">7777777</span><span class="dl">"</span><span class="p">,</span> <span class="na">email</span><span class="p">:</span> <span class="dl">"</span><span class="s2">invalid@email</span><span class="dl">"</span><span class="p">,</span> <span class="na">isAdmin</span><span class="p">:</span> <span class="kc">false</span> <span class="p">};</span>
<span class="kd">const</span> <span class="nx">ctx</span> <span class="o">=</span> <span class="nx">testEnv</span><span class="p">.</span><span class="nx">authenticatedContext</span><span class="p">({</span><span class="na">userId</span><span class="p">:</span> <span class="dl">"</span><span class="s2">1234567</span><span class="dl">"</span><span class="p">});</span>
<span class="kd">const</span> <span class="nx">dbRef</span> <span class="o">=</span> <span class="nx">collection</span><span class="p">(</span><span class="nx">ctx</span><span class="p">.</span><span class="nx">firestore</span><span class="p">(),</span> <span class="dl">"</span><span class="s2">users</span><span class="dl">"</span><span class="p">);</span>
<span class="k">await</span> <span class="nx">assertFails</span><span class="p">(</span><span class="nx">addDoc</span><span class="p">(</span><span class="nx">dbRef</span><span class="p">,</span> <span class="nx">newDoc</span><span class="p">));</span>
<span class="p">});</span>
</code></pre></div></div>
<h1 id="step-7-deploy-your-locally-tested-firestore-security-rules-to-the-production-server">Step 7: Deploy your locally tested Firestore security rules to the production server</h1>
<p>Finally, you want to <a href="https://firebase.google.com/docs/rules/manage-deploy">deploy those security rules back to the production server</a> hosted on Google Firebase by calling:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>firebase deploy --only firestore:rules
</code></pre></div></div>
<hr />
<p>References</p>
<ul>
<li>Angular: <a href="">https://angular.io/</a></li>
<li>Jest: <a href="">https://jestjs.io/</a></li>
<li>API reference of @firebase/rules-unit-test node package: <a href="">https://firebase.google.com/docs/reference/emulator-suite/rules-unit-testing/rules-unit-testing</a></li>
<li>Firebase CLI docs: <a href="">https://firebase.google.com/docs/cli</a></li>
<li>Sample projects howo to test Firestore security rules: <a href="">https://github.com/firebase/quickstart-testing/tree/master/unit-test-security-rules</a></li>
<li>Javascript API reference of @firebase/firestore node package: <a href="">https://firebase.google.com/docs/reference/js/firestore_</a></li>
<li>Using Firebase Firestore local emulator suite to test Firestore security rules: <a href="">https://firebase.google.com/docs/rules/emulator-setup</a></li>
<li>RulesTestEnvironment API reference: <a href="">https://firebase.google.com/docs/reference/emulator-suite/rules-unit-testing/rules-unit-testing.rulestestenvironment</a></li>
<li>Cloud Firestore Security Rules API reference: <a href="">https://firebase.google.com/docs/reference/rules/index-all</a></li>
<li>Another useful tutorial that is still using Firebase web v8 rather than v9: <a href="">https://medium.com/firebase-developers/develop-your-firestore-with-tdd-unit-testing-security-rules-afefb0d772c4</a></li>
</ul>
Do not use Karma & Jasmine for unit testing Firestore Security Rules in your Angular app!2022-12-08T00:00:00+01:00http://fabian-kostadinov.github.io/2022/12/08/do-not-use-karma-and-jasmine-to-write-unit-tests-for-firestore-security-rules-in-angular-14<p>I’m writing an Angular 14 app with Google Firebase Firestore as a database. I assumed that developing the Firestore Security Rules with Firebase web v9 by writing Karma/Jasmine unit tests should not be too hard. After all, Karma is Angular 14’s default test runner, no? Turns out that using Karma to develop or test Firestore Security Rules is a bad idea. You should use Jest (or maybe Mocha) instead.<span class="more"></span></p>
<h1 id="why-using-karma-is-a-bad-idea-when-writing-unit-tests-for-firestore">Why using Karma is a bad idea when writing unit tests for Firestore</h1>
<p>Angular 14 comes with <a href="https://karma-runner.github.io/">Karma</a> & <a href="https://jasmine.github.io/">Jasmine</a> as its built-in testing framework. To be more precise: Karma is the test runner framework, and Jasmine the actual behaviour-driven unit testing framework. Of course you can switch to a different setup like Jest or Mocha, but if Karma is already the default framework it’s just natural to take this as a starting point for writing unit tests. <br />
The <a href="https://karma-runner.github.io/6.4/intro/how-it-works.html">Karma documentation</a> tells us a crucial detail:</p>
<blockquote>
<p>Karma is essentially a tool which spawns a web server […]
After starting up, Karma loads plugins and the configuration file, then starts its local web server which listens for connections. Any browser already waiting on websockets from the server will reconnect immediately. As part of loading the plugins, test reporters register for ‘browser’ events so they are ready for test results.
Then karma launches zero, one, or more browsers […]</p>
</blockquote>
<p>If your goal is to write and unit test Firestore Security Rules locally using the Firebase emulators, then this is bad news. Why? In short: because your unit test will end up running in a browser (!) process trying to connect to a server process, and as such is not allowed to access the local file system. And that is most likely exactly what you need to do when developing and unit testing Firestore Security Rules. To understand this we need to highlight a few things about Firestore Security Rules and the Firebase emulators.</p>
<p>Some time ago Google introduced <a href="https://firebase.google.com/docs/rules/emulator-setup">Firebase emulators</a> to develop your frontend SPA application against the Firestore backend locally. There are different reasons why this is convenient for every developer, among them no need to make changes to an existing production database, no costs for running queries on Google’s servers, and speedier development without the need for an internet connection. <br />
Using Firebase tools we can put all our Firestore security rules in a local <code>firestore.rules</code> file, test the security rules locally, and only deploy that file to the server once it’s ready. <br />
According to the documentation on how to unit test Firestore Security Rules you can initialise a test environment like so:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span> <span class="nx">testEnv</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">initializeTestEnvironment</span><span class="p">({</span>
<span class="na">projectId</span><span class="p">:</span> <span class="dl">"</span><span class="s2">demo-project-1234</span><span class="dl">"</span><span class="p">,</span>
<span class="na">firestore</span><span class="p">:</span> <span class="p">{</span>
<span class="na">rules</span><span class="p">:</span> <span class="nx">fs</span><span class="p">.</span><span class="nx">readFileSync</span><span class="p">(</span><span class="dl">"</span><span class="s2">firestore.rules</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">utf8</span><span class="dl">"</span><span class="p">)</span>
<span class="p">},</span>
<span class="p">});</span>
</code></pre></div></div>
<p>The problem is: This does not work in a browser process. The browser’s security disallows you from accessing the local file system, hence you cannot use <code>fs</code> to read out a file.</p>
<p>You could of course put all the security rules in a local string variable and assign that in the parameter object given to <code>initializeTestEnvironment</code> instead of reading them from the file system. But if your security rules grow in complexity, this is a very poor workaround - not least because you actually want to deploy the entire <code>firestore.rules</code> file to the server using the corresponding Firebase CLI command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>firebase deploy --only firestore:rules
</code></pre></div></div>
<p>Thus, you really do not want to keep your security rules in a simple string variable within your unit test. <br />
Equally unsatisfying is writing and testing your security rules in the Firestore web UI, as it is slow and cumbersome if your rules are anything but trivial.</p>
<p>And that’s why I recommend switching to Jest instead. Jest works differently than Karma behind the scenes. I’ll describe the development process in more detail in another article.</p>
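<p>Until then, here is a minimal sketch of a matching Jest configuration. The testMatch pattern is an assumption about where the test files live, so adapt it to your own project layout:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// jest.config.js
// The crucial setting is testEnvironment: unlike Karma, Jest runs the unit
// tests in a Node process, which is allowed to access the local file system -
// so fs.readFileSync("firestore.rules", "utf8") just works.
module.exports = {
  testEnvironment: "node",
  testMatch: ["**/firebase-test-cases/**/*.test.js"]
};
</code></pre></div></div>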
How I found and cashed in a bitcoin wallet from 20112021-01-07T00:00:00+01:00http://fabian-kostadinov.github.io/2021/01/07/how-i-found-and-cashed-in-a-bitcoin-wallet-from-2011<p>A few days ago I found a Bitcoin wallet.dat on a Macbook from 2011. Here is how I managed to cash in on it.<span class="more"></span></p>
<h1 id="a-bitcoin-wallet-from-2011">A Bitcoin wallet from 2011…</h1>
<p>In 2011 my employer sent me and a co-worker to the Netherlands for two weeks to work on a research project. There, for the first time, someone told me about Bitcoin. While I found the ideas fascinating I could not make a lot of sense of it. And there was certainly no way to ever buy anything with it or exchange bitcoins for real fiat money. Nevertheless, I installed Bitcoin Core v0.3.19 BETA on my MacBook Pro, and somehow managed to get my hands on a fraction of a bitcoin. Today, the value of that fraction is enough to buy me a new laptop or go on vacation for a few days. I don’t quite remember whether I got those fractions through mining, or whether one of the members of the research project transferred them to my account as a small gift. Certainly, back then it did not have a lot of value.</p>
<h1 id="appears-again-in-2021">…appears again in 2021.</h1>
<p>Fast forward. I am still using said MacBook from 2011 to watch movies every now and then, as it has a built-in DVD player and actually still works more or less flawlessly today. I never bothered to re-format its hard disk, and that was my luck. While opening the DVD player application I suddenly spotted an installation of Bitcoin Core that I had entirely managed to ignore throughout all those years. I started the application and found an address. With 0 bitcoins. I browsed a bit further and realized that there was also a second address, not displayed with the default settings, and that second address claimed to contain said fraction of a bitcoin.</p>
<p>I was thrilled. I went to the internet to check. The address indeed contained some fractions of a bitcoin. I searched my file system for a wallet.dat file and got lucky. There it was, under <code>~/Library/Application Support/Bitcoin/</code>.</p>
<p>First, I made sure all my Wifi and Bluetooth connections on the laptop were turned off, and the laptop was not connected to the internet. Better safe than sorry. Second, I created a safety copy of the entire directory and put it on a memory stick. Third, I removed all access rights from the wallet.dat file using <code>sudo chmod a-rwx wallet.dat</code> to ensure nobody, except myself, could by mistake or with bad intent read out or modify the wallet.</p>
<p>And then the real work began. Note: I have never really looked into the whole Bitcoin affair in depth. I’m actually skeptical that it is as secure as many people believe, simply because the big miners form something like <a href="https://www.blockchain.com/pools?">an oligopoly</a>. <a href="https://www.forbes.com/sites/rogerhuang/2021/12/29/the-chinese-mining-centralization-of-bitcoin-and-ethereum/?sh=55c0871a2f66">Some argue</a> the large-scale bitcoin miners in China/Asia may not be able or willing to manipulate the blockchain despite the fact that there are not too many of them left. But that’s a different topic. What I wanted was to get cash out of the BTCs in my wallet. How is that done?</p>
<p>The first problem I faced was that I no longer considered the installed version of Bitcoin Core v0.3.19 BETA secure, irrespective of whether it would still have worked or not (probably not, but I did not even bother to check). My laptop had not received any updates in years either, so I did not consider the laptop secure either.</p>
<p>And, to complicate things, I was not even sure whether my wallet was encrypted or not, and if so, what the password would have been. I might have been able to guess it, but I was not sure.</p>
<p>I own a new MacBook Pro, so in theory I could have installed a new version of Bitcoin Core there and then tried to import the old wallet.dat. But Bitcoin Core these days requires at least 320 GB of disk space, and I only have 512 GB available, with roughly half of it already filled. So there simply was not enough disk space available.</p>
<p>After much reading I decided to install the <a href="https://electrum.org/">Electrum</a> wallet instead. The difference between Bitcoin Core and Electrum is that the latter is a so-called <a href="https://en.bitcoinwiki.org/wiki/Simplified_Payment_Verification">simplified payment verification (SPV) wallet</a>. In short, Electrum does not need to download the entire 320 GB of blockchain history, yet is still a fully functional wallet that is sufficiently secure for the majority of purposes, although not quite as secure as the Bitcoin Core wallet. At least for the fraction of a bitcoin I owned it seemed secure enough.
I was a bit paranoid, so after downloading it I even verified the signature using <a href="https://gpgtools.org/">GPG Suite</a> to ensure it hadn’t been tampered with. Apparently, some time ago there existed a fake website called electrum.li which offered a manipulated version of Electrum, so some paranoia might actually be justified.</p>
<p>The next problem was that I had no clue how to import my 2011 Bitcoin wallet.dat into Electrum. I searched around and found <a href="https://medium.com/@simonvc/i-found-an-old-bitcoin-wallet-1aeb2f32387a">this medium.com post from 2017</a> which in the end turned out to be quite helpful, yet left out some details. From what I understood from this site and other related material, there was simply no way to import my old wallet.dat file into Electrum; they were incompatible.</p>
<p>Instead, it was suggested to use <em>pywallet</em>, a Python 2.7 program that can read old wallet.dat files from 2011 such as mine. The author of the referenced medium article had used a version of <em>pywallet</em> available from <a href="https://github.com/jackjack-jj/pywallet/blob/master/pywallet.py">https://github.com/jackjack-jj/pywallet/blob/master/pywallet.py</a>. I cloned it into my local file system…</p>
<p>…and faced the next problem. According to the <a href="https://github.com/jackjack-jj/pywallet">installation instructions</a> it was required to pip install several Python 2.7 libraries that were entirely straightforward to install. My version of Mac OS X Catalina (10.15.7) ships with Python 2.7. However, I did not want to fiddle around with that version, to avoid overwriting system libraries with newer versions by mistake and potentially causing some Mac OS X processes to fail. Unfortunately, Python 2.7 does not properly support virtual environments. So I decided to follow the MacPorts route.</p>
<p>This is from pywallet’s installation instructions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Mac OS X:
1. Install MacPorts from http://www.macports.org/
2. sudo port install python27 py27-twisted py27-pip py-bsddb python_select
3. sudo port select --set python python27
4. sudo easy_install ecdsa
</code></pre></div></div>
<p>You might also run into issues if you don’t have <a href="https://apps.apple.com/us/app/xcode/id497799835?mt=12">xcode</a> installed, which I fortunately had, so I did not have to bother about that any further.</p>
<p>I decided not to actually switch the Python version with MacPorts, but simply directly refer to the installed Python 2.7 version that is different from my system’s pre-installed one. This new installation was to be found in <code>/opt/local/Library/Frameworks/Python.framework/Versions/2.7/</code>, and there was a python2 executable in the <code>./bin</code> folder.</p>
<p>On my new laptop I had created a copy of my old laptop’s whole directory with the wallet.dat file. Then I ran pywallet:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/opt/local/Library/Frameworks/Python.framework/Versions/2.7/bin/python2 ~/<path to pywallet>/pywallet.py <span class="nt">--dumpwallet</span> <span class="nt">--datadir</span><span class="o">=</span>~/<path to copy of Bitcoin wallet directory> <span class="o">></span> ~/wallet.dat.json
</code></pre></div></div>
<p>I was not sure this would work, because the wallet.dat might have been password protected. I decided to try my luck…</p>
<p>…and was lucky! It worked like a charm. As it turned out, most wallets from Bitcoin Core from 2011 are actually not encrypted. Apparently, encryption was not enforced back then by Bitcoin Core v0.3.19 BETA, and I was simply able to open my json file and read its content.</p>
<p>The json file contained lots and lots of different addresses, including their private and public keys. What a mess. A typical entry looked like so:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"addr"</span><span class="p">:</span><span class="w"> </span><span class="s2">"<this is an address>"</span><span class="p">,</span><span class="w">
</span><span class="nl">"compressed"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
</span><span class="nl">"hexsec"</span><span class="p">:</span><span class="w"> </span><span class="s2">"<64 hex char string>"</span><span class="p">,</span><span class="w">
</span><span class="nl">"private"</span><span class="p">:</span><span class="w"> </span><span class="s2">"<558 hex char string>"</span><span class="p">,</span><span class="w">
</span><span class="nl">"pubkey"</span><span class="p">:</span><span class="w"> </span><span class="s2">"<130 hex char string>"</span><span class="p">,</span><span class="w">
</span><span class="nl">"reserve"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
</span><span class="nl">"sec"</span><span class="p">:</span><span class="w"> </span><span class="s2">"<51 char string>"</span><span class="p">,</span><span class="w">
</span><span class="nl">"secret"</span><span class="p">:</span><span class="w"> </span><span class="s2">"<64 char string>"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>After plenty of reading up I finally understood that Electrum requires the 51 character string in the “sec” field. This is in essence a <em>wallet import format</em> (WIF) Base58 encoded private key. In my situation, still having access to the old Bitcoin Core installation from 2011, I also knew exactly which address contained the bitcoins. Had I not known that, I would have needed to go through the pain of extracting all addresses with a script and checking each of them against some public website or service to see which one actually contains the bitcoins. This is described in merely superficial detail in the already mentioned <a href="https://medium.com/@simonvc/i-found-an-old-bitcoin-wallet-1aeb2f32387a">medium.com article</a>. Pywallet actually contains a command for this purpose, where the blockchain address corresponds to the value of the “addr” field above.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python2 pywallet.py --balance=<blockchain address>
</code></pre></div></div>
<p>Also, you might need to build in a small waiting time between requests to not get blocked by the blockchain information provider service.</p>
<p>In my case I could simply search for the corresponding entry in the json file.</p>
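<p>If you are not in that lucky position, a small script can at least list all candidate addresses together with their WIF keys from the dump. Here is a sketch in Node.js; note that the top-level “keys” array is an assumption about pywallet’s dump format, so inspect your own json file first:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// list-keys.js - lists all addresses and WIF private keys from a pywallet dump.
// Assumption: the dump contains a top-level "keys" array whose entries carry
// the "addr" and "sec" fields shown above. Verify this against your own dump!
const fs = require('fs');

const dump = JSON.parse(fs.readFileSync('wallet.dat.json', 'utf-8'));
for (const entry of dump.keys || []) {
  console.log(entry.addr, entry.sec);
}
</code></pre></div></div>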
<p>With that information I could now start my Electrum application. I had installed Electrum v4.x. After starting, I first selected <em>create new wallet</em>, provided a name, and then selected <em>Import Bitcoin addresses or private keys</em>. At some point in the process I also had to select a strong password, if I remember correctly.</p>
<p><img src="/public/img/2021-01-07-electrum-import-private-key.png" alt="Import private key into Electrum" /></p>
<p>In the input field I only needed to enter the 51 character long WIF Base58 encoded private key, then I could click the <em>next</em> button.</p>
<p>Now I was ready. What confused me first was that Electrum does not show the addresses themselves; you need to explicitly enable them in the menu under <em>> View > Show Addresses</em>.</p>
<p><img src="/public/img/2021-01-07-electrum-show-addresses.png" alt="Show addresses in Electrum" /></p>
<p>Great. No import of 320 GB of data needed, at only a minimal loss of trust. (It might be wise for you to read up on Electrum’s security first, which servers it connects to and what further options it offers you.)</p>
<p>Now I needed to find an address to send my BTCs to, then exchange the BTCs for my local currency, and finally withdraw the money. The situation has changed drastically since 2011 - but it’s still not a pain-free process.</p>
<p>First, I checked some local bitcoin ATMs I had read about in the press. Well, if you are prepared to pay outrageous commissions of 3.5% - 5.7% of the amount withdrawn PLUS additional fees for whatever PLUS accept that sometimes the ATMs are just empty without notifying you PLUS accept that some of them don’t let you withdraw more than ca. 1000 CHF at a time PLUS some random error messages, then this might be a truly great option for you. For me it wasn’t.
I then began researching trading platforms and how they work. A co-worker suggested I try out <a href="https://lykke.com/">Lykke</a>, claiming they did not charge extra fees to transfer fiat money to any bank account through the SWIFT payment system. I briefly also checked out Bitcoinsuisse.com, who notified me at some stage during the registration process that the minimum deposit is 10k CHF. Thanks, but no thanks. I also checked out kraken.com but finally decided to set up an account on lykke.com.
The registration process went more or less smoothly, with just some minor hiccups: the free email address I had been using for years was silently rejected without any error message being displayed anywhere. This caused me to scratch my head a bit, but ultimately I settled for a gmail address instead and that worked. But that only got me the very basic account with which you can do essentially nothing. To be able to trade bitcoins for fiat money and then withdraw it, I had to provide a photo of my identity card, upload a bank statement as proof, take a selfie, give them my fingerprints and a DNA sample, and I was good to go. Well, the last two are an exaggeration, but you get the drill.</p>
<p>With all that done I waited for ca. 4 hours, then received another email that my application was accepted and voilà.</p>
<p>The web tool was really easy to use. There was a blockchain address given to me to send my bitcoins to, and then, with much trembling and praying, from my Electrum application I sent all my bitcoins to said blockchain address in my Lykke account. This cost me a small fraction of my BTCs, in accordance with how the whole blockchain protocol for bitcoin works. (If you don’t know what I’m talking about here, you may want to read up on <a href="https://bitcoinelectrum.com/how-to-manually-set-transaction-fees/">Electrum ETA</a>. Transactions on the blockchain are not free, but they are not too expensive either.) Maybe I should have started by sending only a fraction of it, I pondered afterwards while waiting for the blockchain to accept my payment. But after ca. 30 minutes I had enough confirmations and the bitcoins arrived at my Lykke account. :D</p>
<p>The next step was to trade them from BTC to CHF. The trading tool is easy to use if you are used to that kind of thing. If not, I highly recommend you get some help. There probably is not too much you can do wrong - other than inadvertently buying or selling the wrong crypto-currency or the wrong type of fiat money. The trade took ca. 30 minutes to settle, with many partial fills until someone purchased the big rest of my advertised BTCs.</p>
<p>And now I had fiat money in my Lykke account. I haven’t yet tried transferring it via SWIFT to my bank account, but the functionality is really simple, and from what I’ve understood no additional charges should arise. In comparison to the ATMs the whole process was nearly free.</p>
Basics of vector algebra2019-12-27T00:00:00+01:00http://fabian-kostadinov.github.io/2019/12/27/basics-of-vector-algebra<p>Understanding vector algebra is a prerequisite to selecting meaningful distance metrics for text embeddings. For the fun of it, let’s recall some of the basics.<span class="more"></span></p>
<h1 id="basics-of-vector-algebra">Basics of vector algebra</h1>
<p>Let <strong>p</strong> and <strong>q</strong> be each a n-dimensional vector in a n-dimensional Euclidean space.</p>
<h2 id="addition-and-subtraction-of-vectors">Addition and subtraction of vectors</h2>
<p>Addition and subtraction of vectors of equal length is quite straightforward. Both operations produce a new vector with the same number of components.
<img src="/public/img/2019-12-27-addition-subtraction.png" alt="Addition and Subtraction Formula" title="Addition and Subtraction Formula" /></p>
<h2 id="norms-of-a-vector">Norms of a vector</h2>
<p>The “length” or “magnitude” of a vector can be defined in various ways; we must at least distinguish between the level 1 and level 2 norms.</p>
<h3 id="level-1-norm">Level 1 Norm</h3>
<p><strong>Level 1 Norm (L1)</strong> is the sum of the absolute values of the vector’s components.
<img src="/public/img/2019-12-27-level1-norm.png" alt="Level 1 Norm Formula" title="Level 1 Norm Formula" />
The level 1 norm is always a scalar number, i.e. a single value. For real valued vectors it cannot take a negative value, and it can only be 0 when all the vector’s components are 0.</p>
<h3 id="level-2-norm-euclidean-norm">Level 2 norm (Euclidean norm)</h3>
<p><strong>Level 2 Norm (L2)</strong>, also called the <strong>Euclidean Norm</strong>, is the square root of the sum of the squared (absolute) values of the vector’s components.
<img src="/public/img/2019-12-27-level2-norm.png" alt="Level 2 Norm Formula" title="Level 2 Norm Formula" />
The level 2 norm is always a scalar number, i.e. a single value. For real valued vectors it cannot take a negative value, and it can only be 0 when all the vector’s components are 0.</p>
<p>Note that in some cases the L1 norm is also written as |<strong>p</strong>| and the L2 norm as ||<strong>p</strong>|| without the extra subscript. The L2 norm is so common that it is often referred to simply as “the norm” without indicating that the L2 norm is referred to rather than L1 norm.</p>
<h3 id="generalized-norm">Generalized norm</h3>
<p>Besides L1 and L2, we can in principle also calculate L3, L4 and even L∞ norms. We can generalize the above formulas to the following one:
<img src="/public/img/2019-12-27-levelk-norm.png" alt="Level k Norm Formula" title="Level k Norm Formula" /></p>
<h2 id="multiplication-of-vectors">Multiplication of vectors</h2>
<p>Vectors can be multiplied in different ways, the most basic ones are the scalar multiplication and the dot product. Another relatively common multiplication is the cross product (also called vector product).</p>
<h3 id="scalar-multiplication">Scalar multiplication</h3>
<p>The <strong>scalar multiplication</strong> is a multiplication of a vector <strong>p</strong> with a single number <em>c</em>. Scalar multiplication always returns a vector with the same number of components as the original. Geometrically speaking, it can be interpreted as a re-scaling of the vector’s length (it can even make the vector point in the opposite direction, thus returning the <em>inverse vector</em>), but beyond that it does not change the vector’s fundamental directionality.
<img src="/public/img/2019-12-27-scalar-multiplication.png" alt="Scalar Multiplication Formula" title="Scalar Multiplication Formula" />
Scalar multiplication with 0 returns the somewhat boring <strong>0</strong> vector: 0 <strong>p</strong> = <strong>0</strong>.</p>
<h3 id="element-wise-multiplication">Element-wise multiplication</h3>
<p>The <strong>Hadamard product</strong>, also called <strong>element-wise multiplication</strong>, is a component-by-component multiplication of two vectors <strong>p</strong> and <strong>q</strong> of <em>n</em> dimensions each. Element-wise multiplication returns a vector with the same dimensionality as the two originals. Giving a <a href="https://math.stackexchange.com/questions/2754366/geometric-interpretation-of-element-wise-vector-multiplication">geometric explanation</a> is tricky, so I will skip that here. Its definition, on the other hand, is quite simple.
<img src="/public/img/2019-12-27-element-wise-multiplication.png" alt="Element-Wise Multiplication Formula" title="Element-Wise Multiplication Formula" /></p>
<h3 id="dot-product">Dot product</h3>
<p>The <strong>dot product</strong> (also called <strong>scalar product</strong>) is a multiplication of two vectors <strong>p</strong> and <strong>q</strong> of same dimensionality. Algebraic dot product as expressed by me:</p>
<blockquote>
<p>The [algebraic] dot product of two vectors can be defined as the sum of all multiplied components of two vectors.</p>
</blockquote>
<p>Geometric dot product as expressed by <a href="https://en.wikipedia.org/wiki/Multiplication_of_vectors">Wikipedia</a>:</p>
<blockquote>
<p>The [geometric] dot product of two vectors can be defined as the product of the magnitudes of the two vectors and the cosine of the angle [θ] between the two vectors.
<img src="/public/img/2019-12-27-dot-product.png" alt="Dot Product Formula" title="Dot Product Formula" /></p>
</blockquote>
<p>As you can see there exists both an algebraic and a geometric version of the dot product. You can find a mathematical proof online in various places that the two indeed lead to the same results, i.e. algebraic dot product = geometric dot product. The dot product is very important. It contains information about the angle (θ) between two vectors. If the vectors are orthogonal to each other, the dot product is 0. If they point in the same direction (i.e. when they are parallel), the dot product is as positive as it gets. If they point in opposite directions (again parallel to each other), the dot product is as negative as it gets. (By the way, many math libraries require the angle θ to be converted into radians before passing it to the cosine function.)</p>
<p>Did you also notice what I just claimed in the above formula? I wrote: <strong>p</strong>∙<strong>q</strong> = <<strong>p</strong>,<strong>q</strong>>. This means: the dot product of two vectors <strong>p</strong>∙<strong>q</strong> is equal to their inner product <<strong>p</strong>,<strong>q</strong>>. This rule generally holds for the text embedding algorithms I’m familiar with, but if we want to be very precise, the inner product may be defined differently from the dot product, especially once you leave the world of Euclidean spaces. See <a href="https://math.stackexchange.com/questions/476738/difference-between-dot-product-and-inner-product">here</a> and <a href="https://math.stackexchange.com/questions/3188850/alternate-inner-products-on-euclidean-space">here</a>. (Don’t worry, we won’t dive into such territory here.)</p>
<p>We can also rewrite the geometric interpretation of the dot product formula to calculate the angle θ by its vectors:
<img src="/public/img/2019-12-27-angle-between-vectors-formula.png" alt="Angle between vectors" title="Angle between vectors" /></p>
<h3 id="cross-product">Cross product</h3>
<p>The <strong>cross product</strong> (also called <strong>vector product</strong>) is also a multiplication of two vectors of the same dimensionality, but a different one than the dot product. It is only defined for vectors with at least 3 components, i.e. it cannot be applied in 1- or 2-dimensional Euclidean spaces. (The following formula gives the algebraic definition for 3 dimensions only.)
<img src="/public/img/2019-12-27-cross-product-formula.png" alt="Cross Product" title="Cross Product" />
The new vector <strong>z</strong> (also often denoted as <strong>n</strong>) is the <strong>unit vector</strong> that is orthogonal to the planar area spanned by the vectors <strong>p</strong> and <strong>q</strong>. A unit vector has a magnitude (L2 norm) of 1, i.e. ||<strong>z</strong>||<sub>2</sub> = 1. Multiplying with <strong>z</strong> ensures that the cross product vector points in the “right” direction. There is also a geometric version of the cross product, usually used to determine the angle θ between two vectors, but it is not of the form written above. As it is of less importance to my concerns (text embeddings!) I won’t go into details. You can find a <a href="http://tutorial.math.lamar.edu/Classes/CalcII/CrossProduct.aspx">short introduction for 3D spaces here</a>, a <a href="https://en.wikipedia.org/wiki/Cross_product">nice video illustration on Wikipedia</a>, and a <a href="https://math.stackexchange.com/questions/185991/is-the-vector-cross-product-only-defined-for-3d">short discussion on how to generalize to more dimensions on Stackexchange</a>.</p>
<p>We could also say that the dot product focuses on interactions of vector components in the same dimension, whereas the cross product focuses on vector components in distinct dimensions. (<a href="https://betterexplained.com/articles/cross-product/">More explanation here</a>.)</p>
<h2 id="division-of-vectors">Division of vectors</h2>
<p>Logically speaking, division of vectors should be the inverse to multiplication, no? Immediately questions arise. As we have seen, there are at least three different types of multiplication.</p>
<h3 id="division-by-a-scalar">Division by a scalar</h3>
<p><strong>Division of a vector by a scalar</strong> is rather simple - as long as the scalar is not zero! In fact, we could redefine division by a scalar as multiplication with 1/scalar (where scalar ≠ 0). Thus:
<img src="/public/img/2019-12-27-division-by-scalar.png" alt="Division by Scalar Formula" title="Division by Scalar Formula" /></p>
<h3 id="element-wise-division">Element-wise division?</h3>
<p><strong>Element-wise division</strong> is already more complicated. In theory it could be defined as long as the divisor vector does not contain any elements that are zero. In practice, however, this is much harder to control than when dividing by a scalar, so we won’t pursue it any further.</p>
<h3 id="division-for-inverse-of-dot-and-cross-product">Division for inverse of dot and cross product?</h3>
<p>And, what about division as the inverse of the dot and cross product? Setting the potential problem of division by zero aside for a moment, it turns out such a thing does not exist (see e.g. <a href="https://math.stackexchange.com/questions/246594/what-is-vector-division">here</a> or <a href="https://www.quora.com/Can-we-divide-a-vector-by-a-vector-and-why">here</a> or <a href="http://mathworld.wolfram.com/VectorDivision.html">here</a>).</p>
<h2 id="vector-unit-normalization">Vector unit-normalization</h2>
<p>A (non-zero) vector’s length can be unit-normalized. The result is a (rescaled) vector of the same dimensionality as the original with a magnitude (L2 norm) of 1. The new vector is called a “unit vector” or “normalized vector”. Note that we divide each component of <strong>p</strong> by the scalar value of <strong>p</strong>’s magnitude. Hence, ||<strong>p̂</strong>||<sub>2</sub> = 1.
<img src="/public/img/2019-12-27-unit-vector-formula.png" alt="Unit Vector Formula" title="Unit Vector Formula" /></p>
<h1 id="similarity-and-distance-measures-for-two-vectors">Similarity and distance measures for two vectors</h1>
<p>When working with text embeddings a common need is to measure the (dis-)similarity or distance between two vectors. A very common problem is to find the nearest neighbours of a given vector: which of all my other vectors are most similar to it? To answer this question we first have to establish a distance or similarity measure. We will continue to operate in Euclidean space only; many other distance measures exist for non-Euclidean spaces, but they are of no concern here. So, let’s look into two extremely popular distance/similarity measures, the <strong>Euclidean distance</strong> and the <strong>cosine similarity</strong>.</p>
<h2 id="euclidean-distance">Euclidean distance</h2>
<p>Remember that in a Euclidean space we can understand a given pair of Cartesian coordinates as either coordinates, or alternatively, as vectors from the origin to the coordinates? For example, in a 2d-space where we have the coordinates <em>p</em> = (1, 5) and <em>q</em> = (4, 3) we could interpret the coordinates <em>p</em> and <em>q</em> also as vectors <strong>p</strong> = ((0,0), (1,5)) and <strong>q</strong> = ((0,0), (4,3)). Therefore, in Euclidean space, we can interpret the distance between p and q in two different ways - either as a distance between Cartesian coordinates <em>p</em> and <em>q</em> or as a distance between two vectors <strong>p</strong> and <strong>q</strong>. Calculating the distance between two coordinates in a 2d-space is something we learned in high school using Pythagorean maths where <em>a</em><sup>2</sup> + <em>b</em><sup>2</sup> = <em>c</em><sup>2</sup> for a triangle with a right angle between <em>a</em> and <em>b</em>.</p>
<p><code>distance(p, q) = sqrt( (4-1)^2 + (3-5)^2 ) = sqrt(13)</code></p>
<p>The Euclidean distance is nothing but a generalization of this formula to <em>n</em> dimensions. Let <strong>q</strong> - <strong>p</strong> be the so-called <em>displacement vector</em> between <strong>p</strong> and <strong>q</strong>, that is, a vector pointing from coordinate <em>p</em> to <em>q</em>. Thus:</p>
<p><strong>q</strong> - <strong>p</strong> = (q<sub>1</sub> - p<sub>1</sub>, q<sub>2</sub> - p<sub>2</sub>, …, q<sub>n</sub> - p<sub>n</sub>)</p>
<p>We have already seen how we can measure the length of a vector by its Euclidean or L2 norm. Thus, the length or magnitude of the vector between the coordinates <em>p</em> and <em>q</em> is nothing but the L2 norm of the displacement vector <strong>q</strong> - <strong>p</strong>, that is ||<strong>q</strong> - <strong>p</strong>||. Furthermore, as it is a length measure it does not matter whether we calculate the distance between <em>p</em> and <em>q</em> or <em>q</em> and <em>p</em>, and therefore ||<strong>q</strong> - <strong>p</strong>|| = ||<strong>p</strong> - <strong>q</strong>||.</p>
<p><img src="/public/img/2019-12-27-euclidean-distance-formula.png" alt="Euclidean Distance Formula" title="Euclidean Distance Formula" /></p>
<p>As you can see from the formula, the Euclidean distance is the square root of the inner product of <strong>p</strong> - <strong>q</strong> with itself (and equally of <strong>q</strong> - <strong>p</strong> with itself). Since we are using the dot product as the inner product, it turns out that the Euclidean distance is the same as the L2 norm (Euclidean norm) ||<strong>p</strong> - <strong>q</strong>||. The last two lines merely give a different style of writing.</p>
<p>The Euclidean distance is always a non-negative number. If two vectors are identical, it is 0. The higher the Euclidean distance, the further two coordinates and thus their vectors are apart.</p>
<h2 id="cosine-similarity-and-cosine-distance">Cosine similarity and cosine distance</h2>
<p>A different way to measure the similarity (and thus the distance) between two vectors is to look solely at the angle θ between them. The definition of the cosine similarity of the angle can be derived very simply from the above mentioned definition of the dot product. The relation between cosine similarity and cosine distance is simply <code>cosine distance = 1 - cosine similarity</code>. Thus, the cosine distance is defined in the range [0, 2], whereas the cosine similarity is defined in the range [-1, 1].</p>
<p><img src="/public/img/2019-12-27-cosine-distance-formula.png" alt="Cosine Distance Formula" title="Cosine Distance Formula" /></p>
<table>
<thead>
<tr>
<th>Angle</th>
<th>Cosine Similarity</th>
<th>Cosine Distance</th>
<th>Geometric Interpretation</th>
</tr>
</thead>
<tbody>
<tr>
<td>0°</td>
<td>1</td>
<td>0</td>
<td>Overlapping vectors with same directionality</td>
</tr>
<tr>
<td>90°</td>
<td>0</td>
<td>1</td>
<td>Orthogonal vectors</td>
</tr>
<tr>
<td>180°</td>
<td>-1</td>
<td>2</td>
<td>Vectors pointing into opposite directions</td>
</tr>
<tr>
<td>270°</td>
<td>0</td>
<td>1</td>
<td>Orthogonal vectors again</td>
</tr>
<tr>
<td>360°</td>
<td>1</td>
<td>0</td>
<td>Overlapping vectors with same directionality</td>
</tr>
</tbody>
</table>
<h2 id="comparison-of-euclidean-and-cosine-distance">Comparison of Euclidean and cosine distance</h2>
<p>Whereas the Euclidean distance depends on both the vectors’ magnitudes and the angle between the vectors, the cosine distance depends only on the angle but not on the magnitudes. This is an important difference between the two distance metrics.</p>
<p>However, when we unit-normalize all our vectors, the Euclidean and the cosine distance effectively fall together! Geometrically speaking, unit-normalized vectors all have the same (L2) magnitude of 1. The distance between their coordinates then depends solely on the angle between the vectors, and therefore the Euclidean distance and the cosine distance are strictly monotonically related: they produce the same nearest-neighbour ordering.</p>
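<p>To make this relationship tangible, here is a small Javascript sketch reusing the example coordinates from above. For unit-normalized vectors the squared Euclidean distance equals 2 - 2 × cosine similarity, i.e. exactly twice the cosine distance:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>const dot = (p, q) => p.reduce((acc, x, i) => acc + x * q[i], 0);
const l2 = p => Math.sqrt(dot(p, p));
const normalize = p => { const n = l2(p); return p.map(x => x / n); };
const cosineSimilarity = (p, q) => dot(p, q) / (l2(p) * l2(q));
const euclidean = (p, q) => l2(p.map((x, i) => x - q[i]));

const p = normalize([1, 5]);
const q = normalize([4, 3]);
// For unit vectors: ||p - q||^2 = 2 - 2 * cos(theta) = 2 * cosine distance
console.log(Math.pow(euclidean(p, q), 2));   // 0.5095...
console.log(2 - 2 * cosineSimilarity(p, q)); // 0.5095... - the same value
</code></pre></div></div>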
<h1 id="further-resources">Further resources</h1>
<p><em>General introduction</em>:</p>
<ul>
<li><a href="http://wiki.fast.ai/index.php/Linear_Algebra_for_Deep_Learning">http://wiki.fast.ai/index.php/Linear_Algebra_for_Deep_Learning</a></li>
</ul>
<p><em>Dot and cross product</em>:</p>
<ul>
<li><a href="https://www.math.ucla.edu/~josephbreen/Understanding_the_Dot_Product_and_the_Cross_Product.pdf">Understanding the Dot Product and the Cross Product</a> by Joseph Breen</li>
<li><a href="https://betterexplained.com/articles/cross-product/">https://betterexplained.com/articles/cross-product/</a></li>
<li><a href="https://en.wikipedia.org/wiki/Cross_product">https://en.wikipedia.org/wiki/Cross_product</a></li>
<li><a href="https://www.mathsisfun.com/algebra/vectors-cross-product.html">https://www.mathsisfun.com/algebra/vectors.html</a></li>
<li><a href="https://www.mathsisfun.com/algebra/vectors-cross-product.html">https://www.mathsisfun.com/algebra/vectors-cross-product.html</a></li>
</ul>
<p><em>Difference between dot product and inner product</em>:</p>
<ul>
<li><a href="https://math.stackexchange.com/questions/476738/difference-between-dot-product-and-inner-product">https://math.stackexchange.com/questions/476738/difference-between-dot-product-and-inner-product</a></li>
</ul>
<p><em>Unit vector</em>:</p>
<ul>
<li><a href="https://math.stackexchange.com/questions/1717682/cross-product-equation-with-sine-i-dont-understand-the-unit-vector-n">Stackexchange discussion on unit vector in cross product</a></li>
<li><a href="https://math.stackexchange.com/questions/2169077/why-does-the-cross-product-have-n-in-its-formula-a-times-b-vert-a-vert-v">Another Stackexchange discussion on the role of the unit vector in cross product</a></li>
<li><a href="https://www.khanacademy.org/computing/computer-programming/programming-natural-simulations/programming-vectors/a/vector-magnitude-normalization">https://www.khanacademy.org/computing/computer-programming/programming-natural-simulations/programming-vectors/a/vector-magnitude-normalization</a></li>
</ul>
<p><em>Euclidean versus cosine distance</em>:</p>
<ul>
<li><a href="https://cmry.github.io/notes/euclidean-v-cosine">https://cmry.github.io/notes/euclidean-v-cosine</a></li>
<li><a href="http://mlwiki.org/index.php/Cosine_Similarity">Euclidean and cosine distance for unit-normalized vectors</a></li>
</ul>
Introduction to Lucene 7 OpenNLP - Part 42018-10-07T00:00:00+02:00http://fabian-kostadinov.github.io/2018/10/07/introduction-to-lucene-opennlp-part4<p>I have written three blog posts about how to use Lucene 7 and OpenNLP to index part-of-speech tags and then use phrase queries to search on these tags. What I haven’t shown so far is what’s so cool about having such a capability.<span class="more"></span>
Imagine that you are building a search engine containing various articles. You want to know which words a particular indexed term is typically paired up with. For example, the noun “heart” could be paired up with another consecutive noun, as in “heart failure”, “heart disease” or “heart attack”. Or it could be paired up with an adjective, as in “broken heart”, “lonely heart” etc. That’s exactly what we can do now: we can mix OpenNLP POS tags and plain words in the same query. Let’s assume we are looking for triplets of this form:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>the expensive car
a slow car
a fancy automobile
a car park
the car accident
</code></pre></div></div>
<p>We can express these as four different types of queries mixing POS tags with the words car and automobile:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DT JJ car
DT JJ automobile
DT car NN
DT automobile NN
</code></pre></div></div>
<p>This would result in the following query assuming the relevant text is indexed in a field called “body”:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+(body:"DT JJ car" body:"DT JJ automobile" body:"DT car NN" body:"DT automobile NN")
</code></pre></div></div>
<p>We have already seen how such an expression can be created with a combination of phrase queries and boolean queries, or, as yet another option, with multi phrase queries.</p>
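<p>For completeness, here is a minimal sketch of how this particular query could be assembled. It builds the phrase queries directly from raw terms, bypassing query-time analysis, and assumes the field name <em>body</em> and terms indexed exactly as shown above:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

// OR the four tag/word patterns together.
BooleanQuery.Builder or = new BooleanQuery.Builder();
or.add(new PhraseQuery("body", "DT", "JJ", "car"), BooleanClause.Occur.SHOULD);
or.add(new PhraseQuery("body", "DT", "JJ", "automobile"), BooleanClause.Occur.SHOULD);
or.add(new PhraseQuery("body", "DT", "car", "NN"), BooleanClause.Occur.SHOULD);
or.add(new PhraseQuery("body", "DT", "automobile", "NN"), BooleanClause.Occur.SHOULD);
// Wrap the disjunction in a MUST clause so that at least one pattern has to match.
Query query = new BooleanQuery.Builder()
        .add(or.build(), BooleanClause.Occur.MUST)
        .build();
</code></pre></div></div>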
<p>As pointed out earlier, OpenNLP’s POS tags follow the <a href="https://www.clips.uantwerpen.be/pages/mbsp-tags">Penn Treebank II tag set standard</a>. Depending on the use case these tags might be too fine-grained. For example there are four different POS tags for nouns (NN, NNP, NNS, NNPS), several ones for verbs, for pronouns and so on. Although there exists a <a href="https://lucene.apache.org/core/7_4_0/core/index.html?overview-summary.html">WildcardQuery</a> class that allows us to build expressions such as NN*, wildcards can be computationally expensive and are therefore not the best idea in this context.</p>
<p>A better option is to use our own simplified set of POS tags in the indexing step. For every POS tag output by OpenNLP a brief check is performed and the tag is replaced by a simpler form:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NN -> NN
NNP -> NN
NNS -> NN
NNPS -> NN
VB -> VB
VBZ -> VB
...
</code></pre></div></div>
<p>Remember the class <a href="https://lucene.apache.org/core/7_4_0/core/index.html?overview-summary.html">TokenFilter</a>? We used one in our OpenNLPAnalyzer. For each input token [The][quick][brown][fox]… read from the token stream, our token filter added the POS tag to the type attribute: [DT][JJ][JJ][NN]… . We can simply wrap this filtered token stream with yet another token filter. There are different ways to implement this; I’ll leave the details as an exercise to the reader, but a minimal sketch follows below. A good start to learn more about how to write a token filter is both the implementation of class <a href="https://lucene.apache.org/core/7_4_0/analyzers-opennlp/org/apache/lucene/analysis/opennlp/OpenNLPPOSFilter.html">OpenNLPPOSFilter</a> and <a href="http://nathanchen.github.io/14457412441370.html">this short tutorial</a>.</p>
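<p>Here is what such a simplifying filter could look like. The class name <em>SimplifyPOSTagFilter</em> is made up for illustration, and the sketch assumes the POS tag arrives in the type attribute, exactly as <em>OpenNLPPOSFilter</em> sets it:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Collapses fine-grained Penn Treebank tags held in the type attribute
// (NNP, NNS, NNPS, VBZ, JJR, ...) into their coarse base form (NN, VB, JJ).
public final class SimplifyPOSTagFilter extends TokenFilter {
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

    public SimplifyPOSTagFilter(TokenStream input) {
        super(input);
    }

    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String tag = typeAtt.type();
        if (tag != null) {
            // Map all noun, verb and adjective variants to a single tag each;
            // leave punctuation and everything else untouched.
            if (tag.startsWith("NN")) typeAtt.setType("NN");
            else if (tag.startsWith("VB")) typeAtt.setType("VB");
            else if (tag.startsWith("JJ")) typeAtt.setType("JJ");
        }
        return true;
    }
}
</code></pre></div></div>
<p>In the analyzer, this filter would simply be inserted after <em>OpenNLPPOSFilter</em> (and before the <em>TypeAsSynonymFilter</em> discussed in part 2), so that only the simplified tags end up in the index.</p>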
<p>Maybe you are still wondering what exactly this all gives us. “It’s a nice use case”, you might be thinking, “but how is this relevant to anyone?” What it gives us is the ability to create taxonomies expressed as queries and then extract entities from documents. This is a pretty common use case for companies which have to process large numbers of text documents in an automated way. In fact, certain companies offering services in the space of natural language processing and AI are making money based on these or similar ideas.</p>
<h2 id="other-posts">Other posts</h2>
<h3 id="previous">Previous</h3>
<ul>
<li>Part 1: <a href="/2018/09/08/introduction-to-lucene-opennlp-part1/">Introduction to Lucene 7 OpenNLP - Part 1</a></li>
<li>Part 2: <a href="/2018/10/01/introduction-to-lucene-opennlp-part2/">Introduction to Lucene 7 OpenNLP - Part 2</a></li>
<li>Part 3: <a href="/2018/10/06/introduction-to-lucene-opennlp-part3/">Introduction to Lucene 7 OpenNLP - Part 3</a></li>
</ul>
Introduction to Lucene 7 OpenNLP - Part 32018-10-06T00:00:00+02:00http://fabian-kostadinov.github.io/2018/10/06/introduction-to-lucene-opennlp-part3<p>Now that we can do <a href="/2018/10/01/introduction-to-lucene-opennlp-part2/">searching on indexed part-of-speech tags</a> what’s still missing is a way to introduce an order of search terms. Remember: All POS tags in our query are simply <strong>OR</strong>ed together. So, how can we achieve this?<span class="more"></span>
Fortunately, this time the answer comes easily. We can use <a href="https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/search/PhraseQuery.html">PhraseQuery</a>. According to the official docs a <em>PhraseQuery</em> is…</p>
<blockquote>
<p>A Query that matches documents containing a particular sequence of terms.</p>
</blockquote>
<p>Phrase queries are useful if we want to search for a sequence of POS tags such as “DT JJ NN” (matching texts like “the big tree”, “a remarkable man” etc.). To make things even better, there is a <a href="https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/util/QueryBuilder.html">query builder</a> for this purpose:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">org.apache.lucene.util.QueryBuilder</span><span class="o">;</span>
<span class="c1">// Some code</span>
<span class="nc">String</span> <span class="n">fieldName</span> <span class="o">=</span> <span class="s">"body"</span><span class="o">;</span> <span class="c1">// or any other fieldName matching the indexed data</span>
<span class="nc">Analyzer</span> <span class="n">analyzer</span> <span class="o">=</span> <span class="o">...;</span> <span class="c1">// For example: new OpenNLPAnalyzer();</span>
<span class="nc">QueryBuilder</span> <span class="n">builder</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">QueryBuilder</span><span class="o">(</span><span class="n">analyzer</span><span class="o">);</span>
<span class="nc">Query</span> <span class="n">query</span> <span class="o">=</span> <span class="n">builder</span><span class="o">.</span><span class="na">createPhraseQuery</span><span class="o">(</span><span class="n">fieldName</span><span class="o">,</span> <span class="s">"DT JJ NN"</span><span class="o">);</span>
</code></pre></div></div>
<p>Executing the query will no longer yield documents with determiners, adjectives and nouns all mixed together, but instead yield only those indexed docs where this sequence of POS tags occurs in exactly the given order.</p>
<p>Cool stuff. What more could we want? Well, let’s assume that we want to extend our search. Rather than searching only for the string “DT JJ NN” we also want to find strings “DT NN NN” (such as “a child genius”). <br />
There are different ways to achieve this. One way would be to build two phrase queries, then build a boolean query and use an OR operator. Expressed in proper Lucene syntax, this would equate to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+(body:"DT JJ NN" body:"DT NN NN")
</code></pre></div></div>
<p><em>body</em> refers to the field to search on. The two phrase queries in the brackets are <strong>OR</strong>ed together, so either of them (or both) can match. To guarantee that at least one of them actually matches, rather than returning poorly matching results, we put a plus operator in front of the whole expression. This indicates that the bracketed query MUST produce a match in its totality.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">QueryBuilder</span> <span class="n">builder1</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">QueryBuilder</span><span class="o">(</span><span class="n">analyzer</span><span class="o">);</span>
<span class="nc">Query</span> <span class="n">phraseQuery1</span> <span class="o">=</span> <span class="n">builder1</span><span class="o">.</span><span class="na">createPhraseQuery</span><span class="o">(</span><span class="n">fieldName</span><span class="o">,</span> <span class="s">"DT JJ NN"</span><span class="o">);</span>
<span class="nc">Query</span> <span class="n">phraseQuery2</span> <span class="o">=</span> <span class="n">builder1</span><span class="o">.</span><span class="na">createPhraseQuery</span><span class="o">(</span><span class="n">fieldName</span><span class="o">,</span> <span class="s">"DT NN NN"</span><span class="o">);</span>
<span class="nc">BooleanQuery</span> <span class="n">tmp</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">BooleanQuery</span><span class="o">.</span><span class="na">Builder</span><span class="o">()</span>
<span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">phraseQuery1</span><span class="o">,</span> <span class="nc">BooleanClause</span><span class="o">.</span><span class="na">Occur</span><span class="o">.</span><span class="na">SHOULD</span><span class="o">)</span>
<span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">phraseQuery2</span><span class="o">,</span> <span class="nc">BooleanClause</span><span class="o">.</span><span class="na">Occur</span><span class="o">.</span><span class="na">SHOULD</span><span class="o">)</span>
<span class="o">.</span><span class="na">build</span><span class="o">();</span>
<span class="c1">// At this time both phrase queries are only ORed together. They SHOULD match,</span>
<span class="c1">// but this does not mean any of them MUST match. By using another combining</span>
<span class="c1">// boolean query with a MUST clause we guarantee that at least one of them</span>
<span class="c1">// matches.</span>
<span class="nc">BooleanQuery</span> <span class="n">finalQuery</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">BooleanQuery</span><span class="o">.</span><span class="na">Builder</span><span class="o">().</span><span class="na">add</span><span class="o">(</span><span class="n">tmp</span><span class="o">,</span> <span class="nc">BooleanClause</span><span class="o">.</span><span class="na">Occur</span><span class="o">.</span><span class="na">MUST</span><span class="o">).</span><span class="na">build</span><span class="o">();</span>
</code></pre></div></div>
<p>Yet another, perhaps more concise way of achieving the same result would be to use a <a href="https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/search/MultiPhraseQuery.html">MultiPhraseQuery</a>. A <em>MultiPhraseQuery</em> is…</p>
<blockquote>
<p>A generalized version of PhraseQuery, with the possibility of adding more than one term at the same position that are treated as a disjunction (OR).</p>
</blockquote>
<p>Note that when using phrase queries and boolean queries we can rely on Lucene’s <a href="http://lucene.apache.org/core/7_4_0/queryparser/index.html?overview-summary.html">QueryParser</a> class. To my knowledge, no query parser exists for multi phrase queries in Lucene 7.4. We’ll have to create our query object by instantiating query terms ourselves.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">MultiPhraseQuery</span> <span class="n">query</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">MultiPhraseQuery</span><span class="o">.</span><span class="na">Builder</span><span class="o">()</span>
<span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="k">new</span> <span class="nc">Term</span><span class="o">(</span><span class="n">fieldName</span><span class="o">,</span> <span class="s">"DT"</span><span class="o">))</span>
<span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="k">new</span> <span class="nc">Term</span><span class="o">[]</span> <span class="o">{</span><span class="k">new</span> <span class="nc">Term</span><span class="o">(</span><span class="n">fieldName</span><span class="o">,</span> <span class="s">"JJ"</span><span class="o">),</span> <span class="k">new</span> <span class="nc">Term</span><span class="o">(</span><span class="n">fieldName</span><span class="o">,</span> <span class="s">"NN"</span><span class="o">)})</span>
<span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="k">new</span> <span class="nc">Term</span><span class="o">(</span><span class="n">fieldName</span><span class="o">,</span> <span class="s">"NN"</span><span class="o">))</span>
<span class="o">.</span><span class="na">build</span><span class="o">();</span>
</code></pre></div></div>
<p>Printing this multi phrase query to the console, we see that it produces exactly the same query string as the one we constructed above using boolean queries:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+(body:"DT JJ NN" body:"DT NN NN")
</code></pre></div></div>
<p>This brought me to yet another idea. Wouldn’t it be possible to use regular expressions in order to build even more complicated query expressions? Something like this: “DT ((JJ)|(NN))+ NN”. After a little research I indeed found out that in Lucene 7 there is a class <a href="https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/search/RegexpQuery.html">RegexpQuery</a> that supports building queries following the syntax described in class <a href="https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/util/automaton/RegExp.html">RegExp</a>. Great, I thought, and implemented this:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">String</span> <span class="n">searchRegex</span> <span class="o">=</span> <span class="s">"DT JJ NN"</span><span class="o">;</span>
<span class="nc">Query</span> <span class="n">query</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">RegexpQuery</span><span class="o">(</span><span class="k">new</span> <span class="nc">Term</span><span class="o">(</span><span class="n">fieldName</span><span class="o">,</span> <span class="n">searchRegex</span><span class="o">));</span>
</code></pre></div></div>
<p>And the result? Nothing. This query actually returns 0 search results. At first, I was of course puzzled. It took me some time to realize that RegexpQuery is a subclass of the more general <a href="https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/search/AutomatonQuery.html">AutomatonQuery</a> class. And AutomatonQuery is…</p>
<blockquote>
<p>A Query that will match terms against a finite-state machine.</p>
</blockquote>
<p>You have to read this sentence very carefully. See my mistake? AutomatonQuery, and therefore also RegexpQuery, operates on individual indexed <em>terms</em> - not on <em>sequences</em> of terms like phrase queries do. Automaton and regular expression queries in Lucene can match a text like “DT JJ NN” only if it has been indexed as a single term rather than as three separate terms. WhitespaceAnalyzer, StandardAnalyzer and our own custom OpenNLPAnalyzer class all tokenize input texts relying on whitespace (plus some more logic). They all split an input sequence like “the lazy dogs” into three tokens [the][lazy][dogs], based on which our OpenNLPAnalyzer adds the three POS tags [DT][JJ][NN] as three separate tokens. To make use of RegexpQuery we would instead need [DT JJ NN] as a single token. This can certainly be done at indexing time, but it would represent a new strategy of how to index terms, and whether such a change is worth it depends on our use case. Furthermore, one should generally be careful when using regular expressions in queries. Regular expressions can be dangerous, and certain types of regexes can actually bring our server down. From the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-regexp-query.html">official Elasticsearch reference guide</a>:</p>
<blockquote>
<p>Regular expressions are dangerous because it’s easy to accidentally create an innocuous looking one that requires an exponential number of internal determinized automaton states (and corresponding RAM and CPU) for Lucene to execute.</p>
</blockquote>
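<p>To make the term-level behaviour concrete, here is a small sketch. Because the regular expression is matched against individual indexed terms, a pattern like NN.* will match the single tokens NN, NNS, NNP and NNPS, but never a three-term sequence such as “DT JJ NN”:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;

// Matches any single indexed term starting with "NN" (NN, NNS, NNP, NNPS),
// because each POS tag was indexed as a term of its own.
Query nounVariants = new RegexpQuery(new Term("body", "NN.*"));

// This, in contrast, returns nothing: no single indexed term
// equals the whole sequence "DT JJ NN".
Query noMatch = new RegexpQuery(new Term("body", "DT JJ NN"));
</code></pre></div></div>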
<h2 id="other-articles">Other articles</h2>
<h3 id="previous">Previous</h3>
<ul>
<li>Part 1: <a href="/2018/09/08/introduction-to-lucene-opennlp-part1/">Introduction to Lucene 7 OpenNLP - Part 1</a></li>
<li>Part 2: <a href="/2018/10/01/introduction-to-lucene-opennlp-part2/">Introduction to Lucene 7 OpenNLP - Part 2</a></li>
</ul>
<h3 id="next">Next</h3>
<ul>
<li>Part 4: <a href="/2018/10/07/introduction-to-lucene-opennlp-part4/">Introduction to Lucene 7 OpenNLP - Part 4</a></li>
</ul>
<h3 id="further-material">Further material</h3>
<ul>
<li><a href="https://opensourceconnections.com/blog/2013/02/21/lucene-4-finite-state-automaton-in-10-minutes-intro-tutorial/">Somewhat outdated article on automata in Lucene 4</a></li>
<li><a href="http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html">Another outdated article on multi phrase and other more advanced queries</a></li>
</ul>
Introduction to Lucene 7 OpenNLP - Part 22018-10-01T00:00:00+02:00http://fabian-kostadinov.github.io/2018/10/01/introduction-to-lucene-opennlp-part2<p>In my <a href="/2018/09/08/introduction-to-lucene-opennlp-part1/">previous post</a> I promised I’d describe how to perform searches on indexed part-of-speech data with Lucene 7 and OpenNLP. Let’s have a look. (Thanks to <a href="https://stackoverflow.com/users/10277631/koji">Koji</a> on this one!)<span class="more"></span>
We have already seen how to create an index and then add some data.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Directory</span> <span class="n">index</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">RAMDirectory</span><span class="o">();</span>
<span class="nc">OpenNLPAnalyzer</span> <span class="n">analyzer</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">OpenNLPAnalyzer</span><span class="o">();</span>
<span class="nc">IndexWriterConfig</span> <span class="n">indexWriterConfig</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">IndexWriterConfig</span><span class="o">(</span><span class="n">analyzer</span><span class="o">);</span>
<span class="nc">IndexWriter</span> <span class="n">writer</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">IndexWriter</span><span class="o">(</span><span class="n">index</span><span class="o">,</span> <span class="n">indexWriterConfig</span><span class="o">);</span>
<span class="kd">final</span> <span class="nc">String</span> <span class="n">fieldName</span> <span class="o">=</span> <span class="s">"body"</span><span class="o">;</span>
<span class="nc">Document</span> <span class="n">document1</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Document</span><span class="o">();</span>
<span class="c1">// POS tags: [DT][JJ][JJ][NN][VBD][IN][DT][JJ][NNS][.]</span>
<span class="n">document</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="k">new</span> <span class="nc">TextField</span><span class="o">(</span><span class="n">fieldName</span><span class="o">,</span> <span class="s">"The quick brown fox jumped over the lazy dogs."</span><span class="o">,</span> <span class="nc">Field</span><span class="o">.</span><span class="na">Store</span><span class="o">.</span><span class="na">YES</span><span class="o">));</span>
<span class="n">writer</span><span class="o">.</span><span class="na">addDocument</span><span class="o">(</span><span class="n">document1</span><span class="o">);</span>
<span class="nc">Document</span> <span class="n">document2</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Document</span><span class="o">();</span>
<span class="c1">// POS tags: [VB][PRP][TO][PRP][,][UH][.]</span>
<span class="n">document2</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="k">new</span> <span class="nc">TextField</span><span class="o">(</span><span class="n">fieldName</span><span class="o">,</span> <span class="s">"Give it to me, baby!"</span><span class="o">,</span> <span class="nc">Field</span><span class="o">.</span><span class="na">Store</span><span class="o">.</span><span class="na">YES</span><span class="o">));</span>
<span class="n">writer</span><span class="o">.</span><span class="na">addDocument</span><span class="o">(</span><span class="n">document2</span><span class="o">);</span>
<span class="nc">Document</span> <span class="n">document3</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Document</span><span class="o">();</span>
<span class="c1">// POS tags: [NNP][NNP][VBZ][DT][JJ][NNS][.]</span>
<span class="c1">// Note that the token [Mr.] - including the dot - results in [NNP].</span>
<span class="n">document3</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="k">new</span> <span class="nc">TextField</span><span class="o">(</span><span class="n">fieldName</span><span class="o">,</span> <span class="s">"Mr. Robot is a great TV-series."</span><span class="o">,</span> <span class="nc">Field</span><span class="o">.</span><span class="na">Store</span><span class="o">.</span><span class="na">YES</span><span class="o">));</span>
<span class="n">writer</span><span class="o">.</span><span class="na">addDocument</span><span class="o">(</span><span class="n">document3</span><span class="o">);</span>
<span class="nc">Document</span> <span class="n">document4</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Document</span><span class="o">();</span>
<span class="c1">// POS tags: [VB][PRP][TO][PRP][,][NNP][.]</span>
<span class="n">document4</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="k">new</span> <span class="nc">TextField</span><span class="o">(</span><span class="n">fieldName</span><span class="o">,</span> <span class="s">"Give them to us, Dalton!"</span><span class="o">,</span> <span class="nc">Field</span><span class="o">.</span><span class="na">Store</span><span class="o">.</span><span class="na">YES</span><span class="o">));</span>
<span class="n">writer</span><span class="o">.</span><span class="na">addDocument</span><span class="o">(</span><span class="n">document4</span><span class="o">);</span>
<span class="n">writer</span><span class="o">.</span><span class="na">close</span><span class="o">();</span>
</code></pre></div></div>
<p>The exact field name used - here it is <em>body</em> - does not really matter too much; you can use any name you want. But you must make sure that during searching further below you use the same field name again! Furthermore, it is generally recommended to use the same analyzer as you used to index your data. In our case this is OpenNLPAnalyzer. This is however not a strict must; in a real search application there are reasons why you may want to create different analyzers for indexing data and for parsing queries.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Searching for documents containing both (at least one) adjective or (at least one) noun</span>
<span class="kd">final</span> <span class="nc">String</span> <span class="n">searchPhrase</span> <span class="o">=</span> <span class="s">"JJ NN"</span><span class="o">;</span>
<span class="c1">// The maximum number of top n returned results</span>
<span class="kd">final</span> <span class="kt">int</span> <span class="n">topN</span> <span class="o">=</span> <span class="mi">10</span><span class="o">;</span>
<span class="nc">DirectoryReader</span> <span class="n">reader</span> <span class="o">=</span> <span class="nc">DirectoryReader</span><span class="o">.</span><span class="na">open</span><span class="o">(</span><span class="n">index</span><span class="o">);</span>
<span class="nc">IndexSearcher</span> <span class="n">searcher</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">IndexSearcher</span><span class="o">(</span><span class="n">reader</span><span class="o">);</span>
<span class="c1">// fieldName was specified above to be string "body"</span>
<span class="nc">QueryParser</span> <span class="n">parser</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">QueryParser</span><span class="o">(</span><span class="n">fieldName</span><span class="o">,</span> <span class="k">new</span> <span class="nc">OpenNLPAnalyzer</span><span class="o">());</span>
<span class="nc">Query</span> <span class="n">query</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="na">parse</span><span class="o">(</span><span class="n">searchPhrase</span><span class="o">);</span>
<span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">query</span><span class="o">);</span>
<span class="nc">TopDocs</span> <span class="n">topDocs</span> <span class="o">=</span> <span class="n">searcher</span><span class="o">.</span><span class="na">search</span><span class="o">(</span><span class="n">query</span><span class="o">,</span> <span class="n">topN</span><span class="o">);</span>
<span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">printf</span><span class="o">(</span><span class="s">"%s => %d hits\n"</span><span class="o">,</span> <span class="n">searchPhrase</span><span class="o">,</span> <span class="n">topDocs</span><span class="o">.</span><span class="na">totalHits</span><span class="o">);</span>
<span class="k">for</span><span class="o">(</span><span class="nc">ScoreDoc</span> <span class="nl">scoreDoc:</span> <span class="n">topDocs</span><span class="o">.</span><span class="na">scoreDocs</span><span class="o">){</span>
<span class="nc">Document</span> <span class="n">doc</span> <span class="o">=</span> <span class="n">searcher</span><span class="o">.</span><span class="na">doc</span><span class="o">(</span><span class="n">scoreDoc</span><span class="o">.</span><span class="na">doc</span><span class="o">);</span>
<span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">printf</span><span class="o">(</span><span class="s">"\t%s\n"</span><span class="o">,</span> <span class="n">doc</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="n">field</span><span class="o">));</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Here is the output:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">body:</span><span class="no">JJ</span> <span class="nl">body:</span><span class="no">NN</span>
<span class="no">JJ</span> <span class="no">NN</span> <span class="o">=></span> <span class="mi">0</span> <span class="n">hits</span>
</code></pre></div></div>
<p>Our search yielded no results! How is this possible? Well. This is where things unfortunately get a bit complicated. There’s something I haven’t told you in the <a href="/2018/09/08/introduction-to-lucene-opennlp-part1/">previous post</a>. <a href="https://opensourceconnections.com/blog/2018/08/06/intro_solr_nlp_integrations/">Elizabeth Haubert mentions it in her blog post</a>:</p>
<blockquote>
<p>Since Lucene does not yet index token types, in order to make that information available to queries, it is necessary to push the type either to a payload or as a synonym token using either TypeAsPayloadFilterFactory or TypeAsSynonymFilterFactory.</p>
</blockquote>
<p>If you are like me, you didn’t understand this piece of information when reading it the first time and simply took a whole-hearted decision to ignore it. What does this mean? Well, it means exactly what it states: for a reason known only to the developers of the Lucene library, the <em>type</em> attribute is not indexed by default. And of course this is exactly what OpenNLP uses behind the scenes. In class <em>OpenNLPPOSFilter</em> you will find these two lines:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">private</span> <span class="kd">final</span> <span class="nc">TypeAttribute</span> <span class="n">typeAtt</span> <span class="o">=</span> <span class="o">(</span><span class="nc">TypeAttribute</span><span class="o">)</span><span class="k">this</span><span class="o">.</span><span class="na">addAttribute</span><span class="o">(</span><span class="nc">TypeAttribute</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="c1">// And further below while parsing tokens:</span>
<span class="k">this</span><span class="o">.</span><span class="na">typeAtt</span><span class="o">.</span><span class="na">setType</span><span class="o">(</span><span class="k">this</span><span class="o">.</span><span class="na">tags</span><span class="o">[</span><span class="k">this</span><span class="o">.</span><span class="na">tokenNum</span><span class="o">++]);</span>
</code></pre></div></div>
<p>The <em>OpenNLPPOSFilter</em> that we were using in our <em>OpenNLPAnalyzer</em> class internally uses a <em>TypeAttribute</em>. And, as we’ve just learned, type attributes are not indexed by Lucene. So what to do? Well, the answer is given above: use either <em>TypeAsPayloadFilterFactory</em> or <em>TypeAsSynonymFilterFactory</em> instead. Now a word of caution: trying to use the former will bring you into hell’s kitchen. Missing documentation, paired with tutorials outdated by fundamental design changes of payloads between major Lucene versions, will make your hair turn white and your teeth fall out. So, use the latter. You have been warned.</p>
<p>Ok, so how can we do it? Only a small change to our <em>OpenNLPAnalyzer</em> class is needed. We simply need to wrap our <em>OpenNLPPOSFilter</em> into a <em>TypeAsSynonymFilter</em>, as done in the return statement. The class can be found in the lucene-analyzers-common library.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">package</span> <span class="nn">com.example</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">opennlp.tools.postag.POSModel</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">opennlp.tools.sentdetect.SentenceModel</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">opennlp.tools.tokenize.TokenizerModel</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.Analyzer</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.TokenFilter</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.Tokenizer</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.miscellaneous.TypeAsSynonymFilter</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.OpenNLPPOSFilter</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.OpenNLPTokenizer</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.tools.NLPPOSTaggerOp</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.tools.NLPSentenceDetectorOp</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.tools.NLPTokenizerOp</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.tools.OpenNLPOpsFactory</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.util.ClasspathResourceLoader</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.util.ResourceLoader</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.util.AttributeFactory</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.io.IOException</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">OpenNLPAnalyzer</span> <span class="kd">extends</span> <span class="nc">Analyzer</span> <span class="o">{</span>
<span class="kd">protected</span> <span class="nc">TokenStreamComponents</span> <span class="nf">createComponents</span><span class="o">(</span><span class="nc">String</span> <span class="n">fieldName</span><span class="o">)</span> <span class="o">{</span>
<span class="k">try</span> <span class="o">{</span>
<span class="nc">ResourceLoader</span> <span class="n">resourceLoader</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">ClasspathResourceLoader</span><span class="o">(</span><span class="nc">ClassLoader</span><span class="o">.</span><span class="na">getSystemClassLoader</span><span class="o">());</span>
<span class="nc">TokenizerModel</span> <span class="n">tokenizerModel</span> <span class="o">=</span> <span class="nc">OpenNLPOpsFactory</span><span class="o">.</span><span class="na">getTokenizerModel</span><span class="o">(</span><span class="s">"en-token.bin"</span><span class="o">,</span> <span class="n">resourceLoader</span><span class="o">);</span>
<span class="nc">NLPTokenizerOp</span> <span class="n">tokenizerOp</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">NLPTokenizerOp</span><span class="o">(</span><span class="n">tokenizerModel</span><span class="o">);</span>
<span class="nc">SentenceModel</span> <span class="n">sentenceModel</span> <span class="o">=</span> <span class="nc">OpenNLPOpsFactory</span><span class="o">.</span><span class="na">getSentenceModel</span><span class="o">(</span><span class="s">"en-sent.bin"</span><span class="o">,</span> <span class="n">resourceLoader</span><span class="o">);</span>
<span class="nc">NLPSentenceDetectorOp</span> <span class="n">sentenceDetectorOp</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">NLPSentenceDetectorOp</span><span class="o">(</span><span class="n">sentenceModel</span><span class="o">);</span>
<span class="nc">Tokenizer</span> <span class="n">source</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">OpenNLPTokenizer</span><span class="o">(</span>
<span class="nc">AttributeFactory</span><span class="o">.</span><span class="na">DEFAULT_ATTRIBUTE_FACTORY</span><span class="o">,</span> <span class="n">sentenceDetectorOp</span><span class="o">,</span> <span class="n">tokenizerOp</span><span class="o">);</span>
<span class="nc">POSModel</span> <span class="n">posModel</span> <span class="o">=</span> <span class="nc">OpenNLPOpsFactory</span><span class="o">.</span><span class="na">getPOSTaggerModel</span><span class="o">(</span><span class="s">"en-pos-maxent.bin"</span><span class="o">,</span> <span class="n">resourceLoader</span><span class="o">);</span>
<span class="nc">NLPPOSTaggerOp</span> <span class="n">posTaggerOp</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">NLPPOSTaggerOp</span><span class="o">(</span><span class="n">posModel</span><span class="o">);</span>
<span class="nc">TokenFilter</span> <span class="n">filter</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">OpenNLPPOSFilter</span><span class="o">(</span><span class="n">source</span><span class="o">,</span> <span class="n">posTaggerOp</span><span class="o">);</span>
<span class="k">return</span> <span class="k">new</span> <span class="nf">TokenStreamComponents</span><span class="o">(</span><span class="n">source</span><span class="o">,</span> <span class="k">new</span> <span class="nc">TypeAsSynonymFilter</span><span class="o">(</span><span class="n">filter</span><span class="o">));</span>
<span class="o">}</span>
<span class="k">catch</span> <span class="o">(</span><span class="nc">IOException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// Do something...</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p><em>TypeAsSynonymFilter</em> performs a little trick behind the scenes. It adds the type information as a synonym to the term attribute. This means that going forward an adjective like [quick] and the POS type token [JJ] will be treated as synonyms.</p>
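<p>Conceptually, the indexed token stream for our first example sentence now carries two terms at each position, roughly like this (a sketch, not actual Lucene output):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Position:  1     2       3       4     ...
Word:      [The] [quick] [brown] [fox] ...
Synonym:   [DT]  [JJ]    [JJ]    [NN]  ...
</code></pre></div></div>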
<p>Let’s execute the same search as before. This time the following results are returned:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">body:</span><span class="no">JJ</span> <span class="nl">body:</span><span class="no">NN</span>
<span class="no">JJ</span> <span class="no">NN</span> <span class="o">=></span> <span class="mi">2</span> <span class="n">hits</span>
<span class="nc">The</span> <span class="n">quick</span> <span class="n">brown</span> <span class="n">fox</span> <span class="n">jumped</span> <span class="n">over</span> <span class="n">the</span> <span class="n">lazy</span> <span class="n">baby</span><span class="o">.</span>
<span class="nc">Mr</span><span class="o">.</span> <span class="nc">Robot</span> <span class="n">is</span> <span class="n">a</span> <span class="n">great</span> <span class="no">TV</span><span class="o">-</span><span class="n">series</span><span class="o">.</span>
</code></pre></div></div>
<p>Congratulations, we found two indexed documents containing both an adjective and a noun. Since search terms are <strong>OR</strong>ed together, the order of our POS tags in the query does not matter: “JJ NN” will yield the same results as “NN JJ”. Furthermore, since words and their POS tags are treated as synonyms, we can also mix them in our query. Note however that the adjective “brown” might be treated synonymously with the POS tag “JJ”, but the opposite is not true!</p>
<p>Maybe, if I find the time, I’ll write another piece on how to introduce order on our query terms. Searching for “JJ JJ NN” should not simply OR the terms together, but should preserve the sequence of query terms.</p>
<hr />
<h2 id="other-articles">Other articles</h2>
<h3 id="previous">Previous</h3>
<ul>
<li>Part 1: <a href="/2018/09/08/introduction-to-lucene-opennlp-part1/">Introduction to Lucene 7 OpenNLP - Part 1</a></li>
</ul>
<h3 id="next">Next</h3>
<ul>
<li>Part 3: <a href="/2018/10/06/introduction-to-lucene-opennlp-part3/">Introduction to Lucene 7 OpenNLP - Part 3</a></li>
<li>Part 4: <a href="/2018/10/07/introduction-to-lucene-opennlp-part4/">Introduction to Lucene 7 OpenNLP - Part 4</a></li>
</ul>
Introduction to Lucene 7 OpenNLP - Part 12018-09-08T00:00:00+02:00http://fabian-kostadinov.github.io/2018/09/08/introduction-to-lucene-opennlp-part1<p>I’ve carried the idea to use OpenNLP to do part-of-speech tagging and index the POS tags with Lucene around with me for quite some time. Turns out Lucene 7 comes shipped with support for OpenNLP. Of course I had to try it out.<span class="more"></span>
Before starting, I highly recommend carefully reading through the official documentation of the Lucene analysis package: <a href="https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/analysis/package-summary.html">https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/analysis/package-summary.html</a>.</p>
<p>First, I created a new Java project in my IDE and added these two dependencies to my Maven’s pom.xml:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><dependency></span>
<span class="nt"><groupId></span>org.apache.lucene<span class="nt"></groupId></span>
<span class="nt"><artifactId></span>lucene-core<span class="nt"></artifactId></span>
<span class="nt"><version></span>7.4.0<span class="nt"></version></span>
<span class="nt"></dependency></span>
<span class="nt"><dependency></span>
<span class="nt"><groupId></span>org.apache.lucene<span class="nt"></groupId></span>
<span class="nt"><artifactId></span>lucene-analyzers-opennlp<span class="nt"></artifactId></span>
<span class="nt"><version></span>7.4.0<span class="nt"></version></span>
<span class="nt"></dependency></span>
</code></pre></div></div>
<p>Lucene uses a separate library just for the OpenNLP integration called <em>lucene-analyzers-opennlp</em>.</p>
<p>Let’s briefly review the default way to index a document using a StandardAnalyzer. According to the official documentation:</p>
<blockquote>
<p>An Analyzer is responsible for supplying a TokenStream which can be consumed by the indexing and searching processes.</p>
</blockquote>
<p>And further below:</p>
<blockquote>
<p>The Analyzer is a factory for analysis chains. Analyzers don’t process text, Analyzers construct CharFilters, Tokenizers, and/or TokenFilters that process text.</p>
</blockquote>
<p>One point to remember is that if a document contains several fields that should be added to the index, by default the same analyzer is applied to all indexed fields. <br />
So, here’s the standard way of indexing documents in Lucene.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Step 0: This is the text to be analyzed, i.e. indexed.</span>
<span class="kd">final</span> <span class="nc">String</span> <span class="n">text</span> <span class="o">=</span> <span class="s">"The quick brown fox jumped over the lazy dogs."</span><span class="o">;</span>
<span class="c1">// Step 1: Create a directory, we'll use an in-memory directory for this purpose.</span>
<span class="nc">Directory</span> <span class="n">memoryIndex</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">RAMDirectory</span><span class="o">();</span>
<span class="c1">// Step 2: Create a new analyzer. StandardAnalyzer is very powerful and is sufficient for most cases.</span>
<span class="c1">// However, with OpenNLP we will implement our own analyzer further below.</span>
<span class="nc">StandardAnalyzer</span> <span class="n">analyzer</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">StandardAnalyzer</span><span class="o">();</span>
<span class="c1">// Step 3: Create an index writer config object, it takes an analyzer as its parameter.</span>
<span class="nc">IndexWriterConfig</span> <span class="n">indexWriterConfig</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">IndexWriterConfig</span><span class="o">(</span><span class="n">analyzer</span><span class="o">);</span>
<span class="c1">// Step 4: Create the actual index writer giving it the directory object and the index writer config object.</span>
<span class="nc">IndexWriter</span> <span class="n">writer</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">IndexWriter</span><span class="o">(</span><span class="n">memoryIndex</span><span class="o">,</span> <span class="n">indexWriterConfig</span><span class="o">);</span>
<span class="c1">// Step 5: We can now create new documents, and add some text fields to be indexed.</span>
<span class="nc">Document</span> <span class="n">document</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Document</span><span class="o">();</span>
<span class="n">document</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="k">new</span> <span class="nc">TextField</span><span class="o">(</span><span class="s">"text"</span><span class="o">,</span> <span class="n">body</span><span class="o">,</span> <span class="nc">Field</span><span class="o">.</span><span class="na">Store</span><span class="o">.</span><span class="na">YES</span><span class="o">));</span>
<span class="c1">// Step 6: This line will effectively invoke the indexing process and send the new document to the writer.</span>
<span class="n">writer</span><span class="o">.</span><span class="na">addDocument</span><span class="o">(</span><span class="n">document</span><span class="o">);</span>
<span class="c1">// Step 7: The document was indexed, and we can now close the writer.</span>
<span class="n">writer</span><span class="o">.</span><span class="na">close</span><span class="o">();</span>
</code></pre></div></div>
<p>After the indexing phase is over, we could now trigger queries on the indexed documents. A QueryParser also needs to apply the same type of analyzer to the query, otherwise it would not be possible to match the parsed query to the indexed documents. More details on querying will be covered in a later article.</p>
<p>The analysis process converts text into indexable and thus searchable tokens. This works in several phases or layers.</p>
<ol>
<li>First, a reader object reads a stream of individual characters. At this level, a Lucene <em>CharFilter</em> could be used. A good example would be to drop all non-printable characters occurring in an input stream.</li>
<li>Next, a <em>Tokenizer</em> assembles characters until it recognizes a token. Hence it returns a token stream.</li>
<li>Finally, a <em>TokenFilter</em> can be applied to discard or modify tokens. Imagine that for anonymization you needed to replace all person names in a stream of tokens with a unique random string; this would probably be a good application for a token filter. Note that behind the scenes <em>TokenFilter</em> actually extends <em>TokenStream</em>, just like <em>Tokenizer</em> does.</li>
</ol>
<p>Whereas using char filters and token filters is optional, using a tokenizer is not.</p>
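<p>As a minimal illustration of such a chain, here is a toy analyzer that tokenizes on whitespace and then lower-cases every token. It assumes two stock classes from the lucene-analyzers-common library in their Lucene 7.x package locations:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class WhitespaceLowerCaseAnalyzer extends Analyzer {
    protected TokenStreamComponents createComponents(String fieldName) {
        // The tokenizer is the source of the analysis chain...
        Tokenizer source = new WhitespaceTokenizer();
        // ...and each token filter wraps the token stream produced so far.
        TokenStream result = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, result);
    }
}
</code></pre></div></div>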
<p>The next concept we need to understand is attributes. The best explanation of what Lucene’s attributes are I found in <a href="https://citrine.io/2015/02/15/building-a-custom-analyzer-in-lucene/">this tutorial</a>.</p>
<blockquote>
<p>Lucene uses attributes to store information about a single token. For this tokenizer, the only attribute that we are going to use is the CharTermAttribute, which can store the text for the token that is generated. Other types of attributes exist (see interfaces and classes derived from org.apache.lucene.util.Attribute); we will use some of these other attributes when we build our custom token filter. It is important that you register attributes, whatever their type, using the addAttribute() function.</p>
</blockquote>
<p>What sort of token information could one be possibly interested in to store? Well, there’s plenty.</p>
<ul>
<li>The start and end character position or the token position of the token in the sentence or text. Welcome <em>OffsetAttribute</em> and <em>PositionIncrementAttribute</em>.</li>
<li>The token string (i.e. the word) itself. Use <em>CharTermAttribute</em>.</li>
<li>Whether the token is a word, or perhaps only a punctuation. <em>TypeAttribute</em> is your choice.</li>
<li>Timestamp when it was processed or indexed. No library class available, build your own!</li>
</ul>
<p>We wanted to do part-of-speech tagging, remember? So, yet another attribute we might be interested in is to store the POS information on the particular token. We have two options: Build our own POS attribute or, my preferred choice for the rest of the tutorial, reuse the existing <em>TypeAttribute</em>.</p>
<p>In order to do part-of-speech tagging we will use a token filter that for each token read from the token stream adds an additional type to the indexed data containing the part-of-speech tag for that token or word. For example, take this input sentence: <em>The quick brown fox jumped over the lazy dogs.</em> A tokenizer splits the sentence into tokens:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[The][quick][brown][fox][jumped][over][the][lazy][dogs][.]
</code></pre></div></div>
<p>(To keep the example simple, I will omit lower-casing, stopword-removal, stemming/lemmatization and other typical text preprocessing steps here.) In a second step, for each token consumed a POS token filter adds a part-of-speech tag as type attribute to the indexed data:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[DT][JJ][JJ][NN][VBD][IN][DT][JJ][NNS][.]
</code></pre></div></div>
<p>OpenNLP uses POS labels from the <a href="https://www.clips.uantwerpen.be/pages/mbsp-tags">Penn Treebank II tag set</a>, for example:</p>
<table>
<thead>
<tr>
<th>Token</th>
<th>Treebank POS tag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>The</td>
<td>DT</td>
<td>Determiner</td>
</tr>
<tr>
<td>quick</td>
<td>JJ</td>
<td>Adjective</td>
</tr>
<tr>
<td>brown</td>
<td>JJ</td>
<td>Adjective</td>
</tr>
<tr>
<td>fox</td>
<td>NN</td>
<td>Noun (singular)</td>
</tr>
<tr>
<td>jumped</td>
<td>VBD</td>
<td>Verb (past-tense)</td>
</tr>
<tr>
<td>over</td>
<td>IN</td>
<td>Preposition</td>
</tr>
<tr>
<td>the</td>
<td>DT</td>
<td>Determiner</td>
</tr>
<tr>
<td>lazy</td>
<td>JJ</td>
<td>Adjective</td>
</tr>
<tr>
<td>dogs</td>
<td>NNS</td>
<td>Noun (plural)</td>
</tr>
<tr>
<td>.</td>
<td>.</td>
<td>Punctuation</td>
</tr>
</tbody>
</table>
<p>In a later article, we’ll see how we can search on these POS tags. For now I will focus on how we can achieve the above result.</p>
<p>Lucene’s OpenNLP library contains only the basic building blocks for creating your own analyzer; it does not contain a proper analyzer itself. For this reason, I created my own by extending Lucene’s abstract Analyzer class.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.Analyzer</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.Tokenizer</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.TokenFilter</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">OpenNLPAnalyzer</span> <span class="kd">extends</span> <span class="nc">Analyzer</span> <span class="o">{</span>
<span class="kd">protected</span> <span class="nc">TokenStreamComponents</span> <span class="nf">createComponents</span><span class="o">(</span><span class="nc">String</span> <span class="n">fieldName</span><span class="o">)</span> <span class="o">{</span>
<span class="nc">Tokenizer</span> <span class="n">tokenizer</span> <span class="o">=</span> <span class="o">...;</span> <span class="c1">// TODO</span>
<span class="nc">TokenFilter</span> <span class="n">tokenFilter</span> <span class="o">=</span> <span class="o">...;</span> <span class="c1">// TODO</span>
<span class="k">return</span> <span class="k">new</span> <span class="nf">TokenStreamComponents</span><span class="o">(</span><span class="n">tokenizer</span><span class="o">,</span> <span class="n">tokenFilter</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>As mentioned above, the Analyzer’s task is to create a token stream, but not to execute it (just yet). Creation is done in the <em>createComponents</em> method. In case we want to apply additional filters to the token stream, the class <em>TokenStreamComponents</em> can be used. The constructor of <em>TokenStreamComponents</em> takes two arguments: a tokenizer and a token filter (which, as said above, is a <em>TokenStream</em> too). Note that in many cases it’s not necessary to make use of the method argument <em>String fieldName</em>.</p>
<p>First thing we need is a Tokenizer. Lucene OpenNLP library has a class OpenNLPTokenizer:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">package</span> <span class="nn">org.apache.lucene.analysis.opennlp</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">OpenNLPTokenizer</span> <span class="kd">extends</span> <span class="nc">SegmentingTokenizerBase</span> <span class="cm">/* which itself extends Tokenizer */</span> <span class="o">{</span>
<span class="kd">public</span> <span class="nf">OpenNLPTokenizer</span><span class="o">(</span><span class="nc">AttributeFactory</span> <span class="n">factory</span><span class="o">,</span> <span class="nc">NLPSentenceDetectorOp</span> <span class="n">sentenceOp</span><span class="o">,</span> <span class="nc">NLPTokenizerOp</span> <span class="n">tokenizerOp</span><span class="o">)</span> <span class="kd">throws</span> <span class="nc">IOException</span> <span class="o">{</span>
<span class="c1">// ....</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>The constructor takes three arguments:</p>
<ul>
<li><em>AttributeFactory</em>: An attribute factory is simply a factory that creates instances of <em>Attribute</em>. Lucene has a very nice default implementation that is sufficient in most cases: <em>AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY</em>. A Lucene attribute consists of an interface and an implementing class with the suffix ‘Impl’. The default attribute factory uses reflection to find the proper implementing class for an attribute interface. The convention is: for an interface <em>FooAttribute extends Attribute</em> there must exist an implementation class <em>FooAttributeImpl extends AttributeImpl implements FooAttribute</em> (see the sketch right after this list). You will have to write your own attribute factory only if you plan not to follow this naming convention for some reason.</li>
<li><em>NLPSentenceDetectorOp</em>: This is a class performing sentence detection provided by Lucene OpenNLP library.</li>
<li><em>NLPTokenizerOp</em>: This is a class performing tokenization provided by Lucene OpenNLP library.</li>
</ul>
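<p>To illustrate the naming convention, here is a minimal sketch of a hypothetical custom attribute. The names <em>PosTagAttribute</em> and <em>PosTagAttributeImpl</em> are made up, and each type would live in its own source file:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;
import org.apache.lucene.util.AttributeReflector;

// The interface the analysis chain programs against.
public interface PosTagAttribute extends Attribute {
    void setTag(String tag);
    String getTag();
}

// The implementation that the default attribute factory locates
// purely through the FooAttribute/FooAttributeImpl naming convention.
public class PosTagAttributeImpl extends AttributeImpl implements PosTagAttribute {
    private String tag = "";

    public void setTag(String tag) { this.tag = tag; }
    public String getTag() { return tag; }

    public void clear() { tag = ""; }

    public void reflectWith(AttributeReflector reflector) {
        reflector.reflect(PosTagAttribute.class, "tag", tag);
    }

    public void copyTo(AttributeImpl target) {
        ((PosTagAttribute) target).setTag(tag);
    }
}
</code></pre></div></div>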
<p>In order to instantiate all the NLP*Op classes we can use OpenNLPOpsFactory’s static create methods. These methods load the corresponding <a href="https://opennlp.apache.org/models.html">language-specific OpenNLP model files</a> (see also <a href="http://opennlp.sourceforge.net/models-1.5/">here</a>) using a resource loader.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">opennlp.tools.tokenize.TokenizerModel</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">opennlp.tools.sentdetect.SentenceModel</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.tools.NLPSentenceDetectorOp</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.tools.NLPTokenizerOp</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.tools.OpenNLPOpsFactory</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.OpenNLPTokenizer</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.util.ClasspathResourceLoader</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.util.ResourceLoader</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.util.AttributeFactory</span><span class="o">;</span>
<span class="c1">// Some code...</span>
<span class="c1">// Get the current resource loader</span>
<span class="nc">ResourceLoader</span> <span class="n">resourceLoader</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">ClasspathResourceLoader</span><span class="o">(</span><span class="nc">ClassLoader</span><span class="o">.</span><span class="na">getSystemClassLoader</span><span class="o">());</span>
<span class="c1">// Load OpenNLP's tokenizer model file using the resource loader and the OpenNLPOpsFactory</span>
<span class="nc">TokenizerModel</span> <span class="n">tokenizerModel</span> <span class="o">=</span> <span class="nc">OpenNLPOpsFactory</span><span class="o">.</span><span class="na">getTokenizerModel</span><span class="o">(</span><span class="s">"en-token.bin"</span><span class="o">,</span> <span class="n">resourceLoader</span><span class="o">);</span>
<span class="c1">// Instantiate the NLPTokenizerOp using the model</span>
<span class="nc">NLPTokenizerOp</span> <span class="n">tokenizerOp</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">NLPTokenizerOp</span><span class="o">(</span><span class="n">tokenizerModel</span><span class="o">);</span>
<span class="c1">// Load OpenNLP's sentence detection model file using the resource loader and the OpenNLPOpsFactory</span>
<span class="nc">SentenceModel</span> <span class="n">sentenceModel</span> <span class="o">=</span> <span class="nc">OpenNLPOpsFactory</span><span class="o">.</span><span class="na">getSentenceModel</span><span class="o">(</span><span class="s">"en-sent.bin"</span><span class="o">,</span> <span class="n">resourceLoader</span><span class="o">);</span>
<span class="c1">// Instantiate the NLPSentenceDetectorOp using the model</span>
<span class="nc">NLPSentenceDetectorOp</span> <span class="n">sentenceDetectorOp</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">NLPSentenceDetectorOp</span><span class="o">(</span><span class="n">sentenceModel</span><span class="o">);</span>
<span class="c1">// Instantiate the OpenNLPTokenizer</span>
<span class="nc">Tokenizer</span> <span class="n">tokenizer</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">OpenNLPTokenizer</span><span class="o">(</span><span class="nc">AttributeFactory</span><span class="o">.</span><span class="na">DEFAULT_ATTRIBUTE_FACTORY</span><span class="o">,</span> <span class="n">sentenceDetectorOp</span><span class="o">,</span> <span class="n">tokenizerOp</span><span class="o">);</span>
</code></pre></div></div>
<p>Awesome, we have a tokenizer now that can detect sentences and create word tokens. Next we want to create our POS token filter. After what we’ve seen, it’s pretty straightforward.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">opennlp.tools.postag.POSModel</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.TokenFilter</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.OpenNLPPOSFilter</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.tools.NLPPOSTaggerOp</span><span class="o">;</span>
<span class="c1">// Some code...</span>
<span class="c1">// Load OpenNLP's POS model file using the resource loader and the OpenNLPOpsFactory. Consult</span>
<span class="c1">// the OpenNLP documentation for the difference between maxent and perceptron model.</span>
<span class="nc">POSModel</span> <span class="n">posModel</span> <span class="o">=</span> <span class="nc">OpenNLPOpsFactory</span><span class="o">.</span><span class="na">getPOSTaggerModel</span><span class="o">(</span><span class="s">"en-pos-maxent.bin"</span><span class="o">,</span> <span class="n">resourceLoader</span><span class="o">);</span>
<span class="c1">// Instantiate the NLPPOSTaggerOp using the model</span>
<span class="nc">NLPPOSTaggerOp</span> <span class="n">posTaggerOp</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">NLPPOSTaggerOp</span><span class="o">(</span><span class="n">posModel</span><span class="o">);</span>
<span class="c1">// Instantiate the token filter using the OpenNLPTokenizer and the NLPPOSTaggerOp instances</span>
<span class="nc">TokenFilter</span> <span class="n">filter</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">OpenNLPPOSFilter</span><span class="o">(</span><span class="n">tokenizer</span><span class="o">,</span> <span class="n">posTaggerOp</span><span class="o">);</span>
</code></pre></div></div>
<p>Nice! We can put everything together now.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">package</span> <span class="nn">com.example</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">opennlp.tools.postag.POSModel</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">opennlp.tools.sentdetect.SentenceModel</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">opennlp.tools.tokenize.TokenizerModel</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.Analyzer</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.TokenFilter</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.Tokenizer</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.OpenNLPPOSFilter</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.OpenNLPTokenizer</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.tools.NLPPOSTaggerOp</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.tools.NLPSentenceDetectorOp</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.tools.NLPTokenizerOp</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.opennlp.tools.OpenNLPOpsFactory</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.util.ClasspathResourceLoader</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.util.ResourceLoader</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.util.AttributeFactory</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.io.IOException</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">OpenNLPAnalyzer</span> <span class="kd">extends</span> <span class="nc">Analyzer</span> <span class="o">{</span>
<span class="kd">protected</span> <span class="nc">TokenStreamComponents</span> <span class="nf">createComponents</span><span class="o">(</span><span class="nc">String</span> <span class="n">fieldName</span><span class="o">)</span> <span class="o">{</span>
<span class="k">try</span> <span class="o">{</span>
<span class="nc">ResourceLoader</span> <span class="n">resourceLoader</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">ClasspathResourceLoader</span><span class="o">(</span><span class="nc">ClassLoader</span><span class="o">.</span><span class="na">getSystemClassLoader</span><span class="o">());</span>
<span class="nc">TokenizerModel</span> <span class="n">tokenizerModel</span> <span class="o">=</span> <span class="nc">OpenNLPOpsFactory</span><span class="o">.</span><span class="na">getTokenizerModel</span><span class="o">(</span><span class="s">"en-token.bin"</span><span class="o">,</span> <span class="n">resourceLoader</span><span class="o">);</span>
<span class="nc">NLPTokenizerOp</span> <span class="n">tokenizerOp</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">NLPTokenizerOp</span><span class="o">(</span><span class="n">tokenizerModel</span><span class="o">);</span>
<span class="nc">SentenceModel</span> <span class="n">sentenceModel</span> <span class="o">=</span> <span class="nc">OpenNLPOpsFactory</span><span class="o">.</span><span class="na">getSentenceModel</span><span class="o">(</span><span class="s">"en-sent.bin"</span><span class="o">,</span> <span class="n">resourceLoader</span><span class="o">);</span>
<span class="nc">NLPSentenceDetectorOp</span> <span class="n">sentenceDetectorOp</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">NLPSentenceDetectorOp</span><span class="o">(</span><span class="n">sentenceModel</span><span class="o">);</span>
<span class="nc">Tokenizer</span> <span class="n">source</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">OpenNLPTokenizer</span><span class="o">(</span>
<span class="nc">AttributeFactory</span><span class="o">.</span><span class="na">DEFAULT_ATTRIBUTE_FACTORY</span><span class="o">,</span> <span class="n">sentenceDetectorOp</span><span class="o">,</span> <span class="n">tokenizerOp</span><span class="o">);</span>
<span class="nc">POSModel</span> <span class="n">posModel</span> <span class="o">=</span> <span class="nc">OpenNLPOpsFactory</span><span class="o">.</span><span class="na">getPOSTaggerModel</span><span class="o">(</span><span class="s">"en-pos-maxent.bin"</span><span class="o">,</span> <span class="n">resourceLoader</span><span class="o">);</span>
<span class="nc">NLPPOSTaggerOp</span> <span class="n">posTaggerOp</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">NLPPOSTaggerOp</span><span class="o">(</span><span class="n">posModel</span><span class="o">);</span>
<span class="nc">TokenFilter</span> <span class="n">filter</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">OpenNLPPOSFilter</span><span class="o">(</span><span class="n">source</span><span class="o">,</span> <span class="n">posTaggerOp</span><span class="o">);</span>
<span class="k">return</span> <span class="k">new</span> <span class="nf">TokenStreamComponents</span><span class="o">(</span><span class="n">source</span><span class="o">,</span> <span class="n">filter</span><span class="o">);</span>
<span class="o">}</span>
<span class="k">catch</span> <span class="o">(</span><span class="nc">IOException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// Do something...</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>This is it, we’ve created our own (POS) analyzer for Lucene. Let’s see it in action. Note that our analyzer writes POS tags into the token’s TypeAttribute.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">com.example.OpenNLPanalyzer</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.TokenStream</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.tokenattributes.CharTermAttribute</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.lucene.analysis.tokenattributes.TypeAttribute</span><span class="o">;</span>
<span class="c1">// Some code...</span>
<span class="kd">final</span> <span class="nc">String</span> <span class="n">text</span> <span class="o">=</span> <span class="s">"The quick brown fox jumped over the lazy dogs."</span><span class="o">;</span>
<span class="nc">OpenNLPAnalyzer</span> <span class="n">analyzer</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">OpenNLPAnalyzer</span><span class="o">();</span>
<span class="c1">// We're using a string reader to return a token stream. This allows us to observe the tokens</span>
<span class="c1">// while they are being processed by our analyzer.</span>
<span class="nc">TokenStream</span> <span class="n">stream</span> <span class="o">=</span> <span class="n">analyzer</span><span class="o">.</span><span class="na">tokenStream</span><span class="o">(</span><span class="s">"field"</span><span class="o">,</span> <span class="k">new</span> <span class="nc">StringReader</span><span class="o">(</span><span class="n">text</span><span class="o">));</span>
<span class="c1">// CharTermAttribute will contain the actual token word</span>
<span class="nc">CharTermAttribute</span> <span class="n">termAtt</span> <span class="o">=</span> <span class="n">stream</span><span class="o">.</span><span class="na">addAttribute</span><span class="o">(</span><span class="nc">CharTermAttribute</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="c1">// TypeAttribute will contain the OpenNLP POS (Treebank II) tag</span>
<span class="nc">TypeAttribute</span> <span class="n">typeAtt</span> <span class="o">=</span> <span class="n">stream</span><span class="o">.</span><span class="na">addAttribute</span><span class="o">(</span><span class="nc">TypeAttribute</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="k">try</span> <span class="o">{</span>
<span class="n">stream</span><span class="o">.</span><span class="na">reset</span><span class="o">();</span>
<span class="c1">// Print all tokens until stream is exhausted</span>
<span class="k">while</span> <span class="o">(</span><span class="n">stream</span><span class="o">.</span><span class="na">incrementToken</span><span class="o">())</span> <span class="o">{</span>
<span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">termAtt</span><span class="o">.</span><span class="na">toString</span><span class="o">()</span> <span class="o">+</span> <span class="s">": "</span> <span class="o">+</span> <span class="n">typeAtt</span><span class="o">.</span><span class="na">type</span><span class="o">());</span>
<span class="o">}</span>
<span class="n">stream</span><span class="o">.</span><span class="na">end</span><span class="o">();</span>
<span class="o">}</span> <span class="k">finally</span> <span class="o">{</span>
<span class="n">stream</span><span class="o">.</span><span class="na">close</span><span class="o">();</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Printed output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The: DT
quick: JJ
brown: JJ
fox: NN
jumped: VBD
over: IN
the: DT
lazy: JJ
dogs: NNS
.: .
</code></pre></div></div>
<p>In <a href="/2018/10/01/introduction-to-lucene-opennlp-part2/">part 2 of the tutorial</a> we’ll have a look at how to search on the POS tags.</p>
<hr />
<h2 id="other-articles">Other articles</h2>
<h3 id="next">Next</h3>
<ul>
<li>Part 2: <a href="/2018/10/01/introduction-to-lucene-opennlp-part2/">Introduction to Lucene 7 OpenNLP - Part 2</a></li>
<li>Part 3: <a href="/2018/10/06/introduction-to-lucene-opennlp-part3/">Introduction to Lucene 7 OpenNLP - Part 3</a></li>
<li>Part 4: <a href="/2018/10/07/introduction-to-lucene-opennlp-part4/">Introduction to Lucene 7 OpenNLP - Part 4</a></li>
</ul>
<h3 id="official-lucene-documentation">Official Lucene documentation</h3>
<p>The best way to understand Lucene is probably to read through its official documentation. Each Lucene package ships with documentation of the topics relevant for that package. I recommend reading through these for a start:</p>
<ol>
<li><a href="https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/document/package-summary.html#package.description">Documentation for package org.apache.lucene.document</a></li>
<li><a href="https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/package-summary.html">Documentation for package org.apache.lucene.index</a></li>
<li><a href="https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/analysis/package-summary.html">Documentation for package org.apache.Lucene.analysis</a></li>
<li><a href="https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/search/package-summary.html">Documentation for package org.apache.lucene.search</a></li>
<li><a href="https://lucene.apache.org/core/7_4_0/core/overview-summary.html">Documentation for other Lucene packages</a></li>
</ol>
<h3 id="various">Various</h3>
<ul>
<li><a href="https://github.com/apache/lucene-solr/blob/master/solr/solr-ref-guide/src/language-analysis.adoc#opennlp-integration">SolR’s OpenNLP integration for language analysis</a></li>
<li><a href="https://opensourceconnections.com/blog/2018/08/06/intro_solr_nlp_integrations/">Exploring SolR’s OpenNLP integrations</a></li>
<li><a href="https://citrine.io/2015/02/15/building-a-custom-analyzer-in-lucene/">Building a custom analyzer, tokenizer and token filter in Lucene</a></li>
<li><a href="https://www.toptal.com/database/full-text-search-of-dialogues-with-apache-lucene">Full Text Search of Dialogues with Apache Lucene: A Tutorial</a></li>
<li><a href="http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html">Lucene’s TokenStreams are actually graphs!</a></li>
</ul>
Rhizomes as transistors2017-02-05T00:00:00+01:00http://fabian-kostadinov.github.io/2017/02/05/rhizomes-as-transistors<p>There is a natural similarity between a rhizome relation and how a modern transistor works. As it turns out, we could actually build a rhizome (with limitations) using transistors. Have a look at the following example.<span class="more"></span>
Every transistor consists of a collector, an emitter and a base.</p>
<p><img src="/public/img/2017-02-05-rhizome-relation-as-transistor.png" alt="Two rhizome relations as a transistor" title="Two rhizome relations used as a transistor" /></p>
<ul>
<li>r2 <= (r3, r4): We will call relatum r3 the <em>collector</em>, and relatum r4 the <em>emitter</em>. Another relation’s emitter, r2, will act as the <em>base</em>.</li>
<li>r0 <= (r1, r2): A second relation is used to act as the <em>base</em> for r2. Note that this relation again can be understood as consisting of a collector r1 and an emitter r2, yet without a base.</li>
</ul>
<p>Every relation can either conduct a current (it can “fire”, to use a more neuroscientific term) or not.</p>
<h1 id="building-a-not-gate-inverter-with-rhizome-relations">Building a NOT-gate (inverter) with rhizome relations</h1>
<p>How then can we use a rhizome to compute? Well, knowing just the basics of electronics is sufficient to understand the idea.</p>
<p>By adding more relations, we can actually build a NOT-gate (inverter). Let’s assume the same relations as before. This time, however, we add two more relations.</p>
<ul>
<li>r2 <= (r3, r4): With r3 being the collector, r4 the emitter and r2 the place where the base is attached. The base itself is again represented as r0 <= (r1, r2).</li>
<li>r6 <= (r3, r7): With r3 being the collector, r7 the emitter and r6 the place where the base is attached. The base itself is again represented as r5 <= (r1, r6).</li>
</ul>
<p><img src="/public/img/2017-02-05-not-gate.png" alt="NOT Gate" title="NOT gate" /></p>
<p>In the collector r1 we either conduct a current through r0 to emitter r2, or we conduct a current through r5 to emitter r6. At the same time, we conduct a current from collector r3 through r2 to emitter r4 as well as from r3 through r6 to emitter r7. Depending on which base is currently active, either r4 or r7 will be active, but not both at the same time. Hence, we have created a simple logical NOT gate.</p>
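<p>To see the switching behaviour in one place, here is a small toy sketch in Java; it is my own illustration and not part of the rhizome formalism. An emitter fires exactly when both its collector and its base fire, so with complementary bases the two outputs r4 and r7 are always each other’s inverse:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java">// Toy model: an emitter conducts iff its collector and its base both fire.
static boolean emitter(boolean collector, boolean base) {
    return collector && base;
}
// r3 is the common collector; the two bases are driven complementarily.
boolean r3 = true;
boolean baseOfR2 = true;            // current through r0 <= (r1, r2)
boolean baseOfR6 = !baseOfR2;       // current through r5 <= (r1, r6)
boolean r4 = emitter(r3, baseOfR2); // fires
boolean r7 = emitter(r3, baseOfR6); // does not fire: r7 == !r4 while r3 fires</code></pre></figure>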
<p>Now, the situation is actually not as simple as I just described. In a way, I cheated a little. I have left two questions unanswered so far: First, how is it decided in collector r1 whether to conduct a current through one or the other relation? Second, what if a current were suddenly conducted (or not conducted) through both relations at the same time?<br />
To answer the first question: The decision to activate either the relation r0 <= (r1, r2) or r5 <= (r1, r6), i.e. the “base relations” from the perspective of r2 and r6, is not stored or controlled from inside the depicted rhizome at all. It must actually be the application’s responsibility to decide how and when to switch the current between one or the other base relation. In the simplest possible case we can use another, existing common hardware NOT gate that can produce exactly one or the other output (but not both at the same time). <br />
In a more sophisticated world, however, we could separate the two decisions whether to activate one or the other relation. They could then be activated independently of each other. This immediately leads to an interesting paradox: the emitters r4 and r7 could potentially be activated or deactivated <em>at the same time</em>. In other words, the system would <strong>at the same time produce two output signals, that is, a value and its negation</strong>, for example <code>true</code> and <code>false</code> (or alternatively <code>false</code> and <code>true</code>) concurrently. From the perspective of traditional boolean logic this makes no sense at all. First, something and its negation cannot coexist; it’s either one or the other and nothing else (<a href="https://en.wikipedia.org/wiki/Law_of_excluded_middle">law of the excluded middle</a>). Second, there cannot be two values at the same time (<a href="https://en.wikipedia.org/wiki/Law_of_noncontradiction">law of noncontradiction</a>). Yet, this is exactly what our rhizome produces in such a situation. So, what do we make of it?<br />
I will skip the philosophical questions that may arise and focus only on the technical aspects. Basically, what is required is a strategy to handle such situations. Note that no such strategy is predefined anywhere in the existing rhizome, so the strategy must again come from the outside; it cannot be predetermined by the rhizome itself. Unfortunately, I must leave the question unanswered where this strategy actually comes from or how exactly it is represented, as this is not entirely clear to me either at this point. The goal must be to establish one or several new relations of higher order than the existing relations, which actually determine how the situation must be handled. For example, the strategy may establish a new relation r8 <= (r0, r5) or r8 <= (r5, r0), or maybe a relation between r2 and r6, or perhaps directly between r4 and r7. Whenever an unprecedented or contradictory situation occurs, new relations are added to the rhizome to handle exactly this situation. Perhaps new relations could be added randomly, and those that are unused for some time or that do not lead to successful strategic behaviour are eventually forgotten. Whatever the case, as a consequence the rhizome grows in size over time. Yet, relations of higher order tend to also handle more complexity, so assuming a relatively constant exposure to outside stimuli the growth should follow a logarithmic pattern.</p>
Yes, but I'm a software engineer...2016-11-29T00:00:00+01:00http://fabian-kostadinov.github.io/2016/11/29/yes-but-i-m-a-software-engineer<p>Being a software engineer is not an excuse to complain about your data science colleagues. You don’t have to become an expert in machine-learning or statistical analysis, but it’s actually a lot of fun to dive a little deeper into some of these topics and learn more about, let’s say, categorisation algorithms. And it even looks sexy in your CV.<span class="more"></span></p>
<p>Just because Java is your first language does not mean it’s so much cooler/better/more stable than Python or R. Have you ever tried to implement advanced matrix algebra in Java? It’s a pain.</p>
<p>I know there exists Apache Math, which claims to do exactly that. But this still does not change the fact that it’s a pain to do data science in Java.</p>
<p>Knowing how the k-means algorithm works does not yet make you a data scientist. There’s quite a difference between someone who understands the underlying math and someone who does not.</p>
<p>Being a software engineer you should actually be concerned about your data scientist colleagues. There are plenty of areas where they can profit from your help, for example checking in code to Git.</p>
<p>Not everyone needs to understand how application servers work. They’re anyway somewhat outdated, I’d say. Not that they will disappear suddenly, but do you really need an application container when you can have <a href="https://github.com/docker/docker">Docker containers</a>?</p>
<p>Your data science colleagues might not know what a data lake is. Nor why they should bother about it. Only once they’ve seen it will they come to love it. So, don’t count on their enthusiastic support during the inception or implementation phase.</p>
<p>You should learn Scala. (And while you’re at it, add a bit of Clojure too. It’s like taking LSD. It will open your eyes to a reality that’s truer than what you believed true.)</p>
<p>A data warehouse is not a data lake. An OLAP cube is not a data lake either. Hadoop is not a data lake. (It’s just a technology stack to build one.) In fact, nobody knows what a data lake is or why they’d need it until they’ve seen one.</p>
Yes, but I'm a data scientist...2016-11-29T00:00:00+01:00http://fabian-kostadinov.github.io/2016/11/29/yes-but-i-m-a-data-scientist<p>Being a data scientist is no excuse for writing sloppy code. Yeah, I know that Java is not your first coding language, but you should really not write spaghetti code.<span class="more"></span></p>
<p>Being a data scientist does not mean you don’t have to check in your code to Git. Oh, and this definitely also includes your iPython notebooks. Yes, that’s considered code too.</p>
<p>There is no need to demonstrate your superior Python skills to everyone else in the company by writing low-level code because you absolutely cannot make do with the standard libraries. Use the libraries, you fool.</p>
<p>Yet, of course, being a data scientist does not mean you should use your favourite open source library just because it’s so much cooler/faster/better than anyone else’s. Have you ever considered the legal implications of using a GNU licensed library in your enterprise environment? So you’re certain your code will never leave the enterprise?</p>
<p>Just because you love R it’s still not a good idea to call an R function from within productive Java code. Really, it isn’t.</p>
<p>Learn your language. R is not per se fast just because it contains lots of fancy, advanced statistical functions that other languages don’t have. There are actually very few people who really understand what’s going on in a typical piece of R code.</p>
<p>Being a data scientist you should actually be concerned about your software engineering colleagues. They’re not dumb asses, they just studied a different subject at university. Just because you know the maths behind SVMs better than they do does not mean you should build them in Python using low-level functions. There are alternatives around that will make the life of your software engineering colleagues much easier, because then they don’t have to rewrite your entire code in a different language like Java.</p>
<p>Creating a model does not mean it won’t have to be maintained.</p>
<p>There’s actually a reason why such things as application servers exist, even if you might not know them.</p>
<p>HDFS and Spark are not solution architecture.</p>
<p>You should learn Scala.</p>
<p>Like it or not, being a data scientist does not mean you will never have to sell, promote and explain your work to a bunch of ignorant managers. Just keep in mind, they’re the ones who decide about your salary and promotion. Not you.</p>
How to implement a text mining engine2016-04-18T00:00:00+02:00http://fabian-kostadinov.github.io/2016/04/18/how-to-implement-a-text-mining-engine<p>There are various text mining libraries, packages and tools available, many of them as freeware. Yet, when it comes to putting it all together in an enterprise environment, there is actually not too much information available on the web. This article is about how I would design a general-purpose text mining engine that is fit for today’s standard Java-stack enterprise environment and the typical problems one encounters in these environments. A lot of what I write below is from hands-on experience with existing tools and the typical difficulties I had.<span class="more"></span></p>
<h1 id="tasks-and-components">Tasks and components</h1>
<p>Text mining tasks typically follow a similar pattern.</p>
<ol>
<li>Data must be obtained from a data source, for example a file system or document database.</li>
<li>Sometimes, the original documents have only been scanned and saved as image files (e.g. TIFF files, or PDF/Word docs containing images). In such situations, OCRing (<a href="https://en.wikipedia.org/wiki/Optical_character_recognition">optical character recognition</a>) is required to preprocess the document into text data.</li>
<li>Often, the input files are not stored as plain text data, but must first be converted from an original file format (often PDF or Word) into plain text files.</li>
<li>Filtering is applied to remove undesired text. For example, in HTML/XML files all markup is removed.</li>
<li>Once plain/raw text data is obtained, the text is tokenized with the support of a text mining library. The unit of work can be individual words, sentences, or paragraphs.</li>
<li>The core text mining tasks are executed on the tokens. This is highly specific to the problem at hand. Typical tasks involve entity extraction, calculating term frequency/inverse document frequency (TF-IDF) measures as an input to algorithms like bag of words, and so on (a small TF-IDF sketch follows after this list).</li>
<li>An output string containing the relevant information is assembled. This is either stored somewhere or sent back to the originator system as a response.</li>
</ol>
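<p>As a small illustration of step 6, here is one common TF-IDF variant in plain Java. Note that the exact weighting scheme varies between libraries, and this sketch assumes the term occurs in at least one corpus document:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java">import java.util.Collections;
import java.util.List;

public class TfIdf {
    // tf(t, d): relative frequency of term t in document d.
    // idf(t): log(total number of documents / documents containing t).
    public static double tfIdf(List<String> doc, List<List<String>> corpus, String term) {
        double tf = Collections.frequency(doc, term) / (double) doc.size();
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        double idf = Math.log(corpus.size() / (double) docsWithTerm);
        return tf * idf;
    }
}</code></pre></figure>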
<p>From the tasks above we can deduce some of the components our text mining engine must contain.</p>
<ol>
<li>A component to receive data (push), or load it (pull) from somewhere. After processing we might also want to store or send processing output somewhere.</li>
<li>Possibly an OCRing component.</li>
<li>A component that extracts raw text from various file formats.</li>
<li>A filtering or text cleanup component.</li>
<li>A tokenization component.</li>
<li>The core text mining component(s).</li>
<li>A component assembling a response that can be sent to the client.</li>
<li>A component that glues all of this together.</li>
</ol>
<p>Furthermore, we need to be able to handle irregular demand patterns. Especially OCRing uses a lot of CPU power, and we cannot fill up our server endlessly with additional load. Hence, we also want to use a queue where we can enter new text mining tasks, which can be polled regularly for a status update. A queue also allows us to get away from synchronous request handling, i.e. to shut down network connections in between sending a task and receiving a response.</p>
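<p>A minimal in-memory sketch of this decoupling could look as follows; all class and variable names are illustrative only, and in a real deployment a persistent queue (see the table further below) would take the place of the <code>BlockingQueue</code>:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java">import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
// 'TextMiningTask', 'task' and 'pipeline' are assumed to exist elsewhere.
// Bounded queue: submission fails fast instead of overloading the server.
BlockingQueue<TextMiningTask> queue = new LinkedBlockingQueue<>(1000);
// Producer side (e.g. the REST servlet): enqueue and return immediately.
boolean accepted = queue.offer(task);
// Consumer side: a fixed worker pool drains the queue at its own pace.
ExecutorService workers = Executors.newFixedThreadPool(4);
workers.submit(() -> {
    try {
        while (true) {
            TextMiningTask next = queue.take(); // blocks until work arrives
            pipeline.process(next);             // hypothetical processing pipeline
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();     // orderly worker shutdown
    }
});</code></pre></figure>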
<p>These are the components/tasks we identified so far.
<img src="/public/img/2016-04-18-text-mining-components-and-tasks.png" alt="Text mining components and tasks" title="Text mining components and tasks" /></p>
<h1 id="architecture">Architecture</h1>
<h2 id="standard-jee-architecture">Standard JEE architecture</h2>
<p>It should already be obvious that there are several software design patterns which could potentially be useful in such a situation, for example:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Chain-of-responsibility_pattern">chain of responsibility pattern</a>,</li>
<li><a href="https://en.wikipedia.org/wiki/Pipeline_(software)">pipeline and filter pattern</a> which is closely related to chain of responsibility,</li>
<li><a href="https://en.wikipedia.org/wiki/Decorator_pattern">decorator pattern</a> as a potential alternative to pipeline and filter or chain of responsibility,</li>
<li>parameter object pattern to pass values between various components/tasks,</li>
<li><a href="https://en.wikipedia.org/wiki/Builder_pattern">builder pattern</a> to build the pipeline of components/tasks.</li>
</ul>
<p>The next chart shows how I would design such a system relying on a rather classical JEE architecture.</p>
<p><img src="/public/img/2016-04-18-architecture1.png" alt="Classical JEE architecture" title="Classical JEE architecture" /></p>
<p>The server offers a REST API (servlet) to the client. A business delegate might be used in the client to cover up the connection implementation details. Between client and server one could add a load balancer to scale up horizontally. Furthermore, a queue is added somewhere in the stack (it could also be inside the JEE container). Tasks are first added to the queue before being processed further. The queue service can be polled to obtain an up-to-date status of one’s task, or the response object once the task has completed.
A pipeline and filter pattern is used for the gray components. They all implement the same interface:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">Processor</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">process</span><span class="o">(</span><span class="nc">DocumentContext</span> <span class="n">ctx</span><span class="o">);</span>
<span class="o">}</span></code></pre></figure>
<p>A parameter object pattern is used to pass one object from one processor to the next. Class <code>DocumentContext</code> could for example contain a map of key/value pairs, where the key is the name or ID of the particular <code>Processor</code> and the value the processing output. Each processor simply adds new output to the <code>DocumentContext</code>. Prior output is only deleted in exceptional cases (e.g. throwing away binary data that was required as an input to the OCRing processor).<br />
The second JEE container is simply a second identical setup running in a different virtual machine to handle peak loads through the load balancer. More VMs could of course be set up.</p>
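<p>A minimal sketch of such a <code>DocumentContext</code> could look as follows (the accessor names are of course up to you):</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java">import java.util.HashMap;
import java.util.Map;

// Parameter object passed from one Processor to the next. Each processor
// stores its output under its own ID; prior output is normally kept.
public class DocumentContext {
    private final Map<String, Object> results = new HashMap<>();

    public void put(String processorId, Object output) { results.put(processorId, output); }
    public Object get(String processorId) { return results.get(processorId); }
    public void remove(String processorId) { results.remove(processorId); } // e.g. drop OCR input binaries
}</code></pre></figure>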
<h2 id="processors">Processors</h2>
<p>Every <code>Processor</code> should be implemented as a simple Java bean object, i.e. it should have an empty constructor plus getters and setters for every object property. A dependency injection framework like Spring is used to “glue” all Processors together. Either every <code>Processor</code> keeps a reference to the next in command, or all messages are passed solely through the embracing framework. I personally prefer the second option for reasons that will hopefully become more clear later on. In general I highly recommend implementing each processor without state. A stateless processor offers many advantages over stateful designs. The most important advantage is the inherent thread-safety. Running multiple text mining jobs in parallel in the same JEE application requires the code to be safe for multi-threading, which can be achieved much more easily with stateless designs.</p>
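<p>As an example, a stateless markup filter could look like this; the stage IDs <code>"textExtraction"</code> and <code>"markupFilter"</code> are illustrative:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java">// Stateless: no mutable fields, so a single shared instance is thread-safe
// and can serve many text mining jobs in parallel.
public class MarkupFilterProcessor implements Processor {
    public void process(DocumentContext ctx) {
        String rawText = (String) ctx.get("textExtraction"); // output of the previous stage
        String cleaned = rawText.replaceAll("<[^>]+>", "");  // strip HTML/XML markup
        ctx.put("markupFilter", cleaned);
    }
}</code></pre></figure>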
<p>This is a non-exhaustive list of open-source projects delivering value for the different processors.</p>
<table>
<tr>
<th>Component/Processor</th>
<th>Open Source Project</th>
</tr>
<tr>
<td>Queue</td>
<td><a href="https://www.rabbitmq.com/">RabbitMQ</a>, <a href="http://kafka.apache.org/">Kafka</a>, <a href="http://activemq.apache.org/">ActiveMQ</a>, <a href="https://github.com/twitter-archive/kestrel">Kestrel</a></td>
</tr>
<tr>
<td>Dependency Injection</td>
<td><a href="http://spring.io/">Spring</a>, <a href="https://github.com/codingwell/scala-guice/">Guice</a></td>
</tr>
<tr>
<td>Pipeline and filter</td>
<td><a href="http://commons.apache.org/sandbox/commons-pipeline/">Apache Commons Pipeline</a>, <a href="http://camel.apache.org/">Apache Camel</a></td>
</tr>
<tr>
<td>OCR</td>
<td><a href="https://github.com/tesseract-ocr">Google Tesseract</a></td>
</tr>
<tr>
<td>Text extraction</td>
<td><a href="https://tika.apache.org/">Apache Tika</a></td>
</tr>
<tr>
<td>Tokenization & Text Mining</td>
<td><a href="https://opennlp.apache.org/">OpenNLP</a>, <a href="http://nlp.stanford.edu/software/">Stanford NLP</a>, <a href="http://mallet.cs.umass.edu/">Mallet</a>, <a href="http://www.cs.waikato.ac.nz/ml/weka/">Weka</a><a href="http://www.nltk.org/">Natural Language Toolkit (NLTK, Python)</a></td>
</tr>
</table>
<h2 id="to-osgi-or-not">To OSGi or not</h2>
<p>It would also be possible to make every <code>Processor</code> an <a href="https://en.wikipedia.org/wiki/OSGi">OSGi bundle</a> to allow dynamic updating of existing bundles. This approach was actually taken in the <a href="https://stanbol.apache.org/">Apache Stanbol</a> project. However, I’m not so much convinced of this architectural choice. Why exactly should one be able to update text mining modules on the fly during runtime? If it were just a few parameters per text mining module that might change during runtime, then implementing OSGi for every processor is certainly a huge overhead; there are better ways to handle changing parameters. If you work in an organization with a typical JEE technology stack, then most probably you also have the three environments for development, testing and production. New text mining modules require thorough testing and an official rollout, after all it’s productive code! Of course, rolling out without downtime is a nice idea, but isn’t hot deployment nowadays a standard feature of most modern JEE containers anyway? What do you really need OSGi for? And if you’re still not convinced that OSGi adds little value but much unnecessary complexity to your application, consider this: Assume a running and unfinished text mining process. While the process is still ongoing you initiate an OSGi bundle replacement (or hot deployment, for that matter…). What is the exact semantics of your running text mining task now? What is supposed to happen with it? Not properly shutting down your application before deployment might actually complicate things significantly. In case you have a persistent queue in place that can handle a bit of extra load, you could simply shut down your server temporarily and let the queue fill up with new tasks without anyone complaining. (I didn’t say this is generally a good idea, though.)</p>
<h2 id="comparison-with-luxid-annotation-server">Comparison with Luxid Annotation Server</h2>
<p>The architecture above is actually similar to <a href="http://www.expertsystem.com/products/luxid-annotation-server/">Expert System’s (formerly Temis) Luxid annotation server</a>. A significant difference is however that in Luxid the core server - called <em>information discovery and extraction (IDE) server</em> - is implemented as a black box. The Luxid IDE server contains an internal (embedded) Jetty server that is not really accessible to a developer. As a consequence it is not possible to run the Luxid IDE server inside another JEE container, the IDE server is always standalone. For this reason many container-managed features are not available to Luxid developers, and possibilities for performance tuning are limited to what was exposed by the vendor beforehand.</p>
<h2 id="modular-architecture-using-apache-camel">Modular architecture using Apache Camel</h2>
<p>The above architecture is not optimal yet. For example, we might want to run OCRing completely separately from all other text mining tasks. Also, a very typical problem in text mining is to load text documents from a source system (database, file system, FTP server, website etc.) and possibly write data back to a target system. And what about implementing non-Java processors? So your text mining expert has developed some code in Python. How can we call NLTK (Python) code to obtain text mining results? In short, we really would like to decouple the various processors even further, so dependency injection is not yet sufficient for what we aim to achieve.<br />
Luckily, there is one framework particularly well suited in our case: <a href="http://camel.apache.org/">Apache Camel</a>. Apache Camel is at its core a routing engine accepting input from a very long list of protocols such as HTTP, FTP, NTFS, etc. This allows loading text files even from remote sources with relative ease, for example loading data from an FTP server and then sending the processed output to a REST API. Apache Camel uses several of the above mentioned design patterns (dependency injection, builder pattern, parameter object etc.) in combination to achieve loosely coupled systems that are easily changed. It allows the combination of several processors by configuration only (with only little coding required). It can easily be run both inside a standard JEE container and outside in a standalone application. It can call services in remote systems and hence also integrate non-Java applications.</p>
<p><img src="/public/img/2016-04-18-architecture2.png" alt="Apache Camel Architecture" title="Apache Camel Architecture" /></p>
<p>As we can see, this architecture is more modular. I have separated the OCR process from the other processor components but left all the other components in the same JEE container where Apache Camel itself resides. This is just a suggestion; of course each processor could run entirely in its own container, and other combinations are possible as well. Whether or not we still need a queue and a load balancer, and where exactly these should be located in the architecture, depends a lot on the use case. Apache Camel can be integrated well with different queues.</p>
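<p>To give an impression of how little code such a route requires, here is a hypothetical sketch. The endpoint URIs are made up, and <code>OcrProcessor</code> and <code>TextExtractionProcessor</code> would implement Camel’s own <code>org.apache.camel.Processor</code> interface:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java">// Inside a RouteBuilder's configure() method: poll documents from an FTP
// inbox, run them through two processing stages, and post the result to a
// REST endpoint.
from("ftp://miner@ftp.example.com/inbox?password=secret&noop=true")
    .process(new OcrProcessor())
    .process(new TextExtractionProcessor())
    .to("http://textmining.example.com/api/results");</code></pre></figure>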
Reading from and writing to files in Apache Camel2016-01-10T00:00:00+01:00http://fabian-kostadinov.github.io/2016/01/10/reading-from-and-writing-to-files-in-apache-camel<p>I had assumed that reading from and writing to files in <a href="http://camel.apache.org/">Apache Camel v2.16.1</a> would be a straightforward thing to accomplish. Turns out I was wrong. It took me quite a while to figure out the correct syntax of the <code>from</code> and <code>to</code> commands.<span class="more"></span></p>
<h1 id="reading-a-single-text-file">Reading a single text file</h1>
<p>Before we can use Apache Camel, we need to import it in our pom.xml Maven file:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="nt"><dependency></span>
<span class="nt"><groupId></span>org.apache.camel<span class="nt"></groupId></span>
<span class="nt"><artifactId></span>camel-stream<span class="nt"></artifactId></span>
<span class="nt"><version></span>2.16.1<span class="nt"></version></span>
<span class="nt"></dependency></span></code></pre></figure>
<p>There are various ways to read files in Apache Camel. If the files are in plain text format the <code>org.apache.camel.builder.RouteBuilder</code>’s <code>from</code> method is probably the best choice. The <code>from</code> method is overloaded:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="nc">RouteDefinition</span> <span class="nf">from</span><span class="o">(</span><span class="nc">Endpoint</span><span class="o">...</span> <span class="n">endpoints</span><span class="o">)</span>
<span class="kd">public</span> <span class="nc">RouteDefinition</span> <span class="nf">from</span><span class="o">(</span><span class="nc">Endpoint</span> <span class="n">endpoint</span><span class="o">)</span>
<span class="kd">public</span> <span class="nc">RouteDefinition</span> <span class="nf">from</span><span class="o">(</span><span class="nc">String</span><span class="o">...</span> <span class="n">uris</span><span class="o">)</span>
<span class="kd">public</span> <span class="nc">RouteDefinition</span> <span class="nf">from</span><span class="o">(</span><span class="nc">String</span> <span class="n">uri</span><span class="o">)</span></code></pre></figure>
<p>Furthermore, there is also a <code>fromF</code> method. I won’t go into details about it:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="nc">RouteDefinition</span> <span class="nf">fromF</span><span class="o">(</span><span class="nc">String</span> <span class="n">uri</span><span class="o">,</span> <span class="nc">Object</span><span class="o">...</span> <span class="n">args</span><span class="o">)</span></code></pre></figure>
<p>The RouteBuilder is closely linked with the <code>org.apache.camel.model.RouteDefinition</code> class. It offers a similar interface concerning the <code>from</code> method, but beyond that also has further support for REST APIs:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="nc">RouteDefinition</span> <span class="nf">fromRest</span><span class="o">(</span><span class="nc">String</span> <span class="n">uri</span><span class="o">)</span></code></pre></figure>
<p>Unfortunately, the API docs do not explain a lot. Let’s assume we wanted to read from a file <code>C:\in\MyFile.txt</code>. Let’s be very naive and think that we could actually simply provide the file path to the <code>from</code> (and <code>to</code>) method.</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kn">import</span> <span class="nn">org.apache.camel.CamelContext</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.camel.builder.RouteBuilder</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.camel.impl.DefaultCamelContext</span><span class="o">;</span>
<span class="c1">// ...</span>
<span class="nc">CamelContext</span> <span class="n">ctx</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">DefaultCamelContext</span><span class="o">();</span>
<span class="nc">RouteBuilder</span> <span class="n">route</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">RouteBuilder</span><span class="o">()</span> <span class="o">{</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">configure</span><span class="o">()</span> <span class="kd">throws</span> <span class="nc">Exception</span> <span class="o">{</span>
<span class="n">from</span><span class="o">(</span><span class="s">"C:\\in\\MyFile.txt"</span><span class="o">)</span>
<span class="o">.</span><span class="na">to</span><span class="o">(</span><span class="s">"C:\\out\\MyFile.txt"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">};</span>
<span class="n">ctx</span><span class="o">.</span><span class="na">addRoutes</span><span class="o">(</span><span class="n">route</span><span class="o">);</span>
<span class="n">ctx</span><span class="o">.</span><span class="na">start</span><span class="o">();</span>
<span class="c1">// Maybe sleep a little here</span>
<span class="c1">// Thread.sleep(4000);</span>
<span class="n">ctx</span><span class="o">.</span><span class="na">stop</span><span class="o">();</span></code></pre></figure>
<p>What happens when we execute this code? Actually nothing. The code is executed, but nothing is written to the output directory. No exceptions are thrown, not even a warning message is logged. Not quite what we expected, right?</p>
<p>Looking at the API again, we realize that what is needed is actually not a file path but a file URI. Now, being naive again, we look up the <a href="https://en.wikipedia.org/wiki/File_URI_scheme">Wikipedia article on file URI schemes</a>. Obviously, we forgot to provide the required <code>file://</code> URI prefix. So, let’s try again (omitting some code for brevity).</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">configure</span><span class="o">()</span> <span class="kd">throws</span> <span class="nc">Exception</span> <span class="o">{</span>
<span class="n">from</span><span class="o">(</span><span class="s">"file://C:\\in\\MyFile.txt"</span><span class="o">)</span>
<span class="o">.</span><span class="na">to</span><span class="o">(</span><span class="s">"file://C:\\out\\MyFile.txt"</span><span class="o">);</span>
<span class="o">}</span></code></pre></figure>
<p>Still does not work. Again, no exception, no warning messages. What’s wrong here? Do we need a third slash, i.e. <code>file:///</code>?</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">configure</span><span class="o">()</span> <span class="kd">throws</span> <span class="nc">Exception</span> <span class="o">{</span>
<span class="n">from</span><span class="o">(</span><span class="s">"file:///C:\\in\\MyFile.txt"</span><span class="o">)</span>
<span class="o">.</span><span class="na">to</span><span class="o">(</span><span class="s">"file:///C:\\out\\MyFile.txt"</span><span class="o">);</span>
<span class="o">}</span></code></pre></figure>
<p>Nope, still no success. Maybe double backslashes in file paths are not properly parsed? Next try:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">configure</span><span class="o">()</span> <span class="kd">throws</span> <span class="nc">Exception</span> <span class="o">{</span>
<span class="n">from</span><span class="o">(</span><span class="s">"file://C:/in/MyFile.txt"</span><span class="o">)</span>
<span class="o">.</span><span class="na">to</span><span class="o">(</span><span class="s">"file://C:/out/MyFile.txt"</span><span class="o">);</span>
<span class="o">}</span></code></pre></figure>
<p>Same result again. This is getting frustrating. All the API documentation of class RouteBuilder says is:</p>
<blockquote>A <a href="http://camel.apache.org/dsl.html">Java DSL</a> which is used to build DefaultRoute instances in a CamelContext for smart routing.</blockquote>
<h1 id="resources">Resources</h1>
<p>Looking up the website for the <a href="http://camel.apache.org/java-dsl.html">Java DSL</a> docs does not give a clear hint either. There exists also a <a href="http://camel.apache.org/manual/camel-manual-2.16.1.html">long manual</a>, but we don’t find a lot there either. And finally, there exists this <a href="http://people.apache.org/~dkulp/camel/file2.html">documentation on the File2 component</a>, which you need to read very carefully to figure out the proper syntax. There’s an article on <a href="http://www.javacodegeeks.com/2013/09/exploring-apache-camel-core-file-component.html">how to create a file poller and process large files</a>. There’s also <a href="http://kevinboone.net/cameltest.html">this article</a>, which essentially does not say anything beyond what we already know. If you look around a little you may even find the complete book <em>Apache Camel in Action</em> on the internet; nevertheless, things stay obscure.</p>
<h1 id="working-solution">Working solution</h1>
<p>Fast forward. Here’s the working solution. As it turns out, Apache Camel does <strong>not</strong> use traditional file URIs but uses its own non-standard file URI format. The trick is to specify the filename as a separate parameter added at the end of the directory path.</p>
<p><em>file:// + <directory path> + ? + fileName= + <filename> + & + <other optional key=value params></em></p>
<p>For example, if the filename is <em>C:\in\MyFile.txt</em>, then the URI would look like one of these (both are valid):</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="nl">file:</span><span class="c1">//C:/in/?fileName=MyFile.txt</span>
<span class="nl">file:</span><span class="c1">//C:\\in\\?fileName=MyFile.txt</span></code></pre></figure>
<p>Let’s add a charset parameter to specify the file encoding to be used:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="nl">file:</span><span class="c1">//C:/in/?fileName=MyFile.txt&charset=utf-8</span></code></pre></figure>
<p>Here’s the full example:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kn">import</span> <span class="nn">org.apache.camel.CamelContext</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.camel.builder.RouteBuilder</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.camel.impl.DefaultCamelContext</span><span class="o">;</span>
<span class="c1">// ...</span>
<span class="nc">CamelContext</span> <span class="n">ctx</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">DefaultCamelContext</span><span class="o">();</span>
<span class="nc">RouteBuilder</span> <span class="n">route</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">RouteBuilder</span><span class="o">()</span> <span class="o">{</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">configure</span><span class="o">()</span> <span class="kd">throws</span> <span class="nc">Exception</span> <span class="o">{</span>
<span class="n">from</span><span class="o">(</span><span class="s">"file://C:/in/?fileName=MyFile.txt&charset=utf-8"</span><span class="o">)</span>
<span class="o">.</span><span class="na">to</span><span class="o">(</span><span class="s">"file://C:/out/?fileName=MyFile.txt&charset=utf-8"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">};</span>
<span class="n">ctx</span><span class="o">.</span><span class="na">addRoutes</span><span class="o">(</span><span class="n">route</span><span class="o">);</span>
<span class="n">ctx</span><span class="o">.</span><span class="na">start</span><span class="o">();</span>
<span class="c1">// Maybe sleep a little here</span>
<span class="c1">// Thread.sleep(4000);</span>
<span class="n">ctx</span><span class="o">.</span><span class="na">stop</span><span class="o">();</span></code></pre></figure>
<h1 id="nooptrue">Noop=true</h1>
<p>Running this example, we observe something interesting. By default, Apache Camel takes the following sequence of steps:</p>
<ol>
<li>Read the input file <em>C:/in/MyFile.txt</em>.</li>
<li>Once read, create a new folder <em>.camel</em> inside the input directory and move the input file into this new directory.</li>
<li>If the output file does not yet exist, create a new one in the output directory. Otherwise, overwrite the existing one.</li>
<li>Write the output file.</li>
</ol>
<p>If you don’t find this behavior useful, then you can adapt it. Let’s tell Apache Camel not to create a <em>.camel</em> directory in the input folder but simply leave the input files as they are. This can be achieved by appending the <code>noop=true</code> parameter.</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">configure</span><span class="o">()</span> <span class="kd">throws</span> <span class="nc">Exception</span> <span class="o">{</span>
<span class="n">from</span><span class="o">(</span><span class="s">"file://C:/in/?fileName=MyFile.txt&charset=utf-8&noop=true"</span><span class="o">)</span>
<span class="o">.</span><span class="na">to</span><span class="o">(</span><span class="s">"file://C:/out/?fileName=MyFile.txt&charset=utf-8"</span><span class="o">);</span>
<span class="o">}</span></code></pre></figure>
<p>There are many more parameters to be used, and they can be looked up in the <a href="http://people.apache.org/~dkulp/camel/file2.html">documentation of the File2 component</a> mentioned above already.</p>
<p>The good news is, this approach even works for non-text files. Let’s assume you need to read from one PDF file and write it to the output directory.</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">configure</span><span class="o">()</span> <span class="kd">throws</span> <span class="nc">Exception</span> <span class="o">{</span>
<span class="n">from</span><span class="o">(</span><span class="s">"file://C:/in/?fileName=MyFile.pdf&noop=true"</span><span class="o">)</span>
<span class="o">.</span><span class="na">to</span><span class="o">(</span><span class="s">"file://C:/out/?fileName=MyFile.pdf"</span><span class="o">);</span>
<span class="o">}</span></code></pre></figure>
<p>It’s as easy as this.</p>
<h1 id="handling-distinct-input-and-output-formats">Handling distinct input and output formats</h1>
<p>This is all good as long as you only intend to process files of the same input and output type. But what if your input file type is different from the target output file type? Neither the core nor the File2 component of Apache Camel provides direct support for such cases. There are different approaches to solve this, but basically all of them come down to file type conversion. Class <code>org.apache.camel.model.RouteDefinition</code> extends class <code>org.apache.camel.model.ProcessorDefinition</code>. ProcessorDefinition in turn offers the following interesting methods:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="nc">Type</span> <span class="nf">marshal</span><span class="o">(</span><span class="nc">DataFormat</span> <span class="n">dataFormat</span><span class="o">)</span>
<span class="kd">public</span> <span class="nc">Type</span> <span class="nf">marshal</span><span class="o">(</span><span class="nc">DataFormatDefinition</span> <span class="n">dataFormatDefinition</span><span class="o">)</span>
<span class="kd">public</span> <span class="nc">Type</span> <span class="nf">marshal</span><span class="o">(</span><span class="nc">String</span> <span class="n">dataTypeRef</span><span class="o">)</span>
<span class="kd">public</span> <span class="nc">DataFormatClause</span><span class="o"><</span><span class="nc">ProcessorDefinition</span><span class="o"><</span><span class="nc">Type</span><span class="o">>></span> <span class="nf">unmarshal</span><span class="o">()</span>
<span class="kd">public</span> <span class="nc">Type</span> <span class="nf">unmarshal</span><span class="o">(</span><span class="nc">DataFormat</span> <span class="n">dataFormat</span><span class="o">)</span>
<span class="kd">public</span> <span class="nc">Type</span> <span class="nf">unmarshal</span><span class="o">(</span><span class="nc">DataFormatDefinition</span> <span class="n">dataFormatDefinition</span><span class="o">)</span>
<span class="kd">public</span> <span class="nc">Type</span> <span class="nf">unmarshal</span><span class="o">(</span><span class="nc">String</span> <span class="n">dataTypeRef</span><span class="o">)</span></code></pre></figure>
<p>In Apache Camel, a DataFormat is an object that can marshal and unmarshal a payload from one type to another. The <code>DataFormat</code> interface declares only two methods:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kt">void</span> <span class="nf">marshal</span><span class="o">(</span><span class="nc">Exchange</span> <span class="n">exchange</span><span class="o">,</span> <span class="nc">Object</span> <span class="n">graph</span><span class="o">,</span> <span class="nc">OutputStream</span> <span class="n">stream</span><span class="o">)</span> <span class="kd">throws</span> <span class="nc">Exception</span>
<span class="nc">Object</span> <span class="nf">unmarshal</span><span class="o">(</span><span class="nc">Exchange</span> <span class="n">exchange</span><span class="o">,</span> <span class="nc">InputStream</span> <span class="n">stream</span><span class="o">)</span> <span class="kd">throws</span> <span class="nc">Exception</span></code></pre></figure>
<p>It’s your task to implement these methods properly. Once implemented, you can use your version of DataFormat. Imagine you’ve written a PdfTextDataFormat that can convert back and forth between PDF and text files.</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">configure</span><span class="o">()</span> <span class="kd">throws</span> <span class="nc">Exception</span> <span class="o">{</span>
<span class="n">from</span><span class="o">(</span><span class="s">"file://C:/in/?fileName=MyFile.pdf&noop=true"</span><span class="o">)</span>
<span class="o">.</span><span class="na">unmarshal</span><span class="o">(</span><span class="k">new</span> <span class="nc">PdfTextDataFormat</span><span class="o">())</span>
<span class="o">.</span><span class="na">to</span><span class="o">(</span><span class="s">"file://C:/out/?fileName=MyFile.txt"</span><span class="o">);</span>
<span class="o">}</span></code></pre></figure>
<p>Or the other way round:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">configure</span><span class="o">()</span> <span class="kd">throws</span> <span class="nc">Exception</span> <span class="o">{</span>
<span class="n">from</span><span class="o">(</span><span class="s">"file://C:/in/?fileName=MyFile.txt&noop=true"</span><span class="o">)</span>
<span class="o">.</span><span class="na">marshal</span><span class="o">(</span><span class="k">new</span> <span class="nc">PdfTextDataFormat</span><span class="o">())</span>
<span class="o">.</span><span class="na">to</span><span class="o">(</span><span class="s">"file://C:/out/?fileName=MyFile.pdf"</span><span class="o">);</span>
<span class="o">}</span></code></pre></figure>
<p>To implement your PdfTextDataFormat’s unmarshal method you must:</p>
<ol>
<li>read the raw file content from the input stream provided,</li>
<li>convert the raw data to a text string,</li>
<li>set the text string as the body of the exchange’s out message.</li>
</ol>
<p>Your code should look something like this:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kn">import</span> <span class="nn">org.apache.camel.spi.DataFormat</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.commons.io.IOUtils</span><span class="o">;</span>
<span class="kd">public</span> <span class="nc">PdfTextDataFormat</span> <span class="kd">implements</span> <span class="nc">DataFormat</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">marshal</span><span class="o">(</span><span class="nc">Exchange</span> <span class="n">exchange</span><span class="o">,</span> <span class="nc">Object</span> <span class="n">graph</span><span class="o">,</span> <span class="nc">OutputStream</span> <span class="n">stream</span><span class="o">)</span> <span class="o">{</span> <span class="o">...</span> <span class="o">}</span>
<span class="kd">public</span> <span class="nc">Object</span> <span class="nf">unmarshal</span><span class="o">(</span><span class="nc">Exchange</span> <span class="n">exchange</span><span class="o">,</span> <span class="nc">InputStream</span> <span class="n">stream</span><span class="o">)</span> <span class="kd">throws</span> <span class="nc">Exception</span> <span class="o">{</span>
<span class="kt">byte</span><span class="o">[]</span> <span class="n">bytes</span> <span class="o">=</span> <span class="nc">IOUtils</span><span class="o">.</span><span class="na">toByteArray</span><span class="o">(</span><span class="n">stream</span><span class="o">);</span>
<span class="c1">// Use a tool like PDFBox to create text from your bytes.</span>
<span class="nc">String</span> <span class="n">text</span> <span class="o">=</span> <span class="o">...;</span>
<span class="c1">// If we want, we can set the unmarshalled text back into the exchange's out message</span>
<span class="nc">Message</span> <span class="n">out</span> <span class="o">=</span> <span class="n">exchange</span><span class="o">.</span><span class="na">getOut</span><span class="o">();</span>
<span class="n">out</span><span class="o">.</span><span class="na">setBody</span><span class="o">(</span><span class="n">text</span><span class="o">);</span>
<span class="c1">// Don't close input stream here</span>
<span class="k">return</span> <span class="n">text</span><span class="o">;</span>
<span class="o">}</span>
<span class="o">}</span></code></pre></figure>
<p>The marshalling method would probably look something like this:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kn">import</span> <span class="nn">org.apache.camel.spi.DataFormat</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.commons.io.IOUtils</span><span class="o">;</span>
<span class="kd">public</span> <span class="nc">PdfTextDataFormat</span> <span class="kd">implements</span> <span class="nc">DataFormat</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">marshal</span><span class="o">(</span><span class="nc">Exchange</span> <span class="n">exchange</span><span class="o">,</span> <span class="nc">Object</span> <span class="n">graph</span><span class="o">,</span> <span class="nc">OutputStream</span> <span class="n">stream</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// Don't do this: String s = (String) o;</span>
<span class="c1">// Instead, use Camel type converters like this:</span>
<span class="nc">String</span> <span class="n">s</span> <span class="o">=</span> <span class="n">exchange</span><span class="o">.</span><span class="na">getContext</span><span class="o">().</span><span class="na">getTypeConverter</span><span class="o">().</span><span class="na">mandatoryConvertTo</span><span class="o">(</span><span class="nc">String</span><span class="o">.</span><span class="na">class</span><span class="o">,</span> <span class="n">graph</span><span class="o">);</span>
<span class="c1">// Create a PDF document from the string and convert it into a byte array</span>
<span class="kt">byte</span><span class="o">[]</span> <span class="n">bytes</span> <span class="o">=</span> <span class="o">...;</span>
<span class="nc">IOUtils</span><span class="o">.</span><span class="na">write</span><span class="o">(</span><span class="n">bytes</span><span class="o">,</span> <span class="n">stream</span><span class="o">);</span>
<span class="c1">// Don't close output stream here</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="nc">Object</span> <span class="nf">unmarshal</span><span class="o">(</span><span class="nc">Exchange</span> <span class="n">exchange</span><span class="o">,</span> <span class="nc">InputStream</span> <span class="n">stream</span><span class="o">)</span> <span class="kd">throws</span> <span class="nc">Exception</span> <span class="o">{</span> <span class="o">...</span> <span class="o">}</span>
<span class="o">}</span></code></pre></figure>
<p>In case you only want to do (un-)marshalling in one direction but not in both, it may be a better idea to write a converter processor implementing the <code>org.apache.camel.Processor</code> interface.</p>
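<p>Such a converter processor could be sketched as follows. <code>PdfToTextProcessor</code> and its <code>extractText()</code> helper are hypothetical names, and the placeholder body stands in for the real PDF extraction:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java">import org.apache.camel.Exchange;
import org.apache.camel.Processor;

// Hypothetical one-way converter: extractText() is a placeholder that you
// would back with a real PDF library such as PDFBox.
public class PdfToTextProcessor implements Processor {
    public void process(Exchange exchange) throws Exception {
        // Let Camel's type converters give us the file content as raw bytes.
        byte[] bytes = exchange.getIn().getBody(byte[].class);
        exchange.getIn().setBody(extractText(bytes));
    }

    private String extractText(byte[] bytes) throws Exception {
        return new String(bytes, "UTF-8"); // placeholder for the real conversion
    }
}</code></pre></figure>
<p>The processor is then wired into the route with <code>.process(new PdfToTextProcessor())</code> in place of the <code>unmarshal</code> step.</p>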
<p>Fortunately, you don’t really need to build your own PDF-to-text data format. Instead, you may want to use the <a href="https://github.com/wheijke/camel-tika">camel-tika component</a>. This component is able to unmarshal various binary formats (including MS Office documents) to plain text (but cannot marshal them in the opposite direction):</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="nt"><dependendy></span>
<span class="nt"><groupId></span>org.apache.camel<span class="nt"></groupId></span>
<span class="nt"><artifactId></span>camel-tika<span class="nt"></artifactId></span>
<span class="c"><!-- <version>0.2</version> --></span>
<span class="nt"></dependenc></span></code></pre></figure>
<p>You may have to update camel-tika’s pom.xml though, as it seems not to have been updated in a while.</p>
<p>Here’s another blog post on <a href="http://blogs.sourceallies.com/2013/02/getting-started-with-camel-marshalling/">how to do marshalling</a>.</p>
<h1 id="processing-a-directory-of-files">Processing a directory of files</h1>
<p>In case we’d like to process a whole directory of files (without subdirectories), we simply omit the <code>fileName=XYZ</code> parameter.</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">configure</span><span class="o">()</span> <span class="kd">throws</span> <span class="nc">Exception</span> <span class="o">{</span>
<span class="n">from</span><span class="o">(</span><span class="s">"file://C:/in/?noop=true"</span><span class="o">)</span>
<span class="o">.</span><span class="na">to</span><span class="o">(</span><span class="s">"file://C:/out/"</span><span class="o">);</span>
<span class="o">}</span></code></pre></figure>
<p>This route will essentially “copy” all files from <em>C:/in</em> to <em>C:/out</em>. In case the input directory has sub-directories that need to be processed too, we simply add the <code>recursive=true</code> parameter: <code>from("file://C:/in?noop=true&recursive=true")</code>.</p>
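<p>As a closing sketch, the <code>include</code> option (a regular expression matched against file names) can be combined with <code>recursive</code> to copy only certain files from the whole tree:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java">public void configure() throws Exception {
    // Recurse into sub-directories, but only pick up files ending in .txt.
    from("file://C:/in/?noop=true&recursive=true&include=.*\\.txt")
        .to("file://C:/out/");
}</code></pre></figure>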
Temis Luxid 7.0.1 Skill Cartridge Development Cycle2015-10-04T00:00:00+02:00http://fabian-kostadinov.github.io/2015/10/04/temis-luxid701-skill-cartridge-development-cycle<p>Skill cartridges built with Luxid 7 usually contain a mix of customized and standard software artefacts. These artefacts can be data artefacts such as tailored vocabularies or taxonomies, syntactic or similar rules to extract certain types of entities, or they can be a set of configuration files that parameterize the skill cartridge at hand. For this reason, skill cartridges must be treated as productive code and must therefore be subject to a build and deployment process as well as be checked into a version control system. The good news is that Temis has made it really easy to set up your own version of this process. The bad news is that at least in Luxid 7.0.1 there does not seem to be any documentation on the corresponding tools.<span class="more"></span></p>
<h2 id="skill-cartridge-development-life-cycle">Skill Cartridge Development Life-Cycle</h2>
<p>The first thing to understand is the typical development-cycle of a skill cartridge.</p>
<ol>
<li>
<p>The skill cartridge is shipped by the vendor as a zip file. The zip file contains a “bare-bone” or empty skill cartridge plus possibly a file called <em>customization.zip</em>. This file in turn contains one or several default dictionaries that may eventually be loaded into the skill cartridge.</p>
</li>
<li>
<p>Content Enrichment Studio (CES) relies on projects kept inside a (MySQL) database. All of a project’s contents - dictionaries, skill cartridges, sample documents etc - are kept inside the database - not in the file system! When you create a new project, you have to load all the different parts into the database, that is dictionary, skill cartridge and sample documents.</p>
</li>
<li>
<p>Usually, it’s best to first load the knowledge into your project. As said above, skill cartridges like TM360 are shipped with a <em>customization.zip</em> file containing default dictionaries. Simply load this file into your project.</p>
</li>
<li>
<p>Next, you load the empty skill cartridge (e.g. TM360.sca) into your project. At this stage, knowledge and skill cartridge are still separated from each other. Applying the skill cartridge on any documents would not return any dictionary results yet.</p>
</li>
<li>
<p>Hence, you synchronize your skill cartridge with the knowledge in the project. It is now charged, that is, it can extract the knowledge contained in the dictionaries from text.</p>
</li>
<li>
<p>You load your sample documents into your project.</p>
</li>
<li>
<p>You measure the quality of the extraction results of the skill cartridge. If not yet satisfied, you improve the quality of the extraction output. You can also load additional dictionaries to your project (step 3), but afterwards need to synchronize the skill cartridge again (step 5).</p>
</li>
<li>
<p>When satisfied with the extraction results, you choose to export your charged skill cartridge from the database. A build process is initiated that creates a fully self-contained skill cartridge archive (.sca) file, which is nothing else than a zip file with a specific internal folder structure. You can rename the file to .zip and open it if you wish to do so.</p>
</li>
<li>
<p>At this stage you should actually add your skill cartridge to a version control system, or at least those parts that may have changed if the skill cartridge already existed as a CES project.</p>
</li>
<li>
<p>Once you have a .sca file you can simply upload it to your backend annotation server using either a REST API call (see the Luxid 7 REST Annotation API guide v1.1 for the details) or the admin web app.</p>
</li>
</ol>
<p>As we’ll see below there are tools to support you during this development-cycle.</p>
<p>Once a skill cartridge is installed in a backend annotation server, certain parts of the dictionaries may still be subject to manipulation with a text editor, whereas other parts are represented as binary files and cannot be manipulated anymore. In Luxid the parts that still can be manipulated are called <em>externalized</em> knowledge or dictionaries. If you simply change externalized knowledge, not much will happen to the installed skill cartridge. After making a change you have to actively synchronize the skill cartridge with the externalized knowledge; in other words, you have to execute the synchronization step. Some skill cartridges have their own synchronize procedures for this purpose that you can call through a REST API. Be aware that I have never tested this functionality, so I might be wrong in this regard.</p>
<p>As described above, for skill cartridges not installed in the backend annotation server but as parts of CES projects, synchronization is triggered either from inside the CES tools or during the export process. For example in Webstudio go to <em>Project > Configure… > Annotation</em>, then press the <em>Synchronize</em> button.</p>
<h2 id="building-skill-cartridges">Building skill cartridges</h2>
<p>Basically, there are four ways of building your own skill cartridge in Luxid 7. You can</p>
<ul>
<li>use Webstudio,</li>
<li>use Skill Cartridge Builder (SCB) and possibly Annotation Workbench (AWB),</li>
<li>implement your own cartridges in Java or Groovy either as a simple script or as a complete skill cartridge, or</li>
<li>build annotation plans in admin web app by configuration.</li>
</ul>
<h3 id="using-webstudio">Using Webstudio</h3>
<p>As pointed out above, if you build your skill cartridge in Webstudio then exporting your skill cartridge will usually contain at least a vocabulary, lexicon, taxonomy or similar plus the “bare-bone” skill cartridge like STF or TM360.</p>
<p>I believe in Luxid 7.1 the deployment step with Webstudio has become even easier. You can actually deploy your skill cartridge directly from Webstudio to the backend annotation server with a single mouse click. This is cool if your backend annotation server does not run any mission-critical system; it may be fatal otherwise. So be careful. Best to have different strictly separated environments for development, testing and production in place.</p>
<h3 id="using-skill-cartridge-builder">Using Skill Cartridge Builder</h3>
<p>In case you’re working with SCB, then the process is actually similar. Just press the export button and your skill cartridge is built together with all its dependencies. Then upload the .sca file to your backend annotation server.</p>
<h3 id="writing-custom-java-code">Writing custom Java code</h3>
<p>When you want to write your own skill cartridges in Java or Groovy you can either write so-called “scripts” or build whole applications. A script is simply a single Java or Groovy class that extends a predefined annotator class and has no further dependencies (e.g. to other .jar files). It then provides several method stubs where you can fill in your code. The whole architecture is similar to writing Java servlets, however there are important differences. In Luxid 7.0.1 the corresponding Java API is unfortunately not very well documented; you’ll have to specifically ask Temis for more information. For some functionality I wanted to try out, I received an exception message stating that this particular function was not implemented yet. I don’t know about the status in Luxid 7.1, but I expect some advancements in this regard.<br />
Luxid 7.0.1 admin web app still contained an “install script” button that was apparently removed in version 7.1. The documentation guide 1.1 of the REST API does not mention this explicitly, but you actually can install and uninstall a script through a REST API call:</p>
<figure class="highlight"><pre><code class="language-console" data-lang="console"><span class="go">curl -T MyScript.java http://server:8091/temis/v1/annotation/scripts/</span></code></pre></figure>
<p>In case you want to install a machine-learned model (.klm file) instead of a script then the REST URL is slightly different:</p>
<figure class="highlight"><pre><code class="language-console" data-lang="console"><span class="go">curl -T MyModel.klm http://server:8091/temis/v1/annotation/models/</span></code></pre></figure>
<p>If you want to write your own customized skill cartridges I highly recommend having a look at the code samples that are delivered together with every Luxid installation. In a default installation, these code samples can be found in <code>C:\Temis\Luxid7\IDE\doc</code>.</p>
<p>Writing scripts in many cases is not sufficient. Imagine that for any reason you want to open a database connection in your custom Java code. You’ll need a bunch of third-party .jar files to do so plus possibly some configuration files etc. All of these must be packaged and deployed together with your .sca file. These dependencies <em>must</em> be placed in an internal folder called <code>/plugins</code>, otherwise they cannot be found from inside your Java code. Remember that every .sca is fundamentally just a zip file that contains a specific folder structure. With the tools I’ll show you further below it is actually pretty easy to write an <a href="http://ant.apache.org/">Ant</a> or <a href="http://maven.apache.org/">Maven</a> script that automates the build process for this .sca file.</p>
<p>One problem with Webstudio, SCB or admin web app is that when you push the export button you end up with a prebuilt .sca file. This file can sometimes be several hundred MB, because it may contain many different lexica. You don’t want to check in the whole file to version control but only those parts that actually have changed. As I’ve said above, every .sca file is nothing but a zip file. You can always unpack it to a folder in your file system which is part of your version control system. The version control system should then be able to see only the differences between the new and the old version for each file of the unpacked .sca.</p>
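<p>Since every .sca is plain zip, re-packing such a version-controlled folder into a deployable archive needs nothing beyond the JDK. Here is a minimal, hypothetical sketch; the folder layout is assumed and no Temis tooling is involved:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java">import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Re-packs an unpacked skill cartridge folder into a .sca archive,
// which is just a zip file with a specific internal folder structure.
public class ScaPacker {
    public static void pack(Path cartridgeDir, Path scaFile) throws IOException {
        try (OutputStream os = Files.newOutputStream(scaFile);
             ZipOutputStream zip = new ZipOutputStream(os);
             Stream<Path> files = Files.walk(cartridgeDir)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try {
                    // Entry names are relative to the cartridge root, with / separators.
                    String name = cartridgeDir.relativize(file).toString().replace('\\', '/');
                    zip.putNextEntry(new ZipEntry(name));
                    Files.copy(file, zip);
                    zip.closeEntry();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        pack(Paths.get("build/MyCartridge"), Paths.get("build/MyCartridge.sca"));
    }
}</code></pre></figure>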
<h3 id="building-annotation-plans">Building annotation plans</h3>
<p>You can of course also build whole annotation plans in admin web app just by configuring existing annotators. An annotator in an annotation plan can be any combination of these four:</p>
<ul>
<li>a skill cartridge,</li>
<li>a Java/Groovy script or customized skill cartridge,</li>
<li>a machine-learning model plus Analytics/Analytics2/RTF cartridge,</li>
<li>another annotation plan.</li>
</ul>
<p>Once you export your annotation plan, again a .sca file is built. This file contains all dependencies inside. Of course such a file can potentially become very big, especially if your annotation plan relies on other annotation plans. (Just for the record: nope, I did not test whether you can build cyclic dependencies… .) The whole annotation plan is thus self-contained. You can import it to a backend annotation server like every regular skill cartridge.</p>
<h2 id="windows-build--deployment-utilities">Windows build & deployment utilities</h2>
<p>Assuming you really want to have full control over the build process in CES yourself you can use a set of Windows utilities available in <code>C:\Temis\Luxid7\IDE\bin\</code> or <code>C:\Temis\Luxid7Studio\IDE\bin\</code>. [The following information I received from a <a href="http://community.temis.com/group/temis-extranet/forum/-/message_boards/message/152221">discussion thread</a> at the internal <a href="http://community.temis.com">Temis community forum</a>.]</p>
<ul>
<li><em>scinstall.bat</em>: Installs a bare-bone skill cartridge in a CES project. “Bare-bone” here means that the skill cartridge at this moment does not yet contain any vocabulary, taxonomy etc. These must be loaded using the <code>crimport.bat</code> tool.</li>
<li><em>crimport.bat</em>: Imports knowledge into a Studio project, for example a vocabulary, taxonomy etc.</li>
<li><em>crassign.bat</em>: Assigns the skill cartridge to a Studio project.</li>
<li><em>crsynchro.bat</em>: Synchronizes the still empty skill cartridge with the imported knowledge. This step “charges” the vocabulary or taxonomy into the skill cartridge.</li>
<li><em>crexportsca.bat</em>: Builds and exports the skill cartridge as a single .sca file to the file system.</li>
</ul>
<p>In Luxid 7.1 the same functionality is actually accessible through a new REST API to the Studio annotation server. (I don’t know whether 7.0.1 already provided such an API.) As with these Windows batch scripts I could not find any documentation for the Luxid 7.1 Studio REST API. If you have more documentation on both the CES 7.1 REST API or the Windows utilities, I’ll definitely be interested.</p>
<hr />
<h1 id="luxid-701-export-bug">Luxid 7.0.1 export bug</h1>
<p>Please note that in Luxid 7.0.1 there still exists a terrible bug. Whenever you try to export a configured annotation plan from admin web app while a document is currently being annotated using exactly this annotation plan, then this causes a corruption of the annotation plan. As a result subsequent attempts to annotate a document will fail and error messages will be returned - usually IOExceptions or sometimes XeldaExceptions. When you open the admin web app you will see that the annotator that was in use when the export button was pressed seemingly “disappeared” from the annotation plan. Yet, you will still see the skill cartridge being properly installed when calling the backend annotation server’s REST API:</p>
<figure class="highlight"><pre><code class="language-console" data-lang="console"><span class="go">curl http://prod-server:8091/temis/v1/annotation/cartridges.xml</span></code></pre></figure>
<p>Hence, there is a mismatch between what is displayed as a result to the REST API and what is listed in the admin web app. Restarting the server does not solve the problem. The only way to solve this issue is to actually reinstall exactly that skill cartridge that is missing in the admin web app (and then restart the server). I really hope this issue has been solved in Luxid 7.1. I know the Temis engineers were aware of this bug before rolling out Luxid 7.1, so I’m pretty positive that this was addressed although I did not check specifically so far.</p>
Embedding R In A Website2015-09-21T00:00:00+02:00http://fabian-kostadinov.github.io/2015/09/21/embedding-r-in-a-website<p>I wanted to know whether/how it is possible to embed R in a website. Looking around the internet I found a few interesting initiatives, each one dedicated to a slightly different purpose: RStudio, Shiny, Jupyter Notebook, RApache, OpenCPU and RAppArmor.<span class="more"></span></p>
<p><a href="https://www.rstudio.com/">RStudio</a> is probably very well known among R programmers. According to its website, RStudio is an integrated development environment. One of the best features of RStudio in my eyes is the ability to create both markdown and HTML pages, that contain executable R code. If you are familiar with [IPython notebook] then this is nothing new for you. In this way you can easily create protocols of your work, share them with colleagues and publish them in a webpage.<br />
RStudio comes both as a standalone desktop edition and a server edition for projects of bigger scale. Both desktop and server are available as a free open source product and as a commercially licensed product.</p>
<p><a href="http://shiny.rstudio.com/">Shiny</a> I found pretty exciting. It is essentially a web application framework that lets you build interactive web apps quickly and easily without having to have extensive web programming skills. You can either run your Shiny websites in your own web server or use the cloud hosting service provided by the vendor.<br />
Shiny is developed by the same company that sells RStudio. The open source version of Shiny <a href="https://github.com/rstudio/shiny">can be obtained through this GitHub page</a>. I believe there’s also a cloud where you can upload your R web apps.</p>
<p><a href="http://jupyter.org/">Jupyter Notebook</a> looks very impressive to me. Being essentially an extension of the already mentioned IPython Notebook (and thus requiring Python to run) you can write research reports easily, mix comments with executable code etc. Besides Python and R many more languages are supported like Scala, Julia or Haskell.</p>
<p><a href="http://rapache.net/">rApache</a> enables you to run R inside your Apache web server in a script-like fashion. rApache distributes the Apache module <em>mod_R</em> for this purpose. Therefore you have a neat integration of R with the probably most relevant open source web server. It’s not entirely clear to me how secure this is, therefore you should probably take the usual precautions before opening a webpage to the whole world that lets the users execute such code on your server.</p>
<p><a href="https://www.opencpu.org/">OpenCPU</a> is both an API plus server. On the one hand, it defines a REST-API that enables you to call R code on any server implementing this standard. As it is RESTful, you can easily create HTTP requests executing your R commands in an intuitive way. On the other hand, OpenCPU is also a server. The server can be set up on top of for example rApache or RStudio server. It handles object serialization, security, resource control and more. <a href="http://arxiv.org/abs/1406.4806">An overview on OpenCPU was published in this paper on arxiv.org</a>, there is also <a href="http://jeroenooms.github.io/opencpu-slides/#1">a slide show giving a summary</a>.</p>
<p><a href="https://cran.r-project.org/web/packages/RAppArmor/index.html">RAppArmor</a> is a [CRAN package] that aims at securing your R execution engine. Whereas the concept of sandboxing has existed for example for the Java Virtual Machine already from the beginning, the same is not the case for R. As soon as you run R on the server and open it up for the rest of the world, you of course would like to have a certain level of control over resources and execution rights. RAppArmor was built for this purpose.</p>
<p>Then, of course, there are the cloud providers.</p>
<p><a href="http://www.r-fiddle.org/">R-Fiddle</a> is an online R. The really cool thing is that you can write some code and then embed it in your own webpage in a <a href="http://www.w3schools.com/tags/tag_iframe.asp">HTML iframe</a>. This is maybe the fastest and easiest way of embedding R in a webpage. Of course there are limitations to what can be done, but for not too complicated examples this should work just fine. Also see <a href="http://www.r-bloggers.com/r-fiddle-an-online-playground-for-r-code/">this blog post on R-Fiddle</a>.</p>
<p>A website similar to R-Fiddle, which however requires setting up a username & password, is <a href="https://www.getdatajoy.com/">DataJoy</a>. You can write online scripts with R or Python.</p>
<p><a href="http://www.sagemath.com/">SageMath</a> apparently is similar to DataJoy. You’ll need to <a href="https://cloud.sagemath.com/">register with your name, create a username and password</a> before you’re able to write R code in your browser and run it in the remote cloud.</p>
<p>A fourth alternative is <a href="http://www.tutorialspoint.com/codingground.htm">provided by Tutorials Point</a>. However, I did not easily spot how to create plots, and typing commands in the shell seemed to be rather slow.</p>
<p>Of course there’s even more going on with projects like <a href="http://rpy.sourceforge.net/rpy2.html">rpy2</a> or <a href="http://dirk.eddelbuettel.com/blog/2014/10/23/">Rocker</a>, but most of them do not explicitly target integration with the web. Also of interest is perhaps <a href="https://github.com/sckott/analogsea">Analog Sea</a> for <a href="https://www.digitalocean.com/">Digital Ocean</a>. Analog Sea is freeware, however Digital Ocean is not. Digital Ocean is a commercial online computing cloud. You can create and destroy so called “droplets” (remote virtual computers) that execute your code. It’s thus somewhat similar to the Amazon Cloud. Then there are free online courses to learn R offered by <a href="https://www.datacamp.com/">DataCamp</a> or <a href="https://www.codeschool.com/">Code School</a>.</p>
Temis Luxid 7.0.1 Webstudio Tutorial2015-09-02T00:00:00+02:00http://fabian-kostadinov.github.io/2015/09/02/temis-luxid701-webstudio-tutorial<p>As I was not able to find any tutorials on the web on how to use Temis Luxid 7.0.1 Webstudio, I simply decided to write my own. Luxid Webstudio is a tool that is intended for different use cases. One thing it does very well is to assist a taxonomy expert in building a new taxonomy or enriching an existing one with new terms. Furthermore, once a taxonomy is created it can be “plugged in” to the STF skill cartridge, which then is able to extract all the taxonomy terms from documents. By exporting this customized skill cartridge from Webstudio, you can simply deploy it to a dedicated annotation server running in your production environment. Some of Webstudio’s functionality overlaps with the Eclipse-RCP based Luxid 7.0.1 <em>Annotation Workbench</em>, however Webstudio is simply more comfortable to use. Only in some cases is it necessary to switch to Annotation Workbench because it exposes even more functionality to the user than Webstudio does.<span class="more"></span></p>
<p>The work performed with Webstudio is organized in projects, and each project is stored inside the same MySQL database. Other Luxid tools like Annotation Workbench or Rich Annotation Pad (which is nothing else than a subset of Annotation Workbench) access the same database and therefore you can work on the same data and projects although the tools at their surface look quite different.</p>
<p>In this tutorial you’ll learn…</p>
<ul>
<li>how to import an existing taxonomy,</li>
<li>enrich it with new terms,</li>
<li>and finally export and deploy it to a production server.</li>
</ul>
<p><strong>Step 1</strong><br />
Log in to Luxid 7.0.1 Webstudio. By default, Webstudio is available at http://dev-server:8060/LuxidStudio, where <em>dev-server</em> of course refers to the server where Luxid Content Enrichment Studio is stored.</p>
<p><strong>Step 2</strong><br />
Create a new project, chose any name you want. In the background, projects are actually stored in a MySQL database.</p>
<p><strong>Step 3</strong><br />
Usually, when working with Webstudio, there are two steps to be taken.</p>
<ol>
<li>Install the skill cartridge you want to use in your project.</li>
<li>Upload sample documents you want to work with.</li>
</ol>
<p><strong>Step 4</strong><br />
In this tutorial we’ll assume that you will use the STF (Suggested Term Finder) skill cartridge. The STF skill cartridge (together with the IncreaseSTFPrecision cartridge) is already installed by default in every new Luxid 7.0.1 Webstudio project. When you go to <em>> Project > Configure… > Annotation</em> you actually see the annotation plan with all annotators contained therein.</p>
<p>The <em>STF</em> annotator aims at finding terms already contained in a taxonomy in a corpus of documents.</p>
<p>The <em>IncreaseSTFPrecision</em> annotator takes the STF’s output and removes certain extracted terms from it. Assume there is a taxonomy concerning animals. This taxonomy includes the terms “cat” and “dog”. Further assume that there is a document containing the sentence “It’s raining cats and dogs.” Then surely “raining” indicates that semantically this sentence is referring to a weather phenomenon - not to animals. It’s a metaphor, in other words. We want to prevent “cats” and “dogs” being extracted when they appear together with the word “raining”. Every term in our taxonomy has two special attributes that can be used in such situations: <em>Forbidden Spans</em> and <em>Context-Based Stop List</em>. In this case we could use the latter and add “raining” to our list of stop words. This means that whenever the word “raining” appears in the same context (sentence) as “cats” or “dogs” then these terms will actually not be extracted as animals. This is the IncreaseSTFPrecision’s task.</p>
<p>Once we will be done with building our taxonomy, we can either export the taxonomy in a <a href="http://www.w3.org/2004/02/skos/">SKOS format</a> or build and deploy a new skill cartridge that combines the taxonomy, the STF and IncreaseSTFPrecision skill cartridges and deploy them to our backend annotation server.</p>
<p>If you did not want to use the STF (and IncreaseSTFPrecision) skill cartridge but for example the TM360 or WPS (Wikipedia Synonyms) cartridge, then you’d have to uninstall the default cartridges first and then install the TM360 or WPS. (As a side note, I ran into problems when trying to work with the TM360 inside Luxid 7.0.1 Webstudio. My hope is that the already announced version 7.1 will solve these.)</p>
<p><strong>Step 5</strong><br />
There is a bug in Luxid 7.0.1 Webstudio which causes problems when searching for taxonomy terms in documents. To circumvent it, go to <em>> Project > Configure… > Languages</em> and then make sure that the language <em>– Unspecified –</em> is selected.</p>
<p><strong>Step 6</strong><br />
Now that we’ve installed our skill cartridge, we must upload a corpus of sample documents. I don’t recommend starting with too many documents in the beginning. Be aware that the import process may take some time as each document is first converted from its original format into an internal HTML format and further pre-annotated with certain metadata. The converted documents are then stored inside the MySQL database and all processing is performed on this internal data. This is however true only for the Content Enrichment Studio (CES) tools. The annotation server that is included in the CES tools operates on the documents in the MySQL database. The annotation server intended for productive use does not, one must either push documents to it through a REST API or tell it to pull documents from a certain location. In both cases it only processes the documents but does not store them anywhere.</p>
<p>By the way, in Webstudio deleting a document from the imported corpus is not possible, this is one of the features currently only available in the Annotation Workbench.</p>
<p>Many more file formats than the following are supported; however, be aware that the quality of the extraction results may vary according to document type:</p>
<ul>
<li>Word (doc, docx)</li>
<li>PDF</li>
<li>Text</li>
<li>HTML, XML</li>
<li>Powerpoint (ppt, pptx)</li>
<li>Excel (xls, xlsx)</li>
<li>Many more</li>
</ul>
<p>A few remarks:</p>
<ul>
<li>What is processed is <em>text content</em> from a file. If the file does not contain text or only little then of course nothing can be extracted. I have seen Word and PDF files containing nothing but images (for example screenshots). Luxid is not an optical character recognition (OCR) system. It cannot process data from images, audio or video files. This should be self-evident but especially when you’re dealing with data that was uploaded by users in an uncontrolled way to a shared repository such as Microsoft Sharepoint you will almost certainly encounter such documents.</li>
<li>Documents that do not contain any text at all (for example .dll or .exe files) are simply ignored by Luxid.</li>
<li>According to my experience extracting data from Excel files is much more difficult than from Word, PDF or Text files. Excel data is fundamentally table-based, it’s sorted in rows and columns. This is very different from common text which is fundamentally a single long string of words and characters.</li>
<li>When processing HTML or XML files it is recommended to use the <em>LuxidConverter</em> skill cartridge first in any annotation plan. This does not refer to Webstudio but to the annotation plans installed in the backend annotation server.</li>
</ul>
<p>So, let’s upload our sample documents. Go to <em>> Project > Add Documents…</em> and select the file(s) you want to upload. A nice feature, which is unfortunately not mentioned explicitly, is that you can upload a whole zip file containing several text files.</p>
<p><strong>Step 7</strong><br />
When working with the STF skill cartridge, most of the time we also work with a taxonomy. In case you already have an existing taxonomy that you want to enrich with new terms, you first need to import this taxonomy into Webstudio. On the left-hand side below the Luxid Webstudio logo, you’ll find an empty <em>Thesaurus</em>. If you click on the icon with the three horizontal bars, you can select <em>Import Thesaurus…</em>. The taxonomy should usually be in a SKOS format, but a few different file formats can be imported in the Annotation Workbench. This is an exmample for a very simply taxonomy in the SKOS format consisting of only three terms: animal, which will not be extracted (<code><luxid:DoNotExtract>true</luxid:DoNotExtract></code>) and a term “Dog” and “Cat” which are subterms of “Animal”.</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="nt"><rdf:RDF</span>
<span class="na">xmlns:skos=</span><span class="s">"http://www.w3.org/2004/02/skos/core#"</span>
<span class="na">xmlns:luxid=</span><span class="s">"http://www.temis.com/luxid#"</span>
<span class="na">xmlns:rdf=</span><span class="s">"http://www.w3.org/1999/02/22-rdf-syntax-ns#"</span>
<span class="na">xmlns:my-animals=</span><span class="s">"http://www.example.com/animal-taxonomy#"</span><span class="nt">></span>
<span class="c"><!-- Animal --></span>
<span class="nt"><rdf:Description</span> <span class="na">rdf:about=</span><span class="s">"http://www.example.com/animal-taxonomy#Animal"</span><span class="nt">></span>
<span class="nt"><rdf:type</span> <span class="na">rdf:resource=</span><span class="s">"http://www.w3.org/2004/02/skos/core#Concept"</span><span class="nt">/></span>
<span class="nt"><luxid:DoNotExtract></span>true<span class="nt"></luxid:DoNotExtract></span>
<span class="nt"><skos:prefLabel</span> <span class="na">xml:lang=</span><span class="s">"en"</span><span class="nt">></span>Animal<span class="nt"></skos:prefLabel></span>
<span class="nt"><skos:prefLabel</span> <span class="na">xml:lang=</span><span class="s">"de"</span><span class="nt">></span>Tier<span class="nt"></skos:prefLabel></span>
<span class="nt"></rdf:Description></span>
<span class="c"><!-- Dog --></span>
<span class="nt"><rdf:Description</span> <span class="na">rdf:about=</span><span class="s">"http://www.example.com/animal-taxonomy#Dog"</span><span class="nt">></span>
<span class="nt"><rdf:type</span> <span class="na">rdf:resource=</span><span class="s">"http://www.w3.org/2004/02/skos/core#Concept"</span><span class="nt">/></span>
<span class="nt"><skos:broader</span> <span class="na">rdf:resource=</span><span class="s">"http://www.example.com/animal-taxonomy#Animal"</span><span class="nt">/></span>
<span class="nt"><skos:prefLabel</span> <span class="na">xml:lang=</span><span class="s">"en"</span><span class="nt">></span>Dog<span class="nt"></skos:prefLabel></span>
<span class="nt"><skos:prefLabel</span> <span class="na">xml:lang=</span><span class="s">"fr"</span><span class="nt">></span>Chien<span class="nt"></skos:prefLabel></span>
<span class="nt"><my-animals:database-id></span>1234<span class="nt"></my-animals:database-id></span>
<span class="nt"></rdf:Description></span>
<span class="c"><!-- Cat --></span>
<span class="nt"><rdf:Description</span> <span class="na">rdf:about=</span><span class="s">"http://www.example.com/animal-taxonomy#Cat"</span><span class="nt">></span>
<span class="nt"><rdf:type</span> <span class="na">rdf:resource=</span><span class="s">"http://www.w3.org/2004/02/skos/core#Concept"</span><span class="nt">/></span>
<span class="nt"><skos:broader</span> <span class="na">rdf:resource=</span><span class="s">"http://www.example.com/animal-taxonomy#Cat"</span><span class="nt">/></span>
<span class="nt"><skos:prefLabel</span> <span class="na">xml:lang=</span><span class="s">"en"</span><span class="nt">></span>Cat<span class="nt"></skos:prefLabel></span>
<span class="nt"><skos:prefLabel</span> <span class="na">xml:lang=</span><span class="s">"de"</span><span class="nt">></span>Katze<span class="nt"></skos:prefLabel></span>
<span class="nt"><my-animals:database-id></span>1235<span class="nt"></my-animals:database-id></span>
<span class="nt"></rdf:Description></span>
<span class="nt"></rdf:RDF></span></code></pre></figure>
<p>Once imported, you can use the taxonomy in Webstudio.</p>
<p>There exists a nice, free taxonomy by the name of <em>STW (Standard-Thesaurus Wirtschaft)</em> primarily containing economics terms published and maintained by the <a href="http://zbw.eu/en/">ZBW German National Library of Economics</a> that you can use right out of the box. This taxonomy is in the RDF format which neatly integrates with SKOS. It can be found at <a href="http://zbw.eu/stw/versions/latest/download/">http://zbw.eu/stw/versions/latest/download/</a>.</p>
<p>There are different ways to interact with the taxonomy. A simple task is to add new terms anywhere in the hierarchy or delete existing ones by clicking on the plus and minus sign next to any term. A very clever feature is to receive suggestions for new terms not yet part of the taxonomy. We’ll do this further below.</p>
<p><strong>Step 8</strong><br />
Open a term by clicking on it in the Thesaurus section. Every term in the taxonomy contains a few attributes or fields.</p>
<p><em>General Attributes</em></p>
<ul>
<li><em>ID</em>: The ID of this term. This is not a database ID but rather an RDF-style ID, for example “http://www.example.com/animal-taxonomy” or “http://www.foo.com/cities/Mumbai”.</li>
<li><em>Preferred Labels</em>: The main or preferred label of this term, for example “Mumbai”. The STF skill cartridge will try to extract this text string from documents and indicate it has found a “Mumbai” term.</li>
<li><em>Alternative Labels</em>: Alternative labels of the term, e.g. “Bombay”. The STF skill cartridge will try to extract this text string from documents also. If found it will however indicate a “Mumbai” term.</li>
<li><em>Hidden Labels</em>: Labels of the term that are explicitly not extracted.</li>
</ul>
<p><em>Extraction</em></p>
<ul>
<li><em>Do Not Extract</em>: Set to true if for any reason you don’t want to extract this term.</li>
<li><em>Extraction Method</em>: <em>Fuzzy Matching</em> uses a, well, fuzzy string matching approach to match a term. This may still extract terms with misspellings. <em>Same Form</em> extracts terms that have an exact (case-sensitive or insensitive) match of characters.</li>
<li><em>Forbidden Spans</em>: Used by IncreaseSTFPrecision annotator.</li>
<li><em>Context-Based Stop List</em>: Used by IncreaseSTFPrecision annotator.</li>
</ul>
<p><em>Relationships</em></p>
<ul>
<li><em>Broader Concepts</em>: Concepts higher up in the taxonomy’s hierarchy.</li>
<li><em>Narrower Concepts</em>: Concepts further below in the taxonomy’s hierarchy.</li>
</ul>
<p><em>Notes</em></p>
<ul>
<li><em>Definition</em>: A field where you can enter a free text definition for this particular term or concept.</li>
<li><em>Scope note</em>: A field where you can enter additional notes concerning the scope of this particular term or concept.</li>
<li><em>Notation</em>: See <a href="http://www.w3.org/TR/skos-reference/#notations">SKOS definition</a> for more details.</li>
</ul>
<p><em>Custom Attributes</em></p>
<ul>
<li><em>Custom attributes</em>: Attributes already contained in the taxonomy that have no meaning to Webstudio may be stored as custom attributes.</li>
</ul>
<p><strong>Step 9</strong><br />
By clicking on the small eye icon (“Preview sentences potentially containing this concept”) the corpus of text documents is searched for occurrences of this term. Depending on the corpus and taxonomy size this may take a moment. The results are displayed in the <em>Preview</em> section. In this way you can check the quality of the extracted results and make corrections to your taxonomy.</p>
<p>Click on the <em>alternatives icon</em> next to the eye icon to receive a list of synonym candidates for this term. This is very useful to find alternative labels or possibly new terms.</p>
<p><strong>Step 10</strong><br />
You can receive suggestions for new terms not yet contained in the taxonomy (and not being synonyms neither). Click on the <em>lamp icon</em> next to the <em>Thesaurus</em> label. In the suggestions panel you can see a ranked list of terms that are potentially relevant to you and are not yet part of the taxonomy. There are different sorts of ranking, and you can adapt them according to your need. Drag and drop suggested terms simply from this list to your taxonomy.</p>
<p><strong>Step 11</strong><br />
When you are done with creating/enriching the taxonomy, you can actually export again to a file. Click on the icon with the three little horizontal bars close to the term <em>Thesaurus</em>, then select <em>> Export Thesaurus…</em>.</p>
<p><strong>Step 12</strong><br />
We are ready to export our taxonomy & STF skill cartridge to deploy it on the production annotation server. Click on <em>> Projects > Export Annotation Plan…</em>. In the background Webstudio builds a <em>skill cartridge archive (.sca)</em> file that is by default stored in your personal <em>Download</em> folder. Every .sca file is actually nothing else than a zip file. If you want you can rename it to .zip and open it with WinZip or a similar tool. This .sca file is a fully self-contained annotation plan. Internally it contains:</p>
<ol>
<li>its own copy of the taxonomy (in binary format),</li>
<li>its own copy of the STF skill annotator,</li>
<li>its own copy of the IncreaseSTFPrecision annotator.</li>
</ol>
<p>Being fully self-contained means that on the production server it will not have any cross-dependencies to other skill cartridges already installed there. On the positive side .sca files can easily be deployed and undeployed to and from the production annotation server. On the negative side they may become very big. This is especially true when working with the TM360 skill cartridge.</p>
<p><strong>Step 13</strong><br />
Log in to the production annotation server. There is usually an admin web app available at http://prod-server:8061/Luxid7Admin. You’ll need an administrator username and password. Once logged in you simply can select “Install Skill Cartridge” and select the .sca file that you’ve just exported.</p>
<p><strong>Step 14</strong><br />
Now that you’ve installed it, you can send documents to the server’s REST API to be annotated. The following cURL command can also be found in the REST API guide distributed together with the backend annotation server. We’ll assume that you just installed an annotation plan from a file named “MyAnimals.sca”.</p>
<p>Example 1: Sending a simple text string to the server:</p>
<figure class="highlight"><pre><code class="language-console" data-lang="console"><span class="go">curl -X POST -H "Content-Type: text/plain" -d "Hello cat and hello dog! What sort of animal are you?" http://prod-server:8091/temis/v1/annotation/annotate/MyAnimals</span></code></pre></figure>
<p>Example 2: Sending a PDF document to the server:</p>
<figure class="highlight"><pre><code class="language-console" data-lang="console"><span class="go">curl -X POST -H "Content-Type: application/octet-stream" --data-binary "@C:\myDocument.pdf" http://prod-server:8091/temis/v1/annotation/annotate/MyAnimals</span></code></pre></figure>
<p>Depending on the document’s contents you should receive a lengthy XML output in the LUX format. LUX is a XML format specification invented by Temis. The corresponding documentation is also distributed together with the REST API and the production annotation server. The LUX output contains:</p>
<ul>
<li>The raw text as it has been transformed from the original input document,</li>
<li>a list of sentence begin and end positions,</li>
<li>the extracted entities such as “/Thesaurus/Concept/Animal/Dog”,</li>
<li>their occurrences in the text,</li>
<li>eventually further attributes for each extracted entity.</li>
</ul>
<p>If you only receive minimal or no output at this stage, you’ll probably need to understand more about the concept of <em>static mapping</em>. The concept is not very difficult to understand, however changing static mapping rules in Webstudio is unfortunately not possible, and is rather complicated in Annotation Workbench. The basic idea is that every term in the taxonomy’s hierarchy is “bound” to a specific path inside the hierarchy, for example “/Thesaurus/Concept/Animal/Dog”. If you built your taxonomy in Webstudio your taxonomy will probably not contain a term “Concept” and for example the term for “Dog” will be placed in “/Thesaurus/Animals/Dog”. However, at the same time in <em>> Project > Configure… > Annotation</em> in the section <em>Annotation Strategies</em> the default settings are that all terms will be bound to “/Thesaurus/Concept” and not simply “/Thesaurus”. There are only two possibilities here: Either you change this mapping inside Annotation Workbench manually, or you accept your fate, create a new root term “/Thesaurus/Concept” and move your whole taxonomy under that term. Of course you need again to export everything as a .sca file and re-import it to the production annotation server.</p>
<p><strong>Step 15</strong><br />
This output could now be processed further by a service client, for example to highlight all occurrences of animal names in an original text.</p>
Implementation of Rhizomes - Part 22015-08-18T00:00:00+02:00http://fabian-kostadinov.github.io/2015/08/18/implementation-of-rhizomes-2<p><a href="/2014/09/09/implementation-of-rhizomes/">In my last post on the implementation of rhizomes</a> I still suggested using hash maps to store pairings, that is relations. It was just recently that I recognized that there is of course an even simpler and more concise way of storing relations: as a single, long bit string, where a bit is set to 1 if a relation is established.<span class="more"></span></p>
<p>Let us assume for the moment that we use the Cantor pairing function already defined earlier:</p>
<p><strong>Cantor pairing function</strong>:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js"><span class="nx">pair</span><span class="p">(</span><span class="nx">x</span><span class="p">,</span> <span class="nx">y</span><span class="p">)</span> <span class="o">=</span> <span class="nx">z</span> <span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="nx">x</span><span class="o">^</span><span class="mi">2</span> <span class="o">+</span> <span class="mi">3</span><span class="nx">x</span> <span class="o">+</span> <span class="mi">2</span><span class="nx">xy</span> <span class="o">+</span> <span class="nx">y</span> <span class="o">+</span> <span class="nx">y</span><span class="o">^</span><span class="mi">2</span><span class="p">)</span>
<span class="nx">unpair</span><span class="p">(</span><span class="nx">z</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="nx">x</span><span class="p">,</span> <span class="nx">y</span><span class="p">)</span> <span class="o">=</span> <span class="p">{</span>
<span class="nx">x</span> <span class="o">=</span> <span class="nx">z</span> <span class="o">-</span> <span class="p">(</span><span class="nx">q</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="nx">q</span><span class="p">))</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span>
<span class="nx">y</span> <span class="o">=</span> <span class="p">(</span><span class="nx">q</span> <span class="o">*</span> <span class="p">(</span><span class="mi">3</span> <span class="o">+</span> <span class="nx">q</span><span class="p">))</span><span class="o">/</span><span class="mi">2</span> <span class="o">-</span> <span class="nx">z</span> <span class="p">},</span>
<span class="kd">with</span> <span class="nx">q</span> <span class="o">=</span> <span class="nx">floor</span><span class="p">((</span><span class="o">-</span><span class="mi">1</span> <span class="o">+</span> <span class="nx">sqrt</span><span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="mi">8</span><span class="nx">z</span><span class="p">))</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span></code></pre></figure>
<p>For the relation <code>(1, 3)</code> the output will be <code>11</code>. The only thing we need to do to indicate that such a relation exists is to set the eleventh bit to <code>1</code>. To save some disk space, we additionally use <a href="http://arxiv.org/pdf/1402.6407.pdf">bit packing such as <em>Concise</em> or <em>Word Aligned Hybrid (WAH)</em></a> [1]. It is, however, highly recommended to use a bit packing algorithm that allows checking whether a bit is set without having to unpack and re-pack the string.
This results in an extremely storage-efficient representation. Of course, we still require a separate data structure such as a bi-directional hash map to store the relation between leaves and the real data.</p>
<p>Now that we got rid of hash maps, how do we efficiently look up all the relata for a given relator? Assume that you have a relation (x, ?). How can we efficiently find out all possible values for <code>?</code> without having to loop over all bits in the bit string?</p>
<p>Luckily, the Cantor pairing function (and, as it turns out, also the Szudzik pairing function) has nice regularities concerning the values stored in rows and also in columns. Take a look at the following figure:</p>
<p><img src="/public/img/2015-08-18-cantor-pairing-func.png" alt="Cantor pairing function example" title="Cantor pairing function example" /></p>
<p>To find out all the pairings for the relation (3, ?) the only bits we need to check are the ones in the row for 3, that is the values 9, 13, 18, 24, 31, 39, 48… . Finding the regularity behind this row is very simple (see the R sketch after the following list).</p>
<ol>
<li>Calculate the first pairing (3, 0) => 9.</li>
<li>3 is the fourth number (starting at 0). Add 4 to the result from the last step: 9 + 4 = 13. Thus, (3, 1) => 13.</li>
<li>Add 4 + 1 to the result from the last step: 13 + (4 + 1) = 18. Thus, (3, 2) => 18.</li>
<li>Add 4 + 2 to the result from the last step: 18 + (4 + 2) = 24. Thus, (3, 3) => 24.</li>
<li>24 + (4 + 3) = 31. Thus, (3, 4) => 31.
Et cetera.</li>
</ol>
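<p>In R, this regularity can be used to enumerate all candidate bit positions of a row directly - a minimal sketch, reusing the <code>pair</code> function from the snippet above:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Bit positions of all pairings (x, ?) up to a maximum index,
# without scanning every bit in the bit string
row_positions <- function(x, max_z) {
  z <- pair(x, 0)   # first pairing in the row
  step <- x + 1     # x is the (x+1)-th number when counting from 0
  out <- c()
  while (z <= max_z) {
    out <- c(out, z)
    z <- z + step
    step <- step + 1  # the increment grows by 1 per column
  }
  out
}
row_positions(3, 50)  # 9 13 18 24 31 39 48</code></pre></figure>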
<p>If the complete bit string has a length of 10000 then we simply need to check the bits 9, 13, 18, 24, 31, 39, 48… up to and including the 9999<sup>th</sup> bit to find all the pairings (3, ?). Of course, this could easily be parallelized too.</p>
<p>For the Szudzik pairing function, the situation is only slightly more complicated. There, we need to distinguish between values below the diagonal and those above it. Other than that, the same principles apply. One nice feature of the Szudzik pairing function is that all values below the diagonal are actually consecutive numbers.</p>
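<p>A quick illustration - a sketch assuming the common convention of the Szudzik function, where the larger argument determines the “shell”:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Szudzik pairing (assumed form: x*x + x + y if x >= y, else y*y + x)
szudzik <- function(x, y) {
  if (x >= y) x * x + x + y else y * y + x
}
# For a fixed x, the values (x, 0) ... (x, x) below the diagonal
# are consecutive integers:
sapply(0:3, function(y) szudzik(3, y))  # 12 13 14 15</code></pre></figure>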
<hr />
<h1 id="references">References</h1>
<p>[1] Chambi S., Lemire D., Kaser O., Godin R. (2015): Better bitmap performance with Roaring bitmaps. Working paper published at <a href="http://arxiv.org/abs/1402.6407">http://arxiv.org/abs/1402.6407</a>.</p>
Principal Component Analysis in R2015-05-31T00:00:00+02:00http://fabian-kostadinov.github.io/2015/05/31/pca-in-r<p>I wanted to know a little more on <em>Principal Component Analysis (PCA)</em> in R. For this purpose, I first created my own artificial dataset. I wanted to reuse the same dataset later on for performing also cluster analysis, so I put a little bit of thought in how to create it.<span class="more"></span></p>
<p>This is the R code I used.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Classes</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">createData</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">class</span><span class="p">,</span><span class="w"> </span><span class="n">means1</span><span class="p">,</span><span class="w"> </span><span class="n">sd1</span><span class="p">,</span><span class="w"> </span><span class="n">means2</span><span class="p">,</span><span class="w"> </span><span class="n">sd2</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">class</span><span class="p">)</span><span class="w">
</span><span class="n">X1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">means1</span><span class="p">[</span><span class="n">class</span><span class="p">],</span><span class="w"> </span><span class="n">sd1</span><span class="p">)</span><span class="w">
</span><span class="n">X2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">means2</span><span class="p">[</span><span class="n">class</span><span class="p">],</span><span class="w"> </span><span class="n">sd2</span><span class="p">)</span><span class="w">
</span><span class="n">X3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="m">-100</span><span class="o">:</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">X4</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">X1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">X1</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">-0.8</span><span class="p">,</span><span class="w"> </span><span class="m">0.8</span><span class="p">)</span><span class="w">
</span><span class="n">X5</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">-18</span><span class="p">,</span><span class="w"> </span><span class="m">1.2</span><span class="p">)</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">X3</span><span class="p">,</span><span class="w"> </span><span class="n">X1</span><span class="p">,</span><span class="w"> </span><span class="n">X4</span><span class="p">,</span><span class="w"> </span><span class="n">X2</span><span class="p">,</span><span class="w"> </span><span class="n">X5</span><span class="p">),</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">X1</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"X3"</span><span class="p">,</span><span class="w"> </span><span class="s2">"X1"</span><span class="p">,</span><span class="w"> </span><span class="s2">"X4"</span><span class="p">,</span><span class="w"> </span><span class="s2">"X2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"X5"</span><span class="p">)</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">createData</span><span class="p">(</span><span class="n">Classes</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">),</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.3</span><span class="p">,</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="m">1.0</span><span class="p">),</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span></code></pre></figure>
<p>Let’s go through this step by step.</p>
<p>First, I created a vector of length 100 containing values 1, 2 or 3: <code>Classes <- sample(1:3, 100, replace = TRUE)</code>. The value would indicate later whether a certain observation in my dataset belongs to one of three different clusters of data.</p>
<p>Next, I created a function that can be used to create a dataset automatically. The created dataset has five different variables X1 to X5. X3 is a uniformly distributed random variable in the range of -100 to 100: <code>X3 <- sample(-100:100, n, replace=TRUE)</code>. X5 is a normally distributed random variable with mean -18 and a standard deviation of 1.2: <code>X5 <- rnorm(n, -18, 1.2)</code>. In other words, X3 and X5 are totally unrelated to all other variables. In contrast, X1, X2 and X4 are all correlated to each other. An observation is placed in one of three different clusters of observations along three dimensions denoted as X1, X2 and X4. X1 has three cluster centers at 3, 50 and 100 and a standard deviation of 5: <code>X1 <- rnorm(n, means1[class], sd1)</code>; X2 has three cluster centers at 0.3, 0.5 and 1.0 and a standard deviation of 0.1: <code>X2 <- rnorm(n, means2[class], sd2)</code>; X4 is derived from X1: <code>X4 <- X1 + X1 * runif(n, -0.8, 0.8)</code>. Note that I reordered all variables somewhat to make our task a little juicier.</p>
<p>Here’s the correlation matrix and the correlogram.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">cor</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w">
</span><span class="n">pairs</span><span class="p">(</span><span class="n">data</span><span class="p">)</span></code></pre></figure>
<table>
<tr>
<td></td>
<td>X3</td>
<td>X1</td>
<td>X4</td>
<td>X2</td>
<td>X5</td>
</tr>
<tr>
<td>X3</td>
<td>1.00000000</td>
<td>-0.1133516</td>
<td>-0.08752043</td>
<td>-0.09320869</td>
<td>-0.02701275</td>
</tr>
<tr>
<td>X1</td>
<td>-0.11335158</td>
<td>1.0000000</td>
<td>0.83200505</td>
<td>0.90965123</td>
<td>-0.14800759</td>
</tr>
<tr>
<td>X4</td>
<td>-0.08752043</td>
<td>0.8320051</td>
<td>1.00000000</td>
<td>0.75789451</td>
<td>-0.15882644</td>
</tr>
<tr>
<td>X2</td>
<td>-0.09320869</td>
<td>0.9096512</td>
<td>0.75789451</td>
<td>1.00000000</td>
<td>-0.13261305</td>
</tr>
<tr>
<td>X5</td>
<td>-0.02701275</td>
<td>-0.1480076</td>
<td>-0.15882644</td>
<td>-0.13261305</td>
<td>1.00000000</td>
</tr>
</table>
<p><img src="/public/img/2015-05-31-correlogram.png" alt="Correlogram" title="Correlogram" /></p>
<p>It should already be obvious from both the correlation matrix and the correlogram that there is a strong connection between X1, X2 and X4, whereas X3 and X5 are unrelated to the rest. Let’s see whether we can confirm this with a PCA. We use the <code>princomp</code> function with the default parameters.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">pc</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">princomp</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">pc</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">pc</span><span class="p">)</span></code></pre></figure>
<table>
<tr>
<th>Importance of components:</th>
<th>Comp.1</th>
<th>Comp.2</th>
<th>Comp.3</th>
<th>Comp.4</th>
<th>Comp.5</th>
</tr>
<tr>
<td>Standard deviation</td>
<td>67.5243440</td>
<td>60.8182793</td>
<td>18.54506263</td>
<td>1.2692048845</td>
<td>1.216295e-01</td>
</tr>
<tr>
<td>Proportion of Variance</td>
<td>0.5299356</td>
<td>0.4299031</td>
<td>0.03997229</td>
<td>0.0001872259</td>
<td>1.719413e-06</td>
</tr>
<tr>
<td>Cumulative Proportion</td>
<td>0.5299356</td>
<td>0.9598388</td>
<td>0.99981105</td>
<td>0.9999982806</td>
<td>1.000000e+00</td>
</tr>
</table>
<p>The table shows that components 1 and 2 together already capture 96% of the data’s variance. The screeplot confirms this.</p>
<p><img src="/public/img/2015-05-31-screeplot.png" alt="Screeplot" title="Screeplot" /></p>
<p>To find out which components match to which variables, we’ll use the <code>loadings</code> function.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loadings</span><span class="p">(</span><span class="n">pc</span><span class="p">)</span></code></pre></figure>
<table>
<tr>
<th>Loadings</th>
<th>Comp.1</th>
<th>Comp.2</th>
<th>Comp.3</th>
<th>Comp.4</th>
<th>Comp.5</th>
</tr>
<tr>
<td>X3</td>
<td>0.596</td>
<td>0.803</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>X1</td>
<td>-0.459</td>
<td>0.322</td>
<td>0.828</td>
<td></td>
<td></td>
</tr>
<tr>
<td>X4</td>
<td>-0.659</td>
<td>0.501</td>
<td>-0.560</td>
<td></td>
<td></td>
</tr>
<tr>
<td>X2</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>1.000</td>
</tr>
<tr>
<td>X5</td>
<td></td>
<td></td>
<td></td>
<td>1.000</td>
<td></td>
</tr>
</table>
<table>
<tr>
<th></th>
<th>Comp.1</th>
<th>Comp.2</th>
<th>Comp.3</th>
<th>Comp.4</th>
<th>Comp.5</th>
</tr>
<tr>
<td>SS loadings</td>
<td>1.0</td>
<td>1.0</td>
<td>1.0</td>
<td>1.0</td>
<td>1.0</td>
</tr>
<tr>
<td>Proportion Var</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
</tr>
<tr>
<td>Cumulative Var</td>
<td>0.2</td>
<td>0.4</td>
<td>0.6</td>
<td>0.8</td>
<td>1.0</td>
</tr>
</table>
<p>Components 4 and 5 map to variables X2 and X5. They are clearly independent from the rest. Components 1, 2 and 3 map to variables X3, X1 and X4.</p>
<p>Wait a second. Does this make sense? Something is fishy here. From how we constructed the sample dataset, we would actually expect variables X1, X2 and X4 to belong together - not X1, X3 and X4! Did we maybe make a mistake somewhere? Unfortunately, we did. <code>princomp</code> by default uses the covariance matrix - not the correlation matrix - to compute the different components. Variances, unlike correlations, depend strongly on the scale and units of the input data. This means <strong><a href="http://stats.stackexchange.com/questions/53/pca-on-correlation-or-covariance">we must either scale our input data prior to performing a principal component analysis or base our analysis on the correlations, not the covariances, between our variables</a></strong>.</p>
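<p>As a side note, both routes lead to essentially the same components (up to the <em>n</em> versus <em>n-1</em> divisor in the variance estimate and possible sign flips of the loadings); a minimal sketch, assuming the <code>data</code> matrix from above:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Two (essentially) equivalent ways to base the PCA on correlations
pc_cor    <- princomp(data, cor = TRUE)  # use the correlation matrix directly
pc_scaled <- princomp(scale(data))       # or standardize each column first
loadings(pc_cor)     # same loadings (possibly up to sign) ...
loadings(pc_scaled)  # ... as here</code></pre></figure>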
<p>Let’s quickly have a look at the covariance matrix.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">var</span><span class="p">(</span><span class="n">data</span><span class="p">)</span></code></pre></figure>
<table>
<tbody>
<tr>
<td></td>
<td>X3</td>
<td>X1</td>
<td>X4</td>
<td>X2</td>
<td>X5</td>
</tr>
<tr>
<td>X3</td>
<td>4043.545354</td>
<td>-287.916544</td>
<td>-307.35406</td>
<td>-1.74489036</td>
<td>-2.22236234</td>
</tr>
<tr>
<td>X1</td>
<td>-287.916544</td>
<td>1595.569658</td>
<td>1835.40627</td>
<td>10.69703439</td>
<td>-7.64903703</td>
</tr>
<tr>
<td>X4</td>
<td>-307.354064</td>
<td>1835.406265</td>
<td>3049.97811</td>
<td>12.32218479</td>
<td>-11.34843658</td>
</tr>
<tr>
<td>X2</td>
<td>-1.744890</td>
<td>10.697034</td>
<td>12.32218</td>
<td>0.08666848</td>
<td>-0.05051051</td>
</tr>
<tr>
<td>X5</td>
<td>-2.222362</td>
<td>-7.649037</td>
<td>-11.34844</td>
<td>-0.05051051</td>
<td>1.67390100</td>
</tr>
</tbody>
</table>
<p>Note how very different the result looks compared to the correlation matrix above. For example, the covariance (in absolute numbers) between X1 and X3 is much higher than the covariance between X1 and X2, although we know for sure that the opposite should be true. Looking at the correlation matrix, the relations make much more sense.</p>
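<p>Indeed, the correlation matrix is nothing but the covariance matrix rescaled by the variables’ standard deviations, which is easy to verify:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># The correlation matrix is just the rescaled covariance matrix
all.equal(cov2cor(var(data)), cor(data))  # TRUE</code></pre></figure>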
<p>Fortunately, it’s pretty easy to perform a PCA based on the correlation matrix by using the <code>cor = TRUE</code> parameter in the <code>princomp</code> function.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">pc</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">princomp</span><span class="p">(</span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="n">cor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">pc</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">pc</span><span class="p">)</span></code></pre></figure>
<table>
<tbody><tr>
<th>Importance of components:</th>
<th>Comp.1</th>
<th>Comp.2</th>
<th>Comp.3</th>
<th>Comp.4</th>
<th>Comp.5</th>
</tr>
<tr><td>Standard deviation</td>
<td>1.649629</td>
<td>1.013166</td>
<td>0.9597193</td>
<td>0.50290695</td>
<td>0.27972026</td></tr>
<tr><td>Proportion of Variance</td>
<td>0.544255</td>
<td>0.205301</td>
<td>0.1842122</td>
<td>0.05058308</td>
<td>0.01564869</td>
</tr>
<tr><td>Cumulative Proportion</td>
<td>0.544255</td>
<td>0.749556</td>
<td>0.9337682</td>
<td>0.98435131</td>
<td>1.00000000</td>
</tr>
</tbody></table>
<p><img src="/public/img/2015-05-31-screeplot2.png" alt="Screeplot" title="Screeplot" /></p>
<p>Looking at the summary output and the screeplot, we’d probably conclude that having three components would be a good choice. Let’s look at the loadings.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loadings</span><span class="p">(</span><span class="n">pc</span><span class="p">)</span></code></pre></figure>
<table>
<tbody><tr><th>Loadings</th>
<th>Comp.1</th>
<th>Comp.2</th>
<th>Comp.3</th>
<th>Comp.4</th>
<th>Comp.5</th>
</tr>
<tr><td>X3</td>
<td> </td>
<td>0.772</td>
<td>0.628</td>
<td> </td>
<td> </td>
</tr>
<tr><td>X1</td>
<td>-0.586</td>
<td> </td>
<td> </td>
<td>-0.201</td>
<td>0.779</td>
</tr>
<tr><td>X4</td>
<td>-0.551</td>
<td> </td>
<td> </td>
<td>0.801</td>
<td>-0.218</td>
</tr>
<tr><td>X2</td>
<td>-0.569</td>
<td> </td>
<td>0.118</td>
<td>-0.564</td>
<td>-0.587</td>
</tr>
<tr><td>X5</td>
<td>0.144</td>
<td>-0.635</td>
<td>0.759</td>
<td> </td>
<td> </td>
</tr>
</tbody></table>
<table>
<tbody><tr><th> </th>
<th>Comp.1</th>
<th>Comp.2</th>
<th>Comp.3</th>
<th>Comp.4</th>
<th>Comp.5</th>
</tr>
<tr><td>SS loadings</td>
<td>1.0</td>
<td>1.0</td>
<td>1.0</td>
<td>1.0</td>
<td>1.0</td>
</tr>
<tr><td>Proportion Var</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
</tr>
<tr><td>Cumulative Var</td>
<td>0.2</td>
<td>0.4</td>
<td>0.6</td>
<td>0.8</td>
<td>1.0</td>
</tr>
</tbody></table>
<p>This certainly looks better. X1, X4 and X2 now seem to be related. Interestingly, though, X3 and X5 also seem to be related. From the loadings table alone, we would certainly still find it difficult to guess that they are actually independent from each other. However, in combination with the correlation matrix further above we can conclude that X1, X2 and X4 indeed belong together since they are highly correlated, and that furthermore X3 and X5 must be separated from each other because they are not correlated. This would then result in three components.</p>
<p>From here we could now go on to rotate our components if we wanted to.</p>
<hr />
<p>Here are a few other good tutorials on PCA:</p>
<ul>
<li><a href="http://www.statmethods.net/advstats/factor.html">http://www.statmethods.net/advstats/factor.html</a></li>
<li><a href="http://www.r-bloggers.com/pca-and-k-means-clustering-of-delta-aircraft/">http://www.r-bloggers.com/pca-and-k-means-clustering-of-delta-aircraft/</a></li>
<li><a href="http://ocw.jhsph.edu/courses/statisticspsychosocialresearch/pdfs/lecture8.pdf">http://ocw.jhsph.edu/courses/statisticspsychosocialresearch/pdfs/lecture8.pdf</a></li>
</ul>
<p>From all the available videos on Youtube on this topic, I especially liked this one:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/Od8gfNOOS9o" frameborder="0" allowfullscreen=""></iframe>
<p>This video gives a good overview of the theoretical background of principal component analysis:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/f9mZ8dSrVJA" frameborder="0" allowfullscreen=""></iframe>
<p>Once you have performed a principal component analysis, you might then want to perform a principal component regression.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/-5nnciZ9hgc" frameborder="0" allowfullscreen=""></iframe>
Four Questions To Ask During A Job Interview2015-02-05T00:00:00+01:00http://fabian-kostadinov.github.io/2015/02/05/four-questions-to-ask-during-a-job-interview<p>During the last few months, I had several job interviews with different prospective employers. A few weeks ago, a colleague pointed me to four questions I could ask a prospective employer that would tell me a lot about the working culture I’d be joining. As my experience with asking these questions was very positive, I thought I’d publish them here on my blog. To the best of my knowledge, the <a href="http://karrierebibel.de/arbeitgebertest-was-sie-ihren-kuenftigen-chef-fragen-sollten/">original (German) version comes from a guy called Lars Vollmer</a>. What follows is nothing but a translation from German to English.<span class="more"></span></p>
<blockquote>
<strong>Why do you work here?</strong>
</blockquote>
<p>This question tells you a lot about your counterparty’s inner motivation. One interviewer, for example, quite frankly told me that paying the highest salaries in the industry was not something his company was striving for. At the same time, he seemed genuinely satisfied with the company’s culture and how people treat each other. This actually agreed with my own values, so his answer told me exactly what I wanted to know.</p>
<blockquote>
<strong>Why does your enterprise exist?</strong>
</blockquote>
<p>This question tells you whether the management and the company as a whole know what they are actually doing. Do they have a business strategy in place? Are they aware of the actual business need? If it is a listed company, you can compare the company’s long-term stock trend with what your interviewer answers. If the stock price is in a long-term decline but your counterparty keeps talking about future business expansion plans, something might be fishy.</p>
<blockquote>
<strong>What will I need permission for as an employee?</strong>
</blockquote>
<p>This one is about freedom and how much influence you actually have on what is happening around you. For example, being a software engineer by education, one question I repeatedly asked was: What if I hypothetically wanted to have my own virtual server space to try out a few things - how difficult would it be to get this? It was quite amazing to see how much such a simple question was able to tell me about the internals of the company. Another interesting question was to ask whether the company used any agile software development methodology like <a href="http://en.wikipedia.org/wiki/Scrum_%28software_development%29">Scrum</a>. One should assume that nowadays most companies have learned from past IT projects that failed by relying on outdated methodologies such as the <a href="http://en.wikipedia.org/wiki/Waterfall_model">waterfall model</a>, but this is not the case at all.</p>
<blockquote>
<strong>What was the last big mistake, and how was it dealt with?</strong>
</blockquote>
<p>This is about honesty and a certain level of humbleness, but also about the degree of insight into the complexities of a living organization. Are the superiors aware of the fact that mistakes can never be completely avoided and that they should actually be seen as opportunities for further self-improvement? When something bad has happened, is there analysis and learning, or denial and dismissal?</p>
<p>In my experience, not only did these questions help me get a more concise picture of a company’s working culture, but my prospective employers generally also seemed to enjoy answering some challenging questions and appreciated critical thinking on my side. And should you unexpectedly be confronted with a person who is seriously offended by any of these questions, then how much does that tell you about your prospective employer? I, for my part, would certainly take my hat and leave. Ask yourself this:</p>
<ul>
<li>Do I want to work in a company where critical questions are allowed to be asked?</li>
<li>Where there is an environment of mutual respect and encouragement, even when there is disagreement on a certain topic?</li>
<li>Does this job role silently imply only executing orders given by my manager? Or is there space for me and my colleagues to co-shape the work environment?</li>
<li>Are the rules currently in place actually supportive of my work, or are they rather rigid conventions that will sooner or later prevent me from achieving my - and ultimately my employer’s - goals?</li>
</ul>
<p>Don’t forget, these questions are not meant to criticise anyone! The goal is not to be unfriendly or nagging, but to investigate - that is, to find out whether this is a place where you would actually enjoy working. Be creative when some of the questions don’t apply exactly. For example, one startup I interviewed with could of course not answer questions about “their last big mistake”. But since my prospective boss had experience leading a team earlier in his career, I simply rephrased the question to target his former work experience. He was actually happy to answer it.</p>
How I Created a Simple DBMS using GitHub, Jekyll, Prose and Heroku2015-02-04T00:00:00+01:00http://fabian-kostadinov.github.io/2015/02/04/how-i-created-a-simple-dbms-using-github-jekyll-prose-and-heroku<p>Recently, I created a simple database management system including user rights management relying on GitHub, Jekyll, Prose, Heroku and a few other open source products. The basic idea is to store all data inside a <code>_data</code> directory in a GitHub repository. A user can access this data through a website (Prose) and manipulate it through a HTML form (JSONForm), but she needs to be authorized to do so (Jekyll-Auth). The solution also offers a simple, yet functional search functionality. Here is a list of products I relied upon:<span class="more"></span></p>
<ul>
<li><a href="https://github.com">GitHub</a>: A (paid) organizational account is required, which allows the creation of teams and private repositories. Only one private repository is needed. This repository will contain all files in our database, whereas a team will define which users have access to the repo’s data. The data will be kept exclusively in the repository’s <em>master</em> branch, the <em>gh-pages</em> branch is not used.</li>
<li><a href="http://jekyllrb.com/">Jekyll</a>: Jekyll introduced a new feature in v2.1 - the <code>_data</code> directory. Our database will store information in files which are located in this directory or in a subdirectory.</li>
<li><a href="http://www.heroku.com/">Heroku</a>: Heroku will be used to host a Jekyll server (and some additional tools) on which a website will be run where users can interact with the data stored on GitHub through an adapted version of Prose.</li>
<li><a href="https://github.com/benbalter/jekyll-auth">Jekyll-Auth</a>: Websites hosted in a <code>gh-pages</code> branch of a GitHub repository are made accessible to the whole world, even if the repository itself is private. This is not what we want for a database management system. By hosting our website on a Jekyll server running on Heroku and using Jekyll-Auth, we keep the website itself in a protected zone and can control access to both website and data.</li>
<li><a href="https://github.com/prose/prose">Prose</a>: Prose is an online editor for editing (markdown, HTML and other) files being kept in a repository at GitHub.</li>
<li><a href="https://github.com/rypan/jekyll-db">Jekyll-DB</a>: To present an overview of all data to the user and allow free text search I used Jekyll-DB.</li>
<li><a href="https://github.com/joshfire/jsonform">JSONForm</a>: I wanted to use a JSON Schema-based tool to create HTML forms for my input data and chose JSONForm. As I found out later, JSONForm does not seem to be actively developed and maintained anymore. I also stumbled over a few bugs or shortcomings. So I’d now rather recommend using something different such as <a href="http://www.alpacajs.org/">Alpaca Forms</a>.</li>
</ul>
<p>First, a general understanding of <a href="http://jekyllrb.com/docs/datafiles/">Jekyll’s data files feature</a> is needed. Current versions of Jekyll allow creating a directory named <code>/_data</code>. All valid CSV, JSON or Yaml files added to this directory are read by Jekyll. This allows one to create Liquid templates that display aggregate information from all these files. These templates can easily be included in a standard HTML page. This is the basic functionality we rely upon.</p>
<p>I further installed Jekyll-DB as part of my website. According to its documentation it can loop through all posts in a <code>_posts</code> directory, but Jekyll-DB works just as well with data files kept in <code>_data</code>. Jekyll-DB is really easy to install and works well with a few hundred file entries. It is, however, not very well suited for full-text search over fields that contain longer text entries. But for my purposes it was sufficient. Jekyll-DB is shown in the next picture. In the example, all entries containing the string “tomat” are filtered and displayed.</p>
<p><img src="/public/img/2015-02-04-jekyll-db-screenshot.png" alt="Jekyll-DB Screenshot" title="Jekyll-DB Screenshot" /></p>
<p>The generated HTML pages are part of a whole website that must be hosted by a web server. Of course we could simply put all our files in a GitHub repository’s <code>gh-pages</code> branch. Jekyll running on GitHub would then host and serve the full website. This indeed works. The only problem is that I had to handle more sensitive data. Remember: Even if the repository itself is private, whatever is published into the <code>gh-pages</code> branch is available to the whole world. Thus, I could not rely on GitHub’s internal Jekyll server. Of course I could have installed and run a Jekyll server somewhere on my own machine, but this would imply that I needed to run my own (virtual) server, secure the server environment, maintain it etc. Furthermore, for every update pushed to the GitHub repository I’d manually have to update the website’s content on my server.<br />
Fortunately, there’s another handy tool for exactly this situation: Jekyll-Auth. I have written an <a href="/2014/11/13/installation-of-jekyll-auth/">extensive article on how Jekyll-Auth works</a>. Basically, Jekyll-Auth is a Gem file plus a collection of Ruby files that are added to your website and can be installed using <a href="http://bundler.io/">Bundler</a>. By calling <code>bundle install</code> a Jekyll server is installed on a Heroku account, acting as a web server for your website. The deal is that whenever a person tries to access the website, they have to provide their GitHub username and password. Jekyll running on Heroku then tries to authenticate and authorize the user by sending an authorization request to GitHub. GitHub checks whether the user is actually a member of a GitHub organization and team that is allowed to view and modify content stored in the GitHub repository. GitHub then returns an authorization token to Jekyll running on Heroku that either allows or denies access to the website.</p>
<p>Jekyll-Auth by itself is not updated when a user pushes changes to the GitHub repository. In earlier days we could have used something like <a href="https://github.com/himynameisjonas/heroku-deployer/">Heroku Deployer</a> relying on a webhook to achieve this, but nowadays there’s <a href="https://devcenter.heroku.com/articles/github-integration">built-in support to integrate Heroku with GitHub</a>. With a few clicks in my Heroku’s dashboard I can easily configure Heroku to update whenever a (push) event is triggered at my GitHub repository.</p>
<p>Let’s just quickly recall what we have now. First, we can store CSV-/JSON-/Yaml- data files inside a <code>_data</code> directory and use Liquid templates to dynamically generate HTML output files aggregating data from all these data files. Second, we have a rudimentary search capability over all data files by using Jekyll-DB. Third, using a combination of a private repository with Jekyll-Auth running on Heroku, we can grant access to our website to only those users who are also members of our GitHub organization and team. Not bad.</p>
<p>What is still missing is a way to easily manipulate data files. Of course, we could simply do this in our local clone of the GitHub repository and push our changes upstream, or directly online through GitHub’s website. For my users I wanted to have something more user-friendly. Usually, users are presented with an input form containing text and number fields where they can manipulate the data and then simply press a <em>Submit</em> button to validate the entered data and send the changes to the server.</p>
<p>Because my data files were all in a JSON format, I decided to use a JSON schema aware input form generator tool like <a href="https://github.com/joshfire/jsonform">JSONForm</a> or <a href="http://www.alpacajs.org/">Alpaca Forms</a>. I chose JSONForm but since I ran into a few bugs and shortcomings (e.g. missing multi-selection lists, some CSS problems when trying to integrate JSONForm into an existing HTML page with its own CSS) and because it is no longer actively developed I’d probably recommend something else now. The next picture shows a JSONForm-based input form embedded in the Prose editor. Currently, a file <em>101-tomaten-prod.json</em> is opened.</p>
<p><img src="/public/img/2015-02-04-jsonform-prose-screenshot.png" alt="JSONForm Prose Screenshot" title="JSONForm Prose Screenshot" /></p>
<p>One must be aware that these client-side input form frameworks can of course only provide client-side form validation, which is naturally limited in functionality and insecure. However, server-side validation would have required a significantly more complicated solution when using Jekyll running on Heroku. For my purposes, client-side validation was enough.</p>
<p>JSONForm helped me to create input forms, but I still needed a basic framework in which to embed these forms. The framework needs to accept user input, call JSONForm’s validation routine, and add, commit and push data to the GitHub repository. This can be done either directly through the GitHub API or with the help of a Javascript library such as <a href="https://github.com/creationix/js-git">js-git</a> or maybe the (outdated?) <a href="https://github.com/danlucraft/git.js">Git.js</a>. However, I would probably still have ended up writing quite an amount of code myself. So, instead of just using a Javascript library, I downloaded Prose and tweaked it a little. Prose is basically an online file editor backed by GitHub. You can get the source code yourself and run it on your own server, or you can use the free online version running at <a href="http://prose.io/">Prose.io</a>. My idea was that whenever a user would open a file that matches a certain file name pattern - for example <code>*-blah.json</code> - Prose would not open the standard text editor but instead an input form fitting this file type, backed by JSONForm. I must admit that my solution is quite an ugly hack, but it was not too easy for me to see through Prose’s design as the documentation is, uhm, somewhat lacking details that would make your life a little easier. I ended up basically modifying only a few files:</p>
<ul>
<li><code>prose/app/util.js</code>: Added some code to recognize different JSON files according to their file name pattern.</li>
<li><code>prose/app/models/file.js</code>: Added a little code also necessary to recognize different JSON files according to their file name pattern.</li>
<li><code>prose/app/views/file.js</code>: This file contained the main changes. Basically, I had to provide code that whenever a matching file is requested not the standard editor is opened but our own input form. When a user presses the submit button, validation must be triggered.</li>
<li><code>prose/app/views/jsonformformat.js</code>: I also added another file where I specified the JSON schema required for JSONForm.</li>
<li>I also had to make some minor changes to <code>prose/index.html</code> and maybe a few other files.</li>
</ul>
<p>As I said my code is quite a hack, but if anyone is interested in the source code I can provide it.</p>
<p>One should be aware that because <a href="https://github.com/prose/gatekeeper">Prose requires a running gatekeeper server</a> that handles authorization and authentication with GitHub, I had to also install the gatekeeper server on Heroku. Thus, I ended up running two different server applications - Jekyll and Prose Gatekeeper.</p>
<p>Finally, I had everything in place. With Prose and JSONForm I provided a framework for data manipulation through user input forms. In combination with all the other things I finally had created a simple, yet effective database management system based on GitHub, Jekyll, Prose and Heroku.</p>
<hr />
<p>If you want to try out yourself, we made the code publicly available under <a href="https://github.com/Eaternity/eaternity-database-public">https://github.com/Eaternity/eaternity-database-public</a>.</p>
Comparing ADF Test Functions in R2015-01-27T00:00:00+01:00http://fabian-kostadinov.github.io/2015/01/27/comparing-adf-test-functions-in-r<p>In one of my last posts I was not sure how R’s different ADF test functions worked in detail. So, based on <a href="http://r.789695.n4.nabble.com/Howto-test-for-true-stationarity-with-adfTest-td3224977.html">this discussion thread</a> I set up a simple test. I created four time series:</p>
<ol>
<li><em>flat0</em>: stationary with mean 0,</li>
<li><em>flat20</em>: stationary with mean 20,</li>
<li><em>trend0</em>: trend stationary with “trend mean” crossing through (0, 0) - i.e. without intercept,</li>
<li><em>trend20</em>: trend stationary with “trend mean” crossing through (0, 20) - i.e. with intercept 20.<span class="more"></span></li>
</ol>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">tseries</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">xts</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">fUnitRoots</span><span class="p">)</span><span class="w">
</span><span class="n">flat0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">xts</span><span class="p">(</span><span class="n">rnorm</span><span class="p">(</span><span class="m">100</span><span class="p">),</span><span class="w"> </span><span class="n">Sys.Date</span><span class="p">()</span><span class="m">-100</span><span class="o">:</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">flat0</span><span class="p">)</span><span class="w">
</span><span class="n">flat20</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">xts</span><span class="p">(</span><span class="n">rnorm</span><span class="p">(</span><span class="m">100</span><span class="p">),</span><span class="w"> </span><span class="n">Sys.Date</span><span class="p">()</span><span class="m">-100</span><span class="o">:</span><span class="m">1</span><span class="p">)</span><span class="m">+20</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">flat20</span><span class="p">)</span><span class="w">
</span><span class="n">trend0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">flat0</span><span class="o">+</span><span class="p">(</span><span class="n">row</span><span class="p">(</span><span class="n">flat0</span><span class="p">)</span><span class="o">*</span><span class="m">0.1</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">trend0</span><span class="p">)</span><span class="w">
</span><span class="n">trend20</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">flat0</span><span class="o">+</span><span class="p">(</span><span class="n">row</span><span class="p">(</span><span class="n">flat0</span><span class="p">)</span><span class="o">*</span><span class="m">0.1</span><span class="p">)</span><span class="m">+20</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">trend20</span><span class="p">)</span></code></pre></figure>
<p>Here are the corresponding plots:
<img src="/public/img/2015-01-27-comparing-adf-functions-in-r-flat0.png" alt="flat0" title="flat0" />
Notice the mean at the zero line and the absence of a trend.</p>
<p><img src="/public/img/2015-01-27-comparing-adf-functions-in-r-flat20.png" alt="flat20" title="flat20" />
Notice the mean at the 20 line and the absence of a trend.</p>
<p><img src="/public/img/2015-01-27-comparing-adf-functions-in-r-trend0.png" alt="trend0" title="trend0" />
Notice that the trending mean crosses the origin (0, 0) and the presence of a trend with slope 0.1.</p>
<p><img src="/public/img/2015-01-27-comparing-adf-functions-in-r-trend20.png" alt="trend20" title="trend20" />
Notice that the trending mean crosses the (0, 20) point and the presence of a trend with slope 0.1.</p>
<p>Next, I calculated ADF tests using both the <code>adf.test</code> function in the <em>tseries</em> package and the <code>adfTest</code> function in the <em>fUnitRoots</em> package. We must keep in mind the following few points:</p>
<ul>
<li><code>adf.test</code> in <em>tseries</em> always automatically detrends the given time series.</li>
<li><code>adfTest</code> in <em>fUnitRoots</em> has three different type options: <code>nc</code>, <code>c</code> and <code>ct</code>.</li>
</ul>
<p>From R’s documentation of the <code>adfTest</code> function:</p>
<blockquote>_type_: a character string describing the type of the unit root regression. Valid choices are "nc" for a regression with no intercept (constant) nor time trend, and "c" for a regression with an intercept (constant) but no time trend, "ct" for a regression with an intercept (constant) and a time trend. The default is "c".</blockquote>
<p>So, let’s run the tests.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">adf.test</span><span class="p">(</span><span class="n">flat0</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"stationary"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">adf.test</span><span class="p">(</span><span class="n">flat20</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"stationary"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">adf.test</span><span class="p">(</span><span class="n">trend0</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"stationary"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">adf.test</span><span class="p">(</span><span class="n">trend20</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"stationary"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">adfTest</span><span class="p">(</span><span class="n">flat0</span><span class="p">,</span><span class="w"> </span><span class="n">lags</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"nc"</span><span class="p">)</span><span class="w">
</span><span class="n">adfTest</span><span class="p">(</span><span class="n">flat20</span><span class="p">,</span><span class="w"> </span><span class="n">lags</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"nc"</span><span class="p">)</span><span class="w">
</span><span class="n">adfTest</span><span class="p">(</span><span class="n">trend0</span><span class="p">,</span><span class="w"> </span><span class="n">lags</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"nc"</span><span class="p">)</span><span class="w">
</span><span class="n">adfTest</span><span class="p">(</span><span class="n">trend20</span><span class="p">,</span><span class="w"> </span><span class="n">lags</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"nc"</span><span class="p">)</span><span class="w">
</span><span class="n">adfTest</span><span class="p">(</span><span class="n">flat0</span><span class="p">,</span><span class="w"> </span><span class="n">lags</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"c"</span><span class="p">)</span><span class="w">
</span><span class="n">adfTest</span><span class="p">(</span><span class="n">flat20</span><span class="p">,</span><span class="w"> </span><span class="n">lags</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"c"</span><span class="p">)</span><span class="w">
</span><span class="n">adfTest</span><span class="p">(</span><span class="n">trend0</span><span class="p">,</span><span class="w"> </span><span class="n">lags</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"c"</span><span class="p">)</span><span class="w">
</span><span class="n">adfTest</span><span class="p">(</span><span class="n">trend20</span><span class="p">,</span><span class="w"> </span><span class="n">lags</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"c"</span><span class="p">)</span><span class="w">
</span><span class="n">adfTest</span><span class="p">(</span><span class="n">flat0</span><span class="p">,</span><span class="w"> </span><span class="n">lags</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ct"</span><span class="p">)</span><span class="w">
</span><span class="n">adfTest</span><span class="p">(</span><span class="n">flat20</span><span class="p">,</span><span class="w"> </span><span class="n">lags</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ct"</span><span class="p">)</span><span class="w">
</span><span class="n">adfTest</span><span class="p">(</span><span class="n">trend0</span><span class="p">,</span><span class="w"> </span><span class="n">lags</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ct"</span><span class="p">)</span><span class="w">
</span><span class="n">adfTest</span><span class="p">(</span><span class="n">trend20</span><span class="p">,</span><span class="w"> </span><span class="n">lags</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ct"</span><span class="p">)</span></code></pre></figure>
<p>The following table contains the p-values produced by the corresponding ADF function. If the p-value is lower than 0.05 it is significant at the 95% confidence threshold.</p>
<table>
<tr>
<th>ADF function</th>
<th>flat0</th>
<th>flat20</th>
<th>trend0</th>
<th>trend20</th>
</tr>
<tr>
<td><code>adf.test(<series>, alternative = "stationary", k = 0)</code></td>
<td>< 0.01</td>
<td>< 0.01</td>
<td>< 0.01</td>
<td>< 0.01</td>
</tr>
<tr>
<td><code>adfTest(<series>, lags = 0, type = "nc")</code></td>
<td>< 0.01</td>
<td>0.5294</td>
<td>0.4274</td>
<td>0.7736</td>
</tr>
<tr>
<td><code>adfTest(<series>, lags = 0, type = "c")</code></td>
<td>< 0.01</td>
<td>< 0.01</td>
<td>0.1136</td>
<td>0.1136</td>
</tr>
<tr>
<td><code>adfTest(<series>, lags = 0, type = "ct")</code></td>
<td>< 0.01</td>
<td>< 0.01</td>
<td>< 0.01</td>
<td>< 0.01</td>
</tr>
</table>
<p>Voilà! We can now understand much better what the different functions and type parameters effectively do.</p>
<p>The first function <code>adf.test</code> always indicates significant stationarity - even if the time series is clearly trending. This is of course because it first detrends the time series, and thus the returned p-value does not distinguish between true stationarity and trend stationarity. Neither does the “absolute level”, i.e. the intercept at which the time series sits, seem to make any difference.</p>
<p>The fourth function <code>adfTest(<series>, lags = 0, type = "ct")</code> returns the same results as <code>adf.test</code>. In other words, it both detrends and handles a non-zero intercept. This makes it equally unsuited to distinguish between true stationarity and trend stationarity.</p>
<p>The second function <code>adfTest(<series>, lags = 0, type = "nc")</code> clearly distinguishes between each case: intercept vs. no intercept and trending vs. not trending. Only <em>flat0</em> passes the test of being stationary around a mean of 0 and having no trend. All others have either one or the other. We must think carefully here. Should we care whether the spreads of two cointegrated stocks have a non-zero intercept (as long as there is no trend)? As long as we are aware that the spreads will not revert to the zero line but to another mean equal to the intercept, everything is still fine. We are only interested in <em>whether</em> stocks revert to a mean, not in <em>where</em> this mean lies. Nevertheless, at some point we of course still need to know where this mean is in order to implement our trading strategy.</p>
<p>The third function <code>adfTest(<series>, lags = 0, type = "c")</code> is apparently best suited for the purpose of identifying pairs of stocks for pair trading. No matter whether our calculated spreads are centered around a mean (intercept) of 0 or not, it handles both situations. As soon as there is a trend, however, it no longer indicates stationarity. Notice the relatively low p-values, though. Although not significant at a 95% confidence threshold, if the trend slope were less extreme (e.g. 0.05 instead of 0.1), this test might actually pass.</p>
Cointegration and Total-Least-Squares Regression2015-01-27T00:00:00+01:00http://fabian-kostadinov.github.io/2015/01/27/cointegration-and-total-least-squares-regression<p>I just stumbled over a very nice <a href="http://quanttrader.info/public/betterHedgeRatios.pdf">article authored by Paul Teetor on the use of total least-squares-regression in contrast to ordinary-least-squares regression for cointegration tests</a>. <a href="http://www.cerebralmastication.com/2010/09/principal-component-analysis-pca-vs-ordinary-least-squares-ols-a-visual-explination/">This blog post</a> also explains the same topic.<span class="more"></span></p>
<p>Let’s quickly recall what we do when trying to find a working pairs trading strategy. First, we use one stock price time series to estimate another stock price time series. The linear regression model is <code>A = intercept + beta * B</code>, where A and B are both time series containing stock prices. From the regression estimation we retrieve the <em>hedge ratio</em> (or <em>beta</em> or regression <em>slope</em>) and an <em>intercept</em>. We can then compute the spreads <em>S</em> as <code>S = A - (intercept + beta * B)</code>. Finally, we use an augmented Dickey-Fuller (ADF) test to find out whether the spreads are stationary. If they are, this implies that both stocks are in fact cointegrated. Gaps between the two stock prices will not persist infinitely long, but close after some time.</p>
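<p>A minimal sketch of this recipe in R, assuming <code>StockA</code> and <code>StockB</code> are numeric price vectors of equal length (<code>adf.test</code> comes from the <em>tseries</em> package):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(tseries)

# OLS hedge ratio: A = intercept + beta * B
fit   <- lm(StockA ~ StockB)
alpha <- coef(fit)[1]
beta  <- coef(fit)[2]

# Spread; if it is stationary, A and B are cointegrated
spread <- StockA - (alpha + beta * StockB)
adf.test(spread)</code></pre></figure>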
<p>In R you can use the <code>lm</code> function to create a linear regression estimation: <code>lm(StockA ~ StockB)</code>, or <code>lm(StockA ~ StockB + 0)</code> if you want to explicitly set the intercept to 0. The <em>lm</em> function, however, relies on an ordinary-least-squares regression method, and both articles linked above do a nice job explaining why this is somewhat problematic. In short, we would like one hedge ratio to be the inverse of the other when we switch the dependent and the independent variable. However, this is not the case with ordinary-least-squares regression. This is from Teetor’s article:</p>
<figure class="highlight"><pre><code class="language-shell-session" data-lang="shell-session"><span class="go">OLS for A vs. B = -0.6583409
OLS for B vs. A = -1.03481
TLS for A vs. B = -0.7611657
TLS for B vs. A = -1.313774</span></code></pre></figure>
<p>“OLS” of course means “ordinary-least-squares regression” whereas “TLS” means “total-least-squares regression”. He goes on writing:</p>
<blockquote>The OLS hedge ratios are inconsistent because 1 / -0.658 = -1.520, not -1.035, which is a substantial difference. The TLS ratios, however, are consistent because 1 / -0.761 = -1.314.
</blockquote>
<p>The reason becomes immediately clear when you look at the pictures provided in the linked articles. In TLS regression the residuals are computed <em>orthogonal to the regression line</em>, whereas in ordinary-least-squares regression they are measured <em>along the axis of the dependent variable</em>: along stock A’s axis if A is the dependent and B the independent variable, or along stock B’s axis if B is the dependent variable. Using OLS regression, “long A and short B” is <em>not</em> the opposite of “long B and short A”. This only becomes the case if we use TLS regression.</p>
<p>But how then can we calculate a total-least-squares regression in R? Unfortunately, the <em>lm</em> function does not provide such functionality. Instead, Teetor suggests using R’s <em>principal component analysis</em> function <code>princomp</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">r <- princomp(~ StockA + StockB)
beta <- r$loadings[1, 1] / r$loadings[2, 1]
# We don't really need the intercept, but it's nice to display it anyway.
# Note: r$center[1] is the mean of StockA, r$center[2] the mean of StockB,
# so for the model StockA = intercept + beta * StockB:
intercept <- r$center[1] - beta * r$center[2]
spread <- StockA - beta * StockB</code></pre></figure>
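<p>The symmetry property is easy to verify. The following sketch is mine (the simulated price series are made up and not taken from Teetor’s paper); it checks that the TLS hedge ratio of A vs. B is the reciprocal of the one of B vs. A:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Simulate two related price series
set.seed(1)
StockB <- cumsum(rnorm(250)) + 50
StockA <- 0.8 * StockB + rnorm(250)

# TLS hedge ratio via principal component analysis, as above
tls.beta <- function(x, y) {
  r <- princomp(~ x + y)
  r$loadings[1, 1] / r$loadings[2, 1]
}

beta.ab <- tls.beta(StockA, StockB)  # hedge ratio for A vs. B
beta.ba <- tls.beta(StockB, StockA)  # hedge ratio for B vs. A
all.equal(1 / beta.ab, beta.ba)      # TRUE: the TLS ratios are mutually inverse</code></pre></figure>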
<p>The spreads may or may not differ much from those calculated through an OLS regression model. There are a few more details in Teetor’s paper, and I really recommend reading through it. In case you want to understand <a href="http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf">how principal component analysis actually works, there is another good article written by Lindsay Smith</a>.</p>
How To Find A GitHub Team ID2015-01-16T00:00:00+01:00http://fabian-kostadinov.github.io/2015/01/16/how-to-find-a-github-team-id<p>In an earlier post I explained <a href="/2014/11/13/installation-of-jekyll-auth/">how to install Jekyll-Auth</a>. In GitHub, every team (and organization and user) receives a six- or seven-digit integer ID like <code>1234567</code>. There are cases where you might need access to this information, for instance during the installation of <a href="https://github.com/benbalter/jekyll-auth/">Jekyll-Auth</a>. Unfortunately, there is no easy way to find out a team’s ID; I could not find it published anywhere on the official GitHub website. You can however access this information through <a href="https://developer.github.com/v3/">the GitHub API</a>.<span class="more"></span></p>
<p>The first thing you need to do is to make sure you have <a href="http://curl.haxx.se/">the <em>cURL</em> tool</a> installed. What is curl? To cite the website:</p>
<blockquote>curl is a command line tool and library for transferring data with URL syntax, supporting [...] HTTP [...]</blockquote>
<p>We will use cURL to send simple HTTP commands to the GitHub API. The response is an HTTP header followed by the requested information in JSON format.</p>
<p>Once you have cURL installed, you first need to make sure that you are actually a member of the organization and team you want to find the ID for.</p>
<p>Next, you need to create a personal access token. The GitHub API in most cases serves protected information, so we need to authorize first. When logged in to your GitHub account, go to -> <em>Settings</em> -> <em>Applications</em> -> (Personal access tokens) <em>Generate new token</em>. Provide a name, select the <code>read:org</code> scope and unselect the others. Click <em>Generate token</em> and copy the token to your clipboard.</p>
<p>Open a shell and type the following command, replacing <code><very-long-access-token></code> with the actual access token and <code><org-name></code> with the organization’s name. On Windows I had to use double quotes (“), but on Unix/Linux systems single quotes (‘) should work as well.</p>
<figure class="highlight"><pre><code class="language-shell-session" data-lang="shell-session"><span class="gp">curl -H "Authorization: token <very-long-access-token></span><span class="s2">" https://api.github.com/orgs/<org-name>/teams
</span><span class="go">
[
{
"name": "Owners",
"id": 123456,
"slug": "owners",
"description": null,
"permission": "admin",
"url": "https://api.github.com/teams/123456",
"members_url": "https://api.github.com/teams/123456/members{/member}",
"repositories_url": "https://api.github.com/teams/123456/repos"
},
{
"name": "my-team",
"id": 1234567,
"slug": "my-team",
"description": "This is the team with the ID I am looking for",
"permission": "push",
"url": "https://api.github.com/teams/1234567",
"members_url": "https://api.github.com/teams/1234567/members{/member}",
"repositories_url": "https://api.github.com/teams/1234567/repos"
},
{
"name": "Members",
"id": 234567,
"slug": "members",
"description": null,
"permission": "push",
"url": "https://api.github.com/teams/234567",
"members_url": "https://api.github.com/teams/234567/members{/member}",
"repositories_url": "https://api.github.com/teams/234567/repos"
}
]</span></code></pre></figure>
<p>Of course your output might differ a little bit; just look for the ID <code>"id": 1234567</code> of <em>my-team</em>.</p>
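<p>If you prefer not to scan the JSON by eye, here is a small optional sketch in R that performs the same lookup programmatically. It assumes the <em>httr</em> and <em>jsonlite</em> packages are installed; token, organization and team slug are placeholders exactly as above:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(httr)
library(jsonlite)

token <- "<very-long-access-token>"  # placeholder
org   <- "<org-name>"                # placeholder

# Same request as the curl call above, with the token in the Authorization header
res   <- GET(paste0("https://api.github.com/orgs/", org, "/teams"),
             add_headers(Authorization = paste("token", token)))
teams <- fromJSON(content(res, as = "text"))

# Pick out the ID of the team by its slug
teams$id[teams$slug == "my-team"]</code></pre></figure>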
Evolving Trading Strategies With Genetic Programming - Punishing Complexity2015-01-14T00:00:00+01:00http://fabian-kostadinov.github.io/2015/01/14/evolving-trading-strategies-with-genetic-programming-punishing-complexity<h1 id="part-6">Part 6</h1>
<p>One of the most poorly understood and yet most important concepts of genetic programming (GP) is <em>parsimony pressure</em>. It has long been demonstrated that for every statistical time series a function can be invented that matches the observed values in the given time frame arbitrarily well, provided the function is complex enough. Yet such a function is effectively worthless. As soon as new observations are added, or as soon as predictions must be made for values lying outside the observed time frame, the function fails terribly to deliver any meaningful result. I am of course talking about the problem of <em>overfitting</em>.<span class="more"></span></p>
<p>Just take a second and think about it. Which do you consider the more powerful statistical tool: (multiple) linear regression analysis or (multiple) “non-linear” regression analysis, e.g. based on polynomials? Many beginners probably believe that non-linear regressions should be more powerful than simple linear ones. Theoretically, this is true. But in practice there are at least two grave problems. First, it is rarely the case that you have a clear idea of the type of relation between the independent and the dependent variable(s). Second, allowing anything other than a linear combination of independent variables can easily lead to overfitting, effectively reducing the predictive power of your regression formula.</p>
<p>When fitting a function to a set of observations, the goal thus must be to create a <em>robust</em> function. In most cases such a robust function does not fit the given data perfectly, but it fits “well enough”. Of course, making this judgement is difficult. But this is the case with every attempt to create a robust trading strategy, not just when using GP. Nevertheless, GP aggravates the problem of overfitting. The more elements a function consists of - that is, the more complex it is - the more complicated the time series it can fit. In GP, trading strategies fitting the training data better receive a higher fitness value. The consequence is that more complex functions are naturally preferred by GP if no counter-measures are taken. These more complex functions in turn have a higher chance to create offspring, of which again the most complex ones have the highest chance to fit the historical data. The result is <em>code bloat</em>, in other words highly complex functions overfitting historical data.</p>
<p>Thus, complexity must be punished. In GP, counter-measures taken to prevent overfitting are called <em>parsimony pressure</em>. To the best of my knowledge there are basically only two approaches to applying parsimony pressure:</p>
<ol>
<li>Giving a lower fitness value to more complex functions.</li>
<li>Preventing overly complex offspring from being bred.</li>
</ol>
<p>Before we look into each one, it should be understood that no matter which parsimony pressure techniques we apply, it is ultimately the designer’s decision what constitutes a robust trading strategy. There are no clear-cut guidelines, no simple measures stating exactly when a function overfits historical data. For this reason it is important to use out-of-sample data and to forward-test the strategy on live, never-before-seen data before starting to trade with real money.</p>
<h3 id="1-lowering-the-fitness-value-of-complex-functions">1. Lowering the fitness value of complex functions</h3>
<p>A rather obvious idea is to punish complexity by giving a lower fitness value to more complex functions. For instance, one might measure the number of nodes or the maximum depth of an evolved GP decision tree and divide something like the expected value by this measure. A higher number of nodes or a larger tree depth then naturally leads to a lower fitness value. As demonstrated <a href="/2014/12/22/evolving-trading-strategies-with-genetic-programming-fitness-functions/">in my last article</a>, if the numerator and the denominator are very different from each other, changes in one might dominate changes in the other. This must be taken into account. One might question whether measuring tree depth or tree size is actually an appropriate proxy for the complexity of a trading strategy, but it is difficult to come up with a better idea. Unfortunately, both measures require a full traversal of the GP tree, which significantly slows down the evolutionary process.<br />
The real disadvantage, though, is that in my experience this parsimony pressure technique does not perform too well. I often still observed code bloat despite my attempts. In contrast, the next type of parsimony pressure technique gave me much better results.</p>
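<p>As a purely hypothetical illustration (the measure, the scaling and the constant are mine and not a recommendation), such a penalty could be wired into the fitness computation like this:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Scale a raw fitness value down as the GP tree grows
adjusted.fitness <- function(raw.fitness, n.nodes, penalty = 0.01) {
  raw.fitness / (1 + penalty * n.nodes)
}

adjusted.fitness(raw.fitness = 1.8, n.nodes = 15)   # small tree: mild penalty
adjusted.fitness(raw.fitness = 1.8, n.nodes = 150)  # large tree: heavy penalty</code></pre></figure>
<p>Note that choosing the penalty constant suffers from exactly the dominance problem described above: set too low it changes nothing, set too high it overwhelms the raw fitness.</p>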
<h3 id="2-preventing-complex-offspring-to-be-bred">2. Preventing complex offspring from being bred</h3>
<p>Another approach to parsimony pressure is to mechanically prevent the creation of offspring beyond a maximum number of allowed nodes. Typically, if a child is created whose number of nodes (or maybe maximum node depth) exceeds the allowed maximum, that individual is rejected and another one is created. ECJ, to give an example, lets you specify the number of attempts to be taken. Here, complexity is not punished by assigning a lower fitness value (although this could be implemented additionally). Instead, the means used are “mechanical”: it is simply not possible to create offspring exceeding a certain node size. The exact details of this approach are beyond this article’s intended scope, and I really recommend taking a closer look at an implementation like ECJ. This approach worked pretty well for me, and I believe it to be superior in its results to the first one. A runnable toy illustration follows below.</p>
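<p>Here is the promised toy illustration. It is entirely my own sketch (the tree representation, the numbers and the fallback policy are made up, and ECJ’s real operators and parameters differ); it only demonstrates the mechanism of rejecting and re-breeding oversized offspring:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Trees as nested lists: a leaf is the string "x", an inner node is a list of two subtrees
count.nodes <- function(tree) {
  if (!is.list(tree)) return(1)
  1 + sum(vapply(tree, count.nodes, numeric(1)))
}

# Stand-in for a real breeding operator: grows a random tree
random.subtree <- function(depth) {
  if (depth == 0 || runif(1) < 0.3) return("x")
  list(random.subtree(depth - 1), random.subtree(depth - 1))
}

# Reject offspring above max.nodes and retry, up to max.attempts times
breed.with.limit <- function(max.nodes = 17, max.attempts = 10) {
  for (i in seq_len(max.attempts)) {
    child <- random.subtree(depth = 5)
    if (count.nodes(child) <= max.nodes) return(child)
  }
  NULL  # all attempts failed; a real system might copy a parent unchanged instead
}

set.seed(7)
child <- breed.with.limit()
if (!is.null(child)) count.nodes(child)  # never exceeds 17</code></pre></figure>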
<h2 id="overfitting-beyond-code-bloat">Overfitting beyond code bloat</h2>
<p>There is yet another problem. Even if we manage to control code bloat, another form of overfitting often arises - if we want to call it overfitting at all. Assuming we already control code bloat, it still happens quite often that GP evolves a trading strategy that looks quite promising on historical in-sample data but simply falls short on out-of-sample data. That does not necessarily mean the evolved strategy suffers from code bloat and is highly complex. On the contrary, it might even be quite simple. The overfitting we observe is of a different kind. Apparently, the function (trading strategy) evolved on past data is simply of no use for out-of-sample data. However, as this is not the effect of code bloat, the only conclusion we are left with is that what worked in the in-sample period does not work in the out-of-sample period. This is a frustrating situation for which I have no solution. Running predictability tests might give a clue whether the markets are supposedly predictable, but this is still no guarantee. Markets do change, and it is a common phenomenon that trading strategies that worked in the past no longer work today. Maybe at some point in the future they will start to work again. In such a situation it is best to accept the state of affairs and move on to another promising market.</p>
<hr />
<h2 id="other-articles">Other articles</h2>
<h3 id="previous">Previous</h3>
<ul>
<li>Part 5: <a href="/2014/12/22/evolving-trading-strategies-with-genetic-programming-fitness-functions/">Evolving Trading Strategies With Genetic Programming - GP Parameters and Operators</a></li>
</ul>
<h3 id="next">Next</h3>
<ul>
<li>Part 7</li>
</ul>
<hr />
<h1 id="references">References</h1>
Reaction to the Conspiracy Theorists in the Case of Charlie Hebdo2015-01-10T00:00:00+01:00http://fabian-kostadinov.github.io/2015/01/10/reaktion-auf-die-verschworungstheoretiker-im-falle-charlie-hebdos<p>And once again they are writing - the conspiracy theorists. This time <a href="http://alles-schallundrauch.blogspot.de/2015/01/die-offizielle-charlie-hebdo-story.html">in the case of Charlie Hebdo</a>. The policeman who was shot is supposedly not dead at all. As “proof” some obscure video recording is presented. There are many reasons not to look into this any further. Here is a reaction nonetheless - out of indignation, one may say.<span class="more"></span></p>
<p>The conspiracy theorist is really an anti-enlightener. In place of God, who directs everything, he puts dark human forces that supposedly stand behind everything. In doing so, however, he contradicts the scientific (and thus enlightened) way of thinking, which postulates that in the normal case the simplest explanation is also the most plausible one. For the conspiracy theorist considers the simplest explanation to be propaganda of the very dark forces he has postulated beforehand. He then interprets every attempt to refute him as proof of the correctness of his own theses. He has thereby become immune to what the enlighteners called “reason”.</p>
<p>Fascinatingly, the conspiracy theorist thus shares, at least in part, the position of those who strive to transform the free and enlightened society back into an unfree and pre-rational one. What both have in common is that they turn against the Enlightenment - the one group by force, the other at least intellectually. One difference between the two groups, however, is that the first group (for example the terrorists or IS) acts rather crudely: whoever does not want to believe is simply eliminated.</p>
<p>The second group employs a far more refined manoeuvre. It dresses its arguments in a pseudo-scientific garb that can afterwards be refuted scientifically only with great difficulty and utmost effort. Whoever does not want to believe must undertake enormous efforts to disprove what the conspiracy theorist has claimed.
Incidentally, the Nazis operated with the same means of conspiracy theory when they tried to vilify “the Jew” in the eyes of the rest of the population. Which ordinary citizen back then had the means to refute those wild yet subtle claims?</p>
<p>Fascinatingly, the conspiracy theorist offers no solution whatsoever. He never says which of his claims should actually entail which concrete actions. In the case of France, for example, what concretely would have to be done beyond merely abolishing the intelligence services?</p>
<p>The - simple, but not easy to accept - truth is: the <em>Big Other</em> does not exist. He is nowhere out there, and the power we believe Him to have is the power we have split off from ourselves.<br />
Possibly, or possibly not, there is a close connection to the Oedipus complex here. The conspiracy theorist rebels against the prohibition to know, just as the child is not allowed to know what actually happens behind the closed doors of the parental bedroom. The parents, perceived as all-powerful beings with incomprehensible abilities, do things of which the child has only a vague notion. The boy’s experience of powerlessness towards his own father, or equally the girl’s towards her own mother, forms the scaffolding for the later projection onto an actually existing or merely perceived authority. The rebellion is a rebellion against the oedipal prohibition.<br />
Yet regardless of how intensely or how often the conspiracy theorist rebels against those dark forces, in the end it is of no use. Since the early conspiracy theories about the Illuminati essentially nothing has changed; only the medium for spreading the theories has become global. The contents are ultimately the same. The power the dark forces supposedly hold is a mere projection. The Wall Street financial firms or the intelligence services may equally serve as examples: on closer inspection, both operate in a rather amateurish fashion. Anyone who has ever looked behind the facade of the marble buildings and automatic doors knows from personal experience the degree of dilettantism that reigns in the executive floors of supposed world rulers.</p>
<p>Part of growing up is to question and finally abandon the notion that one’s own parents had any idea what they were actually doing, or how they were supposed to raise their own children. Part of it, too, is finally letting go of the shame that comes with recognizing one’s parents’ helplessness. If the parents are not all-powerful beings after all, then in the end they are simply human. But the conspiracy theorists do not want to grow up. They do not want to take responsibility for the cheerful and at times cruel chaos that reigns in the world and in which they participate daily. The conspiracy theorist ultimately refuses to become an adult. He does not want to take back the shadow he projects. He does not want to acknowledge the powerlessness he feels as his own repressed helplessness. In place of self-knowledge, the conspiracy theorist puts projection. Ironically, he thus diligently helps to keep the dark in the dark. It is precisely the truth of the light of the Enlightenment, which he claims for himself, that he shuns.</p>
<p>Interesting on this topic: <a href="http://www.lacan.com/zizekother.htm"><em>The Big Other Doesn’t Exist</em> - Slavoj Zizek</a>.</p>
Problems Using Syntax Highlighting with Pygments on Windows2015-01-04T00:00:00+01:00http://fabian-kostadinov.github.io/2015/01/04/problems-using-syntax-highlighting-with-pygments-on-windows<p>I wanted to turn on code syntax highlighting using rouge for my blog by adding the following line to my <em>_config.yml</em> file.</p>
<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">highlighter</span><span class="pi">:</span> <span class="s">rouge</span></code></pre></figure>
<p>Whereas this worked perfectly on my local Windows machine, I ran into problems with Jekyll on GitHub.<span class="more"></span> After searching for some time, I found <a href="http://www.codeproject.com/Articles/809846/Blogging-on-GitHub">this article</a> from August 2014, which stated:</p>
<blockquote>
<strong>Update: As of August 1, commiting a <em>_config.yml</em> that uses rouge now causes "Page build failure" on GitHub with a misleading error message like "The file <em>_posts/2014-08-01-blah.md</em> contains syntax errors."</strong> Before you commit & push, you must set highlighter: pygments in _config.yml, even if you don't care to install pygments locally.
</blockquote>
<p>It’s now January 2015, and this was exactly the error message I received. So I figured there was/is a problem with rouge running on GitHub’s Jekyll and decided to use pygments instead. My <em>_config.yml</em> file:</p>
<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">highlighter</span><span class="pi">:</span> <span class="s">pygments</span></code></pre></figure>
<p>Whereas the change went through smoothly on GitHub’s Jekyll, I now ran into problems locally. Whenever I tried to <code>jekyll build</code> my site, I received this error message:</p>
<figure class="highlight"><pre><code class="language-shell-session" data-lang="shell-session"><span class="go">Liquid Exception: No such file or directory - python2 C:/Ruby200-x64/lib/ruby/gems/2.0.0/gems/pygments.rb-0.6.0/lib/pygments/mentos.py in _posts/2000-01-01-my-oldest-post.md</span></code></pre></figure>
<p>I did not know what this meant, and it took me some hours to figure it out. The problem is that for some reason pygments is looking for a file named <em>python2.exe</em> and cannot find it. I am using an Enthought Canopy python distribution, and what I <em>do</em> have is a file <code>C:\Enthought-Canopy\User\python.exe</code> as well as <code>C:\Enthought-Canopy\User\Scripts\python.exe</code>. First, I had to make sure that both directories containing these files were in my <code>System PATH</code> variable - not to be confused with the <code>User PATH</code> variable. (Also, don’t forget to close and reopen the Windows command prompt afterwards, otherwise your changes are not effective.) Then <a href="http://stackoverflow.com/questions/17364028/jekyll-on-windows-pygments-not-working">I had to make a simple copy of python.exe and name it python2.exe</a>. Alternatively, a hard link named python2.exe can be created inside the same directory where python.exe exists, as the following commands show.</p>
<figure class="highlight"><pre><code class="language-shell-session" data-lang="shell-session"><span class="go">cd C:\Enthought-Canopy\User
mklink /H python2.exe python.exe
cd C:\Enthought-Canopy\User\Scripts
mklink /H python2.exe python.exe</span></code></pre></figure>
Cointegration and the Role of the Intercept2015-01-04T00:00:00+01:00http://fabian-kostadinov.github.io/2015/01/04/cointegration-and-the-role-of-the-intercept<p>Yesterday I wanted to find out whether a pair of stocks would be suitable for pair trading. There is a <a href="http://quanttrader.info/public/testForCoint.html">tutorial by Paul Teetor on how to test a pair of securities for cointegration</a>. Basically, we use an OLS linear regression model to estimate the absolute prices of one security from the other’s prices. If the residuals, i.e. the spreads, are stationary, then we can conclude that both time series are mean-reverting. This implies that both securities are cointegrated. However, one thing remained unclear to me after first going through Teetor’s tutorial: the role of the intercept in the regression model.<span class="more"></span> In the tutorial, when estimating the regression model, Teetor explicitly sets the intercept to 0 without further explaining why and what the implications are. As it happened, the two stocks I had selected are a great example to study the effect this choice has.</p>
<p>The following is some R code. First, I loaded the stock data using the quantmod package and a few others. Looking at a ten-year time frame is a questionable decision if the goal is to create a pairs trading strategy, but in this example I’ll do it anyway.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">xts</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tseries</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">quantmod</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">fUnitRoots</span><span class="p">)</span><span class="w">
</span><span class="n">from.dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"01/06/2004"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="o">=</span><span class="s2">"%d/%m/%Y"</span><span class="p">)</span><span class="w">
</span><span class="n">to.dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"03/01/2015"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="o">=</span><span class="s2">"%d/%m/%Y"</span><span class="p">)</span><span class="w">
</span><span class="n">getSymbols</span><span class="p">(</span><span class="s2">"StockA"</span><span class="p">,</span><span class="w"> </span><span class="n">src</span><span class="o">=</span><span class="s2">"yahoo"</span><span class="p">,</span><span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">from.dat</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">to.dat</span><span class="p">)</span><span class="w">
</span><span class="n">getSymbols</span><span class="p">(</span><span class="s2">"StockB"</span><span class="p">,</span><span class="w"> </span><span class="n">src</span><span class="o">=</span><span class="s2">"yahoo"</span><span class="p">,</span><span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">from.dat</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">to.dat</span><span class="p">)</span><span class="w">
</span><span class="c1"># Let's create some charts of the adjusted close series</span><span class="w">
</span><span class="n">plot.zoo</span><span class="p">(</span><span class="n">merge</span><span class="p">(</span><span class="n">Ad</span><span class="p">(</span><span class="n">StockA</span><span class="p">),</span><span class="w"> </span><span class="n">Ad</span><span class="p">(</span><span class="n">StockB</span><span class="p">)),</span><span class="w"> </span><span class="n">plot.type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"single"</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">),</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Year"</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Stock Price"</span><span class="p">)</span></code></pre></figure>
<p>This outputs the following chart:
<img src="/public/img/2015-01-04-prices-of-stock-a-and-stock-b.png" alt="Prices of Stocks A and B" title="Prices of Stocks A and B" />
Obviously, the stocks are correlated. Whether or not they are cointegrated remains to be decided. Interestingly, over the years the blue stock outperforms the red stock slowly but steadily. This becomes clear if we plot the price differences between the two stocks.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">tmp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="n">Ad</span><span class="p">(</span><span class="n">StockA</span><span class="p">),</span><span class="w"> </span><span class="n">Ad</span><span class="p">(</span><span class="n">StockB</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">tmp</span><span class="p">[,</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">tmp</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Price differences Stock A - Stock B"</span><span class="p">)</span></code></pre></figure>
<p><img src="/public/img/2015-01-04-price-differences-stock-a-stock-b.png" alt="Price Differences of Stock A and B" title="Price Differences of Stock A and B" />
Let’s use a <code>lm</code> regression model to calculate the spreads. Be aware that we explicitly set the intercept to 0.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">Ad</span><span class="p">(</span><span class="n">StockA</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Ad</span><span class="p">(</span><span class="n">StockB</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">m</span><span class="o">$</span><span class="n">residuals</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Residuals"</span><span class="p">)</span><span class="w">
</span><span class="c1"># The following is not really needed but makes the point clear</span><span class="w">
</span><span class="n">intercept</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">hedgeRatio</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">m</span><span class="p">)[</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="c1"># sprd contains the same values as m$residuals</span><span class="w">
</span><span class="n">sprd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Ad</span><span class="p">(</span><span class="n">StockA</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">hedgeRatio</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">Ad</span><span class="p">(</span><span class="n">StockB</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">intercept</span></code></pre></figure>
<p><img src="/public/img/2015-01-04-residuals-with-zero-intercept.png" alt="Residuals with zero intercept" title="Residuals with zero intercept" />
The residuals clearly show a long-term negative trend. Using stock B to estimate stock A with a zero intercept therefore systematically overestimates in the time before 2009 and systematically underestimates in the time after 2009. In fact, the residual plot appears to be the same, or nearly the same, as the price differences plot above. Why is this so? Let’s recall what the regression formula effectively means:</p>
<p><code>Price A<sub>i</sub> = intercept + hedgeRatio * Price B<sub>i</sub> + Error<sub>i</sub></code></p>
<p>We forced the <em>intercept</em> to be 0. Therefore, <em>Price A</em> is estimated as a simple multiple of <em>Price B</em> - the <em>hedgeRatio</em> (or <em>beta</em>) - plus an error term (the residual). The sum of the residuals is necessarily 0 in an OLS regression, so the residual plot is centered around the 0 line. Furthermore, there is only a single hedge ratio value for all observations. Stock B slightly but consistently outperforms stock A over the years, and hence the residuals show the observed bias.</p>
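<p>The effect is easy to reproduce with simulated data. The following sketch is mine (the two series are invented and are not the stocks discussed here): when the data-generating process contains a true intercept, forcing the intercept to 0 pushes the missing constant into the slope and biases the fit.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(123)
StockB.sim <- cumsum(rnorm(500, sd = 0.5)) + 40   # a random-walk "price" series
StockA.sim <- 25 + 0.6 * StockB.sim + rnorm(500)  # true intercept 25, true beta 0.6

m0 <- lm(StockA.sim ~ StockB.sim + 0)  # intercept forced to 0
m1 <- lm(StockA.sim ~ StockB.sim)      # intercept estimated

coef(m0)  # single slope well above 0.6: it absorbs the missing intercept
coef(m1)  # intercept close to 25, slope close to 0.6</code></pre></figure>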
<p>Let’s introduce a non-zero intercept.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">Ad</span><span class="p">(</span><span class="n">StockA</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Ad</span><span class="p">(</span><span class="n">StockB</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">m</span><span class="o">$</span><span class="n">residuals</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Residuals"</span><span class="p">)</span><span class="w">
</span><span class="c1"># The following is not really needed but makes the point clear</span><span class="w">
</span><span class="n">intercept</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">m</span><span class="p">)[</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">hedgeRatio</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">m</span><span class="p">)[</span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="c1"># sprd contains the same values as m$residuals</span><span class="w">
</span><span class="n">sprd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Ad</span><span class="p">(</span><span class="n">StockA</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">hedgeRatio</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">Ad</span><span class="p">(</span><span class="n">StockB</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">intercept</span></code></pre></figure>
<p><img src="/public/img/2015-01-04-residuals-with-non-zero-intercept.png" alt="Residuals with non-zero intercept" title="Residuals with non-zero intercept" />
Now the residuals’ bias has disappeared. The intercept therefore “closes the gap” of the continuous outperformance of stock B over stock A, so to speak.</p>
<p>Let’s go on. Are the two time series cointegrated or not? We use an augmented Dickey-Fuller (ADF) test to decide. The <code>adf.test</code> function from the <em>tseries</em> package is quite handy.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">adf.test</span><span class="p">(</span><span class="n">m</span><span class="o">$</span><span class="n">residuals</span><span class="p">,</span><span class="w"> </span><span class="n">alternative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"stationary"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span></code></pre></figure>
<p>Be aware that adf.test essentially detrends your input data! This is okay for the non-zero-intercept situation, where our spreads are more or less trendless. But is it also okay for the zero-intercept situation, where the spreads show a long-term negative trend? I would say this depends on your goal. If you are implementing a short-term pair trading strategy and the spreads are trending over years, you might still be okay with a detrended ADF test. But if your time horizon is the long term, then you cannot simply ignore the trend in the spreads. In the latter case the two time series are <em>not</em> mean-reverting!</p>
<p>Here’s the output.</p>
<table>
<tr>
<th></th>
<th>Zero Intercept</th>
<th>Non-zero Intercept</th>
</tr>
<tr>
<td>Intercept</td>
<td>0</td>
<td>28.75452</td>
</tr>
<tr>
<td>Hedge Ratio</td>
<td>1.017733</td>
<td>0.5772449</td>
</tr>
<tr>
<td>adf.test</td>
<td>
<code>Augmented Dickey-Fuller Test<br />
data: sprd<br />
Dickey-Fuller = -6.8646, Lag order = 0, p-value = 0.01<br />
alternative hypothesis: stationary</code>
</td>
<td>
<code>Augmented Dickey-Fuller Test<br />
data: sprd<br />
Dickey-Fuller = -4.8857, Lag order = 0, p-value = 0.01<br />
alternative hypothesis: stationary</code>
</td>
</tr>
</table>
<p>In both situations the Dickey-Fuller statistic is much lower than the 5% critical value of -2.86 (regression model without trend) or -3.41 (regression model with trend), and thus the <em>p-value</em> is below 0.05. We can therefore reject the null hypothesis of a unit root at the 5% significance level and conclude that the two time series are indeed cointegrated (in the short term).</p>
<hr />
<h1 id="appendix">Appendix</h1>
<p>If you are not okay with the <code>adf.test</code> function detrending the input data first, there is an alternative in R. Luckily, there is another ADF test function in the <em>fUnitRoots</em> package called <code>adfTest</code> (mind the spelling). Unlike <code>adf.test</code>, it lets you specify whether the underlying regression model has an intercept and whether it includes a trend term. This short Youtube video explains the situation.</p>
<iframe width="560" height="315" src="//www.youtube.com/embed/jWI_AJKLyKQ" frameborder="0" allowfullscreen=""></iframe>
<p>From the video:</p>
<blockquote>When we take [the] log of any financial series, the trend of that series vanishes. So when we do ADF on log of prices then only intercept should be included [but] not trend. But if we do the ADF at level (original prices, not log) then intercept and trend should be considered at level. But for first differences only intercept should be considered.
</blockquote>
<p>Another <a href="http://stats.stackexchange.com/questions/44647/which-dickey-fuller-test-should-i-apply-to-a-time-series-with-an-underlying-mode">good explanation can be found in this discussion thread</a>.</p>
<p>Stock prices nearly always have a long-term trend; they are not stationary. Furthermore, their levels (original prices) are always far from 0, which is why an intercept is required in the ADF test.</p>
<p>However, remember that we were not looking at stock prices before. Instead, we were looking at residuals, i.e. spreads between two price series. If the residuals are stationary, then we know that the spread is mean-reverting, which indicates that the two series are cointegrated and a good candidate for a pair trading strategy. If the spreads have a clearly visible trend, then they are already guaranteed not to be stationary. Hence:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">adfTest</span><span class="p">(</span><span class="n">m</span><span class="o">$</span><span class="n">residuals</span><span class="p">,</span><span class="w"> </span><span class="n">lags</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"c"</span><span class="p">)</span></code></pre></figure>
Evolving Trading Strategies With Genetic Programming - Fitness Functions2014-12-22T00:00:00+01:00http://fabian-kostadinov.github.io/2014/12/22/evolving-trading-strategies-with-genetic-programming-fitness-functions<h1 id="part-5">Part 5</h1>
<p>At the core of every genetic programming (GP) strategy is the <em>fitness function</em>. The fitness function specifies what the whole evolutionary process is looking for. Every individual is assigned a <em>fitness value</em>, which is computed by the fitness function. Individuals with a high fitness value stand a higher chance to be selected for reproduction and thus to create offspring. Finding a “good” fitness function is one of the most important design aspects of the development process. It is rarely the case that the first idea for a fitness function already produces great results, and defining one requires quite a deep understanding of the problem domain.<span class="more"></span> The following list contains a few necessary design decisions:</p>
<ul>
<li>Minimizing vs. maximizing fitness values</li>
<li>Single-objective vs. multi-objective</li>
<li>Normalization of fitness values</li>
<li>Assigning weights to individual components of the fitness function</li>
</ul>
<p>Let’s take a closer look at each necessary decision.</p>
<h2 id="maximization-vs-minimization">Maximization vs. minimization</h2>
<p>Fitness functions can of course maximize certain target measures, or they can minimize them. Typical target measures to maximize are <em>total return</em>, <em>expected value</em>, <em>average size of winning trades</em> or <em>hit rate</em>. Typical target measures to minimize are <em>maximum drawdown</em>, <em>maximum number of consecutive losing trades</em>, or the <em>equity curve’s volatility</em>. A common situation is to maximize some and minimize other fitness values at the same time.</p>
<h2 id="single-objective-vs-multi-objective">Single-objective vs. multi-objective</h2>
<p>A fitness function with a single objective tries to maximize (or minimize) a single fitness value. This fitness value can possibly be the result of a mathematical formula combining multiple individual components, for example <div class="message">Maximize the total return divided by the maximum drawdown.</div> The higher the <em>total return</em> in this formula, the higher the fitness value. The relationship between fitness value and <em>maximum drawdown</em> is inverse: the lower the <em>maximum drawdown</em>, the higher the fitness value. Of course the formula can lead to negative fitness values if <em>total return</em> is a negative number. It is also presumed that <em>maximum drawdown</em> can only take positive values. In other words, the components of a fitness value are typically valid only in a predefined range, and as a consequence the fitness value itself is also valid only in a predefined range. For example, if <em>total return</em> is valid in the range <em>(neg. infinity, pos. infinity)</em> and <em>maximum drawdown</em> in the range <em>(0, pos. infinity)</em>, the fitness values are automatically in the range <em>(neg. infinity, pos. infinity)</em>. Of course the computer does not really understand the concept of infinity, and indeed situations might occur where it is necessary to protect against overflows.</p>
<p>In the sample fitness function above, although possible, it would be a bad idea to use negative values for the <em>maximum drawdown</em> measure (arguing that a drawdown can be interpreted as a negative return), as this would be quite confusing.</p>
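<p>As a hypothetical sketch (the equity curve is simulated and the two measures are computed naively, with no claim that these definitions match any particular backtesting library), such a combined fitness value could be computed like this:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(99)
equity <- 10000 + cumsum(rnorm(252, mean = 5, sd = 50))  # one year of daily equity values

total.return <- equity[length(equity)] - equity[1]
max.drawdown <- max(cummax(equity) - equity)  # largest peak-to-trough drop, always >= 0

fitness <- total.return / max.drawdown
fitness</code></pre></figure>
<p>In real code one would additionally have to guard against a maximum drawdown of 0, which is exactly the kind of overflow protection mentioned above.</p>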
<p>In this article I will continue to refer to such “combined fitness functions” as single-objective, because the final fitness value is a single value. In contrast, multi-objective fitness functions do not try to aggregate multiple target measures into a single fitness value, but directly work on the multi-dimensional search space without reducing the number of dimensions. Multi-objective fitness functions for GP can be quite complicated to implement, yet from my experience they can actually lead to superior results compared with single-objective ones. I already mentioned two such fitness functions in an earlier article: the <a href="http://www.cleveralgorithms.com/nature-inspired/evolution/nsga.html">non-dominated sorting genetic algorithm</a> (NSGA) and the <a href="http://www.cleveralgorithms.com/nature-inspired/evolution/spea.html">strength pareto evolutionary algorithm</a> (SPEA) both of which I consider to be very powerful. These algorithms actually operate directly in a multi-dimensional “fitness landscape” without reducing the landscape into a one-dimensional “fitness number ray” as combined single-objective fitness functions do. The reader should also be aware that both algorithms by their design predetermine the GP selection and mutation operators. Working implementations for both algorithms can actually be found in the <a href="http://cs.gmu.edu/~eclab/projects/ecj/">ECJ library</a>.</p>
<h2 id="normalization-of-fitness-values">Normalization of fitness values</h2>
<p>Sometimes it is necessary to normalize fitness values to a predetermined range. This might for example be the case if a few outliers exist that are very far from the other fitness values. If a <em>fitness proportional selection operator</em> is applied, these outliers could easily dominate all others, which is rarely desired. An alternative is to use <em>rank-based fitness values</em>, as they do not suffer from this problem.</p>
<p>Another problem is the relative difference in sizes of the involved fitness value components. Consider the following fitness function:</p>
<div class="message">Maximize the total return divided by the average loss.</div>
<p>In a situation where the value of the divisor is significantly greater or smaller than the dividend (as in the rule <em>maximize total return divided by average loss</em>), changes in one measure will always dominate changes in the other. If <em>total return</em> is 50’000$, then a 2% change amounts to +/- 1000$. Assuming the average loss in the observed trading system is 250$, a 2% change is +/- 5$. Because +/- 1000$ always by far exceeds +/- 5$, the numerator always dominates the denominator in this example. Even a quite dramatic halving of the average loss from 250$ to 125$ (a reduction of 125$ or 50%) has much less impact than changes in the total return.</p>
<p>For these reasons (especially when using a combined single-objective fitness function) normalization of the individual components is usually necessary. There are three different alternatives:</p>
<p><strong>1. No normalization</strong><br />
If all components in the fitness function can be expected to have similar distributions (i.e. similar mean and standard deviation), then a normalization might not be necessary. This is rarely the case, however. Outliers are in fact a common phenomenon.</p>
<p><strong>2. Normalization of fitness value components with proportionate distances</strong><br />
The best fitness value component receives a predefined value of 1.0, the worst one 0.0. All other fitness value components are in proportionate distance between the two. (Koza favored a reversed order with the best fitness value component receiving the value 0.0 and the worst 1.0. The corresponding task is then to minimize this component rather than maximize it.) This procedure is still problematic if outliers exist, but sometimes - depending on one’s choice of selection operators - it might be desired to preserve the relative distances between values.</p>
<p><strong>3. Normalization of fitness value components by first ordering all components according to their rank</strong><br />
Like in the previous procedure, the best fitness value component receives a value of 1.0 and the worst 0.0 (or vice versa). All others are in equal distance to each other according to their rank. Unlike the previous procedure, this one also resolves the problem of outliers. Yet relevant information concerning the relative distances between fitness value components is lost irrevocably.</p>
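<p>The difference between alternatives 2 and 3 is easiest to see on a small made-up example (the numbers are arbitrary):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">vals <- c(10, 12, 15, 11, 90)  # fitness value components; 90 is an outlier

# Alternative 2: proportionate distances - the outlier compresses all other values
prop <- (vals - min(vals)) / (max(vals) - min(vals))

# Alternative 3: rank-based - equal spacing, relative distances are discarded
rnk <- (rank(vals) - 1) / (length(vals) - 1)

round(prop, 2)  # 0.00 0.03 0.06 0.01 1.00
round(rnk, 2)   # 0.00 0.50 0.75 0.25 1.00</code></pre></figure>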
<h2 id="common-pitfalls">Common pitfalls</h2>
<p>From my experience there are a few common pitfalls for beginners concerning the design of fitness functions. A typical beginner is likely to apply a single-objective fitness function and try to maximize total return. This is easily understandable - after all, ultimately it’s the amount of cash your trading strategy generates, is it not? As it turns out, this approach rarely leads to good results. Although the logic applied is valid, there are all sorts of unsolved problems with maximizing total return. Here are a few points to consider:</p>
<p><strong>Strategy depends on very few trades</strong><br />
In a market of increasing prices it is often difficult to beat the market and generate real alpha. A buy-and-hold strategy has the advantage of very low trading costs. Therefore, in this situation it is not uncommon for the best performing trading strategy to simply buy in the beginning and hold till the end. The best evolved trading strategies will correspondingly have none to very few trades - which is only consistent, but still not very desirable. These trading strategies do not represent a repeatable way to success; they basically perform well just by avoiding trading costs. Be aware that such “lazy trading strategies” can also imply a higher volatility than you are prepared to accept, because they simply replicate the market’s price movements due to their buy-and-hold tactics.<br />
For this reason, it might be interesting to try an improved version of the fitness function which corrects for the number of trades, for example the <em>sum or product of the normalized total return and the normalized total number of trades</em>. Adding weights to each fitness component might also be of interest.</p>
<p><strong>Strategy depends on few winning trades</strong><br />
A related problem is to evolve strategies which rely heavily on a few select winning trades compared to many losing trades. Theoretically, this is a sound strategy, as long as the overall expected value is still positive (see further below). However, one needs to be very cautious. If the winning trades are too few, then the strategy might again not represent a repeatable way to success. In other words, the distribution of the winning vs. the losing trades is important too. If you went short the Dow Jones Industrial Average (DJIA) before the <a href="http://en.wikipedia.org/wiki/Black_Monday_(1987)">Black Monday of 19th Oct 1987</a> you could have made a fortune. (The DJIA dropped by 22.61% on that day alone.) If you have only a single trade like this in your backtested strategy, it might still make up for years of small losses. But since a movement of this magnitude statistically occurs only very, very rarely, you cannot rely on it strategically. (Nevertheless you must be prepared for it to happen on the downside, otherwise such a loss could effectively wipe you out.) Be aware that with the increasing frequency of <a href="http://en.wikipedia.org/wiki/Flash_crash">flash crashes</a> this problem has rather increased than decreased.</p>
<p><strong>Outliers dominate in fitness-proportional selection</strong><br />
Another problem is the existence of outliers. It happens relatively often that a few individuals are so much superior to all other individuals in the generation that they tend to dominate all the others, although they represent only a local optimum rather than a global one. If fitness-proportional selection operators are used instead of rank-based ones, outliers have much higher chances of being selected for reproduction. Re-running the evolutionary process with the same random seed (assuming a single-threaded application without race conditions) will of course just repeat the outcome, hence re-running with varied random seeds is recommended.<br />
This can happen with both single- and multi-objective fitness functions. Sometimes, with multi-objective fitness functions, an individual’s fitness might be located at the very “border” of the fitness landscape, with for instance one fitness component being the maximum of all values and the other one being zero. It is often helpful to plot the relative distribution of the fitness values to get an impression.</p>
<h3 id="suggested-fitness-measures">Suggested fitness measures</h3>
<p>This is a list of fitness measures I personally consider worth trying out in combination. I’d probably start with a single-objective fitness function maximizing only expected value. Later on, once a multi-objective fitness function is in place, I’d add other fitness components too.</p>
<p><strong>Maximize expected value (EV)</strong>: The formula is <code>EV = Average Win * p<sub>Win</sub> - Average Loss * p<sub>Loss</sub></code>, with <em>Average Win (Loss)</em> being the average return of a winning (losing) trade, and <em>p<sub>Win</sub> (p<sub>Loss</sub>)</em> the probability for a winning (losing) trade. Of course the relationship p<sub>Win</sub> = 1.0 - p<sub>Loss</sub> must hold. The expected value <em>must</em> be a positive number - if it is not, then the system will surely lose money in the long run! I consider this to be the most straight-forward and intuitive measure to maximize for every trading system. (For zero-sum trades, i.e. trades that neither generate nor lose money, I’d count them as losers nevertheless, as there is nearly always a “risk-free interest” alternative to which the money could have been assigned to.)</p>
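<p>As a quick hypothetical illustration (the trade profit-and-loss numbers are invented), the EV formula translates directly into a few lines of R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">trades <- c(120, -40, 85, -60, -30, 200, -45)  # made-up profit/loss per trade

p.win    <- mean(trades > 0)            # probability of a winning trade
avg.win  <- mean(trades[trades > 0])    # average winning trade
avg.loss <- -mean(trades[trades <= 0])  # average losing trade, as a positive number
                                        # (zero-sum trades counted as losers, as argued above)
ev <- avg.win * p.win - avg.loss * (1 - p.win)
ev  # must be positive, otherwise the system loses money in the long run</code></pre></figure>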
<p><strong>Minimize max drawdown</strong>: Nobody likes losing money in the markets. One common problem is to know when to shut down a trading strategy because it supposedly no longer works. Trading is thus always also a psychological game, do you really trust your strategy? Having big drawdowns might explode your account and lead to margin calls at the worst possible moment.</p>
<p><strong>Maximize number of trades</strong>: This is a tricky one that beginners might not come up with easily. Every trade has its fees. Sometimes GP might assign very high fitness values to trading strategies with only very few (but all winning) trades. Such strategies are artifacts of random behavior, as they do not represent meaningful, reproducible trading success. Maximizing the number of trades, in combination with perhaps expected value, is a counter-measure against this problem. Nevertheless this measure might be problematic, as it gives preference to many short-term trades. When using this fitness measure it is therefore imperative to account for trading fees and possibly slippage.</p>
<p><strong>Maximize total return</strong>: Although very simple and intuitive, maximizing total return is often only a good idea in combination with other fitness measures.</p>
<p><strong>Minimize GP tree size</strong>: This is a tricky one too. The stated goal is of course to prevent code bloat leading to overfitting. We will cover parsimony pressure in a later article. I personally believe that trying to punish the complexity of GP tree rules in the fitness function does not lead to very good results. Of course, this measure could also be combined with others, e.g. <em>maximize total return divided by GP tree size</em>. Still I have my concerns. I believe that parsimony pressure techniques which “mechanically” prevent the construction of large GP decision trees are far superior to punishing complexity in fitness functions.</p>
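<p>As promised above, here is a minimal sketch of how several of these measures could be folded into a single scalar fitness value. All weights and normalizations are arbitrary placeholders, not recommendations:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java">// Toy weighted-sum fitness; the weights are made up and would need tuning.
public static double fitness(double expectedValue, int numTrades,
                             double maxDrawdown, int treeSize) {
    double wEv = 1.0, wTrades = 0.1, wDrawdown = 0.5, wSize = 0.01;
    return wEv * expectedValue
         + wTrades * Math.log1p(numTrades)  // diminishing reward for more trades
         - wDrawdown * maxDrawdown          // drawdown passed in as a positive fraction
         - wSize * treeSize;                // mild complexity penalty, see the next article
}</code></pre></figure>
<p>A proper multi-objective approach avoids having to pick such weights at all, but a weighted sum is often good enough to get started.</p>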
<hr />
<h2 id="other-articles">Other articles</h2>
<h3 id="previous">Previous</h3>
<ul>
<li>Part 4: <a href="/2014/11/01/evolving-trading-strategies-with-genetic-programming-gp-parameters-and-operators/">Evolving Trading Strategies With Genetic Programming - GP Parameters and Operators</a></li>
</ul>
<h3 id="next">Next</h3>
<ul>
<li>Part 6: <a href="/2015/01/14/evolving-trading-strategies-with-genetic-programming-punishing-complexity/">Evolving Trading Strategies With Genetic Programming - Punishing Complexity</a></li>
</ul>
<hr />
<h1 id="references">References</h1>
Musings on Programming Paradigms2014-12-08T00:00:00+01:00http://fabian-kostadinov.github.io/2014/12/08/musings-on-programming-paradigms<p>Yesterday I had to fill out a questionnaire on my programming skills as part of a job application procedure. I was asked to name some GoF and JEE patterns I am familiar with, so I pondered for a little while on what I had learned. One thing that somehow struck me as odd was the silent underlying assumption that a “good programmer” nowadays is actually supposed to be familiar with these patterns. Singleton, MVC, Factory, Adapter, Flyweight, DAO and so on. After all, programming is a lot about robust design and improving code quality, isn’t it? And sure it is. There is nothing more frustrating than trying to untangle an old piece of software code horribly designed and possibly poorly documented. Using OOP patterns appropriately simply makes the world a better place. So, where is the problem?<span class="more"></span></p>
<p>The problem is: On a cloudy day in fall roughly two years ago I decided to learn the Clojure programming language. I am not an expert here, I hasten to add. But learning Clojure is like a <a href="http://en.wikipedia.org/wiki/Satori">Satori in Zen</a>. Once you’ve had one, the world is never going to be exactly the same. Clojure is different. Not only is it a functional programming language with a weird syntax (there are others like Scala or Python) but it is also <a href="http://en.wikipedia.org/wiki/Homoiconicity">homoiconic</a>. And there are macros. It is the combination that makes the difference.</p>
<p>If you haven’t yet stumbled upon them, I really recommend reading the following two articles on a quiet Sunday afternoon.</p>
<ul>
<li><a href="http://www.paulgraham.com/icad.html">Paul Graham - Revenge of the Nerds</a></li>
<li><a href="http://www.eecis.udel.edu/~decker/courses/280f07/paper/KingJava.pdf">Steve Yegge - Execution in the Kingdom of Nouns</a></li>
</ul>
<p>Of course, both pieces are a little polemic. Java is not bad. In fact, it is much better than many other programming languages existing before it. (Due to its simplicity and platform independence I personally prefer it over C++, for example. But that’s a matter of taste.) Yet, once you get an impression of how much less boilerplate code is actually needed in a functional language like Scala or Clojure, you start feeling envious. In Clojure you actually think about your function for a considerable amount of time, then you sit down and write a few lines of code so powerful that it would take up to 10x the amount of code in Java to achieve the same. You don’t believe me? <a href="http://redmonk.com/dberkholz/2013/03/25/programming-languages-ranked-by-expressiveness/">Read this article</a>.</p>
<p>In his article, Paul Graham concludes:</p>
<blockquote>When I see patterns in my programs, I consider it a sign of trouble. The shape of a program should reflect only the problem it needs to solve. Any other regularity in the code is a sign, to me at least, that I'm using abstractions that aren't powerful enough-- often that I'm generating by hand the expansions of some macro that I need to write.</blockquote>
<p>Here we are at the core of the problem. To me, using OOP patterns is often a sign of a programming language’s expressive weakness rather than an indicator of beautiful design. Many OOP patterns are simply obsolete when using a functional language: you don’t need them because the problem addressed by the design pattern typically does not or cannot arise. Of course this does not mean that I don’t use OOP patterns a lot. I certainly do, and I recommend that every programmer do so.<br />
I am also aware that Java 8 introduces its own version of <a href="https://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html">lambda expressions</a>. I remain a little skeptical. In a language like Clojure, the majority of my code is functional and only in certain places do I deviate from the paradigm. In an OOP language like Java, the majority of code is object-oriented, even if in fact it makes little sense (as in <code>new Importer().import()</code> or in <code>new MyFunction().execute()</code>). To me, having lambda expressions in Java is a step in the right direction and I will happily use them as soon as I can, but having lambda expressions still does not make Java a functional programming language. And <a href="http://www.beyondjava.net/blog/java-8-functional-programming-language/">I’m not the only person thinking this way</a>.</p>
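<p>For illustration, here is a minimal Java 8 sketch of the difference; the <code>Runnable</code> wrapper stands in for the <code>new MyFunction().execute()</code> style:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java">import java.util.Arrays;
import java.util.List;

public class LambdaDemo {
    public static void main(String[] args) {
        // Pre-Java-8: a whole anonymous class just to pass one function around.
        Runnable oldStyle = new Runnable() {
            @Override
            public void run() { System.out.println("executed"); }
        };
        oldStyle.run();

        // Java 8: the function itself is the value - no wrapper noun required.
        Runnable newStyle = () -> System.out.println("executed");
        newStyle.run();

        // Lambdas also make collection processing far less verbose.
        List<String> words = Arrays.asList("kingdom", "of", "nouns");
        words.stream().map(String::toUpperCase).forEach(System.out::println);
    }
}</code></pre></figure>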
<p>If you would like to have a glimpse of what is possible with functional programming languages in combination with NoSQL databases watch this video:</p>
<center><iframe src="//player.vimeo.com/video/15955920" width="500" height="281" frameborder="0" webkitallowfullscreen="" mozallowfullscreen="" allowfullscreen=""></iframe></center>
Factor Analysis2014-12-06T00:00:00+01:00http://fabian-kostadinov.github.io/2014/12/06/factor-analysis<p>There is a good series of “cowboy-style” youtube videos on factor analysis.<span class="more"></span></p>
<p><em>Factor Analysis - An Introduction (Part 1):</em></p>
<iframe width="560" height="315" src="//www.youtube.com/embed/MB-5WB3eZI8" frameborder="0" allowfullscreen=""></iframe>
<p><em>Factor Analysis - Assumptions (Part 2):</em></p>
<iframe width="560" height="315" src="//www.youtube.com/embed/fyStsHk5Ohs" frameborder="0" allowfullscreen=""></iframe>
<p><em>Factor Analysis - Research Questions (Part 3):</em></p>
<iframe width="560" height="315" src="//www.youtube.com/embed/l_uFfxdYzAA" frameborder="0" allowfullscreen=""></iframe>
<p><em>Factor Analysis - Using SPSS (Part 4):</em></p>
<iframe width="560" height="315" src="//www.youtube.com/embed/pRlJZ2nn8c4" frameborder="0" allowfullscreen=""></iframe>
<p><em>Factor Analysis - Interpreting the Readout (Part 5):</em></p>
<iframe width="560" height="315" src="//www.youtube.com/embed/eAl0nXkzt7w" frameborder="0" allowfullscreen=""></iframe>
<p><em>Factor Analysis - Examining Factor Loadings (Part 6):</em></p>
<iframe width="560" height="315" src="//www.youtube.com/embed/NsynBKI8sLo" frameborder="0" allowfullscreen=""></iframe>
<p>As I am a fan of the <a href="http://www.r-project.org/">R software</a> (or <a href="http://www.rstudio.com/">R Studio</a>, both of which are available as free software), here is a video on how to perform factor analysis in R by Edward Boone:</p>
<p><em>Exploratory Factor Analysis in R:</em></p>
<iframe width="560" height="315" src="//www.youtube.com/embed/Ilf1XR-K3ps" frameborder="0" allowfullscreen=""></iframe>
<p>In a separate video he also explains principal component analysis (PCA):</p>
<iframe width="560" height="315" src="//www.youtube.com/embed/Heh7Nv4qimU?list=UUP26dEI5nBLwt8xQk0LdnWg" frameborder="0" allowfullscreen=""></iframe>
<p>For the necessary mathematical background, I really enjoy <a href="https://www.youtube.com/user/SpartacanUsuals">Ben Lambert’s youtube videos on various statistics topics</a>. His explanations are easy to understand and follow. Here are a few great introductory videos on factor analysis.</p>
<p><em>Factor Analysis - An Introduction</em></p>
<iframe width="420" height="315" src="//www.youtube.com/embed/WV_jcaDBZ2I" frameborder="0" allowfullscreen=""></iframe>
<p><em>Factor Analysis - Model Representation - Part 1</em></p>
<iframe width="420" height="315" src="//www.youtube.com/embed/TeIx7dRedkg" frameborder="0" allowfullscreen=""></iframe>
<p><em>Factor Analysis - Model Representation - Part 2</em></p>
<iframe width="420" height="315" src="//www.youtube.com/embed/mIx-qlI4mJo" frameborder="0" allowfullscreen=""></iframe>
<p><em>Factor Analysis - Model Representation - Part 3</em></p>
<iframe width="420" height="315" src="//www.youtube.com/embed/yWUKzEIeVWU" frameborder="0" allowfullscreen=""></iframe>
<p><em>Factor Analysis - Model Representation - Part 4</em></p>
<iframe width="420" height="315" src="//www.youtube.com/embed/-XBqfU4AzFE" frameborder="0" allowfullscreen=""></iframe>
<p>These four videos are just the introduction to the topic. There is a <a href="https://www.youtube.com/playlist?list=PLwJRxp3blEvaOTZfSKXysxRmi6gXJf5gP">whole playlist on factor analysis covering much more</a>.</p>
Implementing a Fixed-Length FIFO Queue in Java2014-11-25T00:00:00+01:00http://fabian-kostadinov.github.io/2014/11/25/implementing-a-fixed-length-fifo-queue-in-java<p>When working with time series data, one often needs to calculate sums of consecutive numbers for a predetermined time frame. Imagine for example calculating a moving average with a fixed size. Let’s look at a very simple time series.<span class="more"></span></p>
<figure class="highlight"><pre><code class="language-console" data-lang="console"><span class="go">[0 |1 |2 |3 |4 |5 |6 |7 ]</span></code></pre></figure>
<p>Assuming a moving average of length 4 results in the following array:</p>
<figure class="highlight"><pre><code class="language-console" data-lang="console"><span class="go">[x |x |x |1.5|2.5|3.5|4.5|5.5]</span></code></pre></figure>
<p>The formula for a moving average of length 4 thus is:<br />
<code>MA<sub>t</sub> = (Sum of all elements from t-3 to t) / 4</code>
How would we efficiently implement this in Java code? The problem is that we need to calculate the <em>Sum</em> in the formula for every moving average. Of course it is possible to always reiterate over all numbers in the current time frame, but this is unnecessarily slow. Instead, we can simply subtract the oldest element in the time frame from <em>Sum</em> and add the newest one. In this way we save a significant number of unnecessary computations. Still, we have to keep track of which actually are the old and the new elements - we have to store these numbers somewhere. An appropriate data structure would be a first-in-first-out (FIFO) queue of numbers.</p>
<p>But how exactly can a FIFO queue be implemented in a (non-functional) programming language such as Java? The first idea is typically an array-based implementation that shifts the position of elements by repeatedly creating slightly shifted copies of the array. In the above example, we would need to create a new array five times, once for every new <em>Sum</em> being calculated. This is very inefficient, because creating an array in memory is relatively slow. Implementations based on classes such as java.util.ArrayList or java.util.Vector are already much better, because internally they rely on longer arrays and indices. Still, this is not the best solution, because once the internal indices move outside the internal array’s boundaries a new copy of the internal array must be created.</p>
<p>A typical alternative for implementing FIFO queues is thus using a linked list:</p>
<figure class="highlight"><pre><code class="language-console" data-lang="console"><span class="gp">[0-></span><span class="w"> </span>1-> 2-> 3-> 4-> 5-> 6-> 7]</code></pre></figure>
<p>The advantage is obvious: no more copying or re-creating arrays in memory. All we have to do is manipulate a few pointers. Of course we lose the ability to directly access an element in the queue by index, but for our purpose - calculating moving averages - this is something we do not need anyway.</p>
<p>Yesterday, it suddenly occurred to me that there is actually an even better alternative if the length of the queue is fixed (as in our example). We can effectively use a <em>ring</em>.
<img src="/public/img/2014-11-25-ring-implementation-of-fixed-length-fifo-queue.png" alt="Ring Implementation of Fixed-Length FIFO Queue" title="Ring Implementation of Fixed-Length FIFO Queue" />
Adding a new number to the queue and dropping the oldest one is the same as simply replacing the oldest element in this ring with a new one. Internally, we can again use an array of a fixed length in combination with a rotating index. This is what the code looks like in Java. First, let’s create our own Queue interface:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">Queue</span><span class="o"><</span><span class="no">E</span><span class="o">></span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">boolean</span> <span class="nf">add</span><span class="o">(</span><span class="no">E</span> <span class="n">e</span><span class="o">);</span>
<span class="kd">public</span> <span class="no">E</span> <span class="nf">element</span><span class="o">();</span>
<span class="kd">public</span> <span class="kt">boolean</span> <span class="nf">offer</span><span class="o">(</span><span class="no">E</span> <span class="n">e</span><span class="o">);</span>
<span class="kd">public</span> <span class="no">E</span> <span class="nf">peek</span><span class="o">();</span>
<span class="kd">public</span> <span class="no">E</span> <span class="nf">poll</span><span class="o">();</span>
<span class="kd">public</span> <span class="no">E</span> <span class="nf">remove</span><span class="o">();</span>
<span class="o">}</span></code></pre></figure>
<p>This interface deviates somewhat from the one provided in the Java libraries but this is unimportant for now. Next, the implementation of our queue:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">NumberFixedLengthFifoQueue</span> <span class="kd">implements</span> <span class="nc">Queue</span><span class="o"><</span><span class="nc">Number</span><span class="o">></span> <span class="o">{</span>
<span class="kd">protected</span> <span class="nc">Number</span><span class="o">[]</span> <span class="n">ring</span><span class="o">;</span>
<span class="kd">protected</span> <span class="kt">int</span> <span class="n">index</span><span class="o">;</span>
<span class="cm">/**
* @param initialValues contains the ring's initial values.
* The "oldest" value in the queue is expected to reside in
* position 0, the newest one in position length-1.
*/</span>
<span class="kd">public</span> <span class="nf">NumberFixedLengthFifoQueue</span><span class="o">(</span><span class="nc">Number</span><span class="o">[]</span> <span class="n">initialValues</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// This is a little ugly, but there are no</span>
<span class="c1">// generic arrays in Java</span>
<span class="n">ring</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Number</span><span class="o">[</span><span class="n">initialValues</span><span class="o">.</span><span class="na">length</span><span class="o">];</span>
<span class="c1">// We don't want to work on the original data</span>
<span class="nc">System</span><span class="o">.</span><span class="na">arraycopy</span><span class="o">(</span><span class="n">initialValues</span><span class="o">,</span> <span class="mi">0</span><span class="o">,</span> <span class="n">ring</span><span class="o">,</span> <span class="mi">0</span><span class="o">,</span> <span class="n">initialValues</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
<span class="c1">// The next time we add something to the queue,</span>
<span class="c1">// the oldest element should be replaced</span>
<span class="n">index</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">boolean</span> <span class="nf">add</span><span class="o">(</span><span class="nc">Number</span> <span class="n">newest</span><span class="o">)</span> <span class="o">{</span>
<span class="k">return</span> <span class="nf">offer</span><span class="o">(</span><span class="n">newest</span><span class="o">);</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="nc">Number</span> <span class="nf">element</span><span class="o">()</span> <span class="o">{</span>
<span class="k">return</span> <span class="n">ring</span><span class="o">[</span><span class="n">getHeadIndex</span><span class="o">()];</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">boolean</span> <span class="nf">offer</span><span class="o">(</span><span class="nc">Number</span> <span class="n">newest</span><span class="o">)</span> <span class="o">{</span>
<span class="nc">Number</span> <span class="n">oldest</span> <span class="o">=</span> <span class="n">ring</span><span class="o">[</span><span class="n">index</span><span class="o">];</span>
<span class="n">ring</span><span class="o">[</span><span class="n">index</span><span class="o">]</span> <span class="o">=</span> <span class="n">newest</span><span class="o">;</span>
<span class="n">incrIndex</span><span class="o">();</span>
<span class="k">return</span> <span class="kc">true</span><span class="o">;</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="nc">Number</span> <span class="nf">peek</span><span class="o">()</span> <span class="o">{</span>
<span class="k">return</span> <span class="n">ring</span><span class="o">[</span><span class="n">getHeadIndex</span><span class="o">()];</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="nc">Number</span> <span class="nf">poll</span><span class="o">()</span> <span class="o">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nf">IllegalStateException</span><span class="o">(</span><span class="s">"The poll method is not available for NumberFixedLengthFifoQueue."</span><span class="o">);</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="nc">Number</span> <span class="nf">remove</span><span class="o">()</span> <span class="o">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nf">IllegalStateException</span><span class="o">(</span><span class="s">"The remove method is not available for NumberFixedLengthFifoQueue."</span><span class="o">);</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="nc">Number</span> <span class="nf">get</span><span class="o">(</span><span class="kt">int</span> <span class="n">absIndex</span><span class="o">)</span> <span class="kd">throws</span> <span class="nc">IndexOutOfBoundsException</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="n">absIndex</span> <span class="o">>=</span> <span class="n">ring</span><span class="o">.</span><span class="na">length</span><span class="o">)</span> <span class="o">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nf">IndexOutOfBoundsException</span><span class="o">(</span><span class="s">"Invalid index "</span> <span class="o">+</span> <span class="n">absIndex</span><span class="o">);</span>
<span class="o">}</span>
<span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">index</span> <span class="o">+</span> <span class="n">absIndex</span><span class="o">;</span>
<span class="k">if</span> <span class="o">(</span><span class="n">i</span> <span class="o">>=</span> <span class="n">ring</span><span class="o">.</span><span class="na">length</span><span class="o">)</span> <span class="o">{</span>
<span class="n">i</span> <span class="o">-=</span> <span class="n">ring</span><span class="o">.</span><span class="na">length</span><span class="o">;</span>
<span class="o">}</span>
<span class="k">return</span> <span class="n">ring</span><span class="o">[</span><span class="n">i</span><span class="o">];</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="nc">String</span> <span class="nf">toString</span><span class="o">()</span> <span class="o">{</span>
<span class="nc">StringBuffer</span> <span class="n">sb</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">StringBuffer</span><span class="o">(</span><span class="s">"["</span><span class="o">);</span>
<span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">index</span><span class="o">,</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">n</span> <span class="o"><</span> <span class="n">ring</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span> <span class="o">=</span> <span class="n">nextIndex</span><span class="o">(</span><span class="n">i</span><span class="o">),</span> <span class="n">n</span><span class="o">++)</span> <span class="o">{</span>
<span class="n">sb</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="n">ring</span><span class="o">[</span><span class="n">i</span><span class="o">]);</span>
<span class="k">if</span> <span class="o">(</span><span class="n">n</span><span class="o">+</span><span class="mi">1</span> <span class="o"><</span> <span class="n">ring</span><span class="o">.</span><span class="na">length</span><span class="o">)</span> <span class="o">{</span> <span class="n">sb</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="s">", "</span><span class="o">);</span> <span class="o">}</span>
<span class="o">}</span>
<span class="k">return</span> <span class="n">sb</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="s">"]"</span><span class="o">).</span><span class="na">toString</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">protected</span> <span class="kt">void</span> <span class="nf">incrIndex</span><span class="o">()</span> <span class="o">{</span>
<span class="n">index</span> <span class="o">=</span> <span class="n">nextIndex</span><span class="o">(</span><span class="n">index</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">protected</span> <span class="kt">int</span> <span class="nf">nextIndex</span><span class="o">(</span><span class="kt">int</span> <span class="n">current</span><span class="o">)</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="n">current</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">>=</span> <span class="n">ring</span><span class="o">.</span><span class="na">length</span><span class="o">)</span> <span class="o">{</span> <span class="k">return</span> <span class="mi">0</span><span class="o">;</span> <span class="o">}</span>
<span class="k">else</span> <span class="k">return</span> <span class="n">current</span> <span class="o">+</span> <span class="mi">1</span><span class="o">;</span>
<span class="o">}</span>
<span class="kd">protected</span> <span class="kt">int</span> <span class="nf">previousIndex</span><span class="o">(</span><span class="kt">int</span> <span class="n">current</span><span class="o">)</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="n">current</span> <span class="o">-</span> <span class="mi">1</span> <span class="o"><</span> <span class="mi">0</span><span class="o">)</span> <span class="o">{</span> <span class="k">return</span> <span class="n">ring</span><span class="o">.</span><span class="na">length</span> <span class="o">-</span> <span class="mi">1</span><span class="o">;</span> <span class="o">}</span>
<span class="k">else</span> <span class="k">return</span> <span class="n">current</span> <span class="o">-</span> <span class="mi">1</span><span class="o">;</span>
<span class="o">}</span>
<span class="kd">protected</span> <span class="kt">int</span> <span class="nf">getHeadIndex</span><span class="o">()</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="n">index</span> <span class="o">==</span> <span class="mi">0</span><span class="o">)</span> <span class="o">{</span> <span class="k">return</span> <span class="n">ring</span><span class="o">.</span><span class="na">length</span><span class="o">-</span><span class="mi">1</span><span class="o">;</span> <span class="o">}</span>
<span class="k">else</span> <span class="k">return</span> <span class="n">index</span><span class="o">-</span><span class="mi">1</span><span class="o">;</span>
<span class="o">}</span>
<span class="o">}</span></code></pre></figure>
<p>The queue “rolls” through the ring. Adding a new element at the head of the queue automatically removes the oldest element in the queue - no copying of arrays or resetting of object references required. Unlike with linked lists we can actually access each element in the ring directly with the <code>get</code> method. Finally, we can create a subclass of our queue object which will graciously roll over as new values are added into the queue/ring.</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">RollingMovingAverage</span> <span class="kd">extends</span> <span class="nc">NumberFixedLengthFifoQueue</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kt">float</span> <span class="n">maNumerator</span><span class="o">;</span>
<span class="kd">private</span> <span class="kt">float</span> <span class="n">maValue</span><span class="o">;</span>
<span class="kd">public</span> <span class="nf">RollingMovingAverage</span><span class="o">(</span><span class="nc">Number</span><span class="o">[]</span> <span class="n">initialValues</span><span class="o">)</span> <span class="o">{</span>
<span class="kd">super</span><span class="o">(</span><span class="n">initialValues</span><span class="o">);</span>
<span class="n">maNumerator</span> <span class="o">=</span> <span class="mf">0.0f</span><span class="o">;</span>
<span class="n">maValue</span> <span class="o">=</span> <span class="mf">0.0f</span><span class="o">;</span>
<span class="n">initialize</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">float</span> <span class="nf">getValue</span><span class="o">()</span> <span class="o">{</span>
<span class="k">return</span> <span class="n">maValue</span><span class="o">;</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">boolean</span> <span class="nf">add</span><span class="o">(</span><span class="nc">Number</span> <span class="n">newest</span><span class="o">)</span> <span class="o">{</span>
<span class="k">return</span> <span class="k">this</span><span class="o">.</span><span class="na">offer</span><span class="o">(</span><span class="n">newest</span><span class="o">);</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">boolean</span> <span class="nf">offer</span><span class="o">(</span><span class="nc">Number</span> <span class="n">newest</span><span class="o">)</span> <span class="o">{</span>
<span class="n">maNumerator</span> <span class="o">-=</span> <span class="n">ring</span><span class="o">[</span><span class="n">index</span><span class="o">].</span><span class="na">floatValue</span><span class="o">();</span>
<span class="kt">boolean</span> <span class="n">res</span> <span class="o">=</span> <span class="kd">super</span><span class="o">.</span><span class="na">offer</span><span class="o">(</span><span class="n">newest</span><span class="o">);</span>
<span class="n">maNumerator</span> <span class="o">+=</span> <span class="n">ring</span><span class="o">[</span><span class="n">getHeadIndex</span><span class="o">()].</span><span class="na">floatValue</span><span class="o">();</span>
<span class="n">maValue</span> <span class="o">=</span> <span class="n">maNumerator</span> <span class="o">/</span> <span class="o">(</span><span class="kt">float</span><span class="o">)</span> <span class="n">ring</span><span class="o">.</span><span class="na">length</span><span class="o">;</span>
<span class="k">return</span> <span class="n">res</span><span class="o">;</span>
<span class="o">}</span>
<span class="kd">private</span> <span class="kt">void</span> <span class="nf">initialize</span><span class="o">()</span> <span class="o">{</span>
<span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">previousIndex</span><span class="o">(</span><span class="n">index</span><span class="o">),</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">n</span> <span class="o"><</span> <span class="n">ring</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span> <span class="o">=</span> <span class="n">previousIndex</span><span class="o">(</span><span class="n">i</span><span class="o">),</span> <span class="n">n</span><span class="o">++)</span> <span class="o">{</span>
<span class="n">maNumerator</span> <span class="o">+=</span> <span class="n">ring</span><span class="o">[</span><span class="n">i</span><span class="o">].</span><span class="na">floatValue</span><span class="o">();</span>
<span class="o">}</span>
<span class="n">maValue</span> <span class="o">=</span> <span class="n">maNumerator</span> <span class="o">/</span> <span class="o">(</span><span class="kt">float</span><span class="o">)</span> <span class="n">ring</span><span class="o">.</span><span class="na">length</span><span class="o">;</span>
<span class="o">}</span>
<span class="o">}</span></code></pre></figure>
<p>We can use the class now. The length of the moving average is initially set through the length of the array given to its constructor.</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="nc">Integer</span><span class="o">[]</span> <span class="n">initialMovAvgFrame</span> <span class="o">=</span> <span class="o">{</span> <span class="mi">0</span><span class="o">,</span> <span class="mi">1</span><span class="o">,</span> <span class="mi">2</span><span class="o">,</span> <span class="mi">3</span> <span class="o">};</span>
<span class="nc">RollingMovingAverage</span> <span class="n">ma</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">RollingMovingAverage</span><span class="o">(</span><span class="n">initialMovAvgFrame</span><span class="o">);</span>
<span class="n">ma</span><span class="o">.</span><span class="na">getValue</span><span class="o">();</span> <span class="c1">// returns 1.5</span>
<span class="n">ma</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="mi">4</span><span class="o">);</span>
<span class="n">ma</span><span class="o">.</span><span class="na">getValue</span><span class="o">();</span> <span class="c1">// returns 2.5</span>
<span class="n">ma</span><span class="o">.</span><span class="na">add</span><span class="o">(-</span><span class="mi">1</span><span class="o">);</span>
<span class="n">ma</span><span class="o">.</span><span class="na">getValue</span><span class="o">();</span> <span class="c1">// returns 2</span></code></pre></figure>
Enova International Inc. - A Very Attractive Spin-Off Opportunity2014-11-20T00:00:00+01:00http://fabian-kostadinov.github.io/2014/11/20/enova-international-inc-a-very-attractive-spin-off-opportunity<p>Yesterday, my first article was accepted for publication on <a href="http://www.seekingalpha.com">SeekingAlpha.com</a>, which - I must admit - made me quite proud. Not only that, but the article even received an <em>Editor’s Pick</em>, which means that the editor(s) especially liked it. The article is about <a href="http://www.enova.com">Enova International Inc.</a> which just spun off from its parent <a href="http://www.cashamerica.com/">Cash America International Inc.</a>, and which I consider to be a great buy opportunity right now.<span class="more"></span> The corresponding reasons are given in said article. As the article was accepted as a SeekingAlpha PRO article, this prohibits me from posting it on my blog. The article should be made available also to non-PRO SeekingAlpha members (for which a membership is free) in the next 24 hours for a 30-day time frame. After this time window it will again be only accessible to SeekingAlpha PRO subscribers. Here’s the corresponding link: <a href="http://seekingalpha.com/article/2695535-enova-international-inc-a-very-attractive-spin-off-opportunity">http://seekingalpha.com/article/2695535-enova-international-inc-a-very-attractive-spin-off-opportunity</a></p>
Installation of Jekyll-Auth2014-11-13T00:00:00+01:00http://fabian-kostadinov.github.io/2014/11/13/installation-of-jekyll-auth<p>These are (hopefully) complete installation instructions for Jekyll-Auth. To understand how Jekyll-Auth works, you need a conceptual understanding of how <a href="https://github.com/adaoraul/rack-jekyll/">Rack-Jekyll</a> and <a href="https://github.com/rack/rack">Rack</a> work. The next figure shows the conceptual workflow of Jekyll-Auth in combination with a repository on GitHub.com.<span class="more"></span></p>
<p><img src="/public/img/2014-11-13-installation-of-jekyll-auth.png" alt="Jekyll-Auth Workflow" title="Jekyll-Auth Workflow" /></p>
<ul>
<li>On GitHub.com, there exists an organization <em>foo-organization</em> containing a team <em>foo-team</em> and a repository <em>foo-repository</em>. <em>foo-repository</em> is private, therefore only members of <em>foo-team</em> can view or modify its contents.</li>
<li>Although the repository contains a complete Jekyll-enabled website, GitHub pages is not used to host any of this content. Instead, all users only work on the <code>master</code> branch (or any other branch except a GitHub pages <code>gh-pages</code> branch).</li>
<li>The website’s content is not accessible publicly. Instead, on Heroku an app is running asking for user authentication when someone tries to access the website. This Heroku app is a Jekyll server enhanced with Jekyll-Auth functionality.</li>
<li>In the example, there are currently two users pushing and pulling to <em>foo-repository</em> on GitHub.com: Anna and Bob. Both users are members of <em>foo-organization</em> and of <em>foo-team</em>.</li>
<li>Bob is an ordinary GitHub user. He works on a local clone of <em>foo-repository</em>. Whenever he has changes, he simply pushes them to the remote GitHub repository.</li>
<li>Anna is also a GitHub user, hence she has her own local clone of <em>foo-repository</em> like Bob. However, unlike Bob, she is also responsible for hosting the website’s content on Heroku so that it is really accessible as a fully functional website (and not only as files) to all members of <em>foo-team</em>.</li>
<li>The local version of the GitHub repository will therefore contain both the common website files and the special files provided by Jekyll-Auth. Bob does not care much about these additional files. But Anna has two responsibilities. Like Bob she can work on the website’s content. But at certain predefined points in time, she can also update the hosted website running on Jekyll as a Heroku app.</li>
<li>Heroku is <em>not</em> notified of changes going on inside <em>foo-repository</em> on GitHub. Anna needs to manually perform a <code>git push heroku master</code> every time she wants to update the hosted website on Heroku.</li>
<li>The hosted website is accessible under the URL <code>https://my-foo-heroku.herokuapp.com</code>. Whenever a user wants to access this site, Heroku redirects him to a GitHub authorization page, where he must provide his GitHub username and password. Heroku also sends Client ID, Client Secret as well as organization ID and/or team ID (e.g. “@foo-organization/foo-team”) to GitHub. GitHub now tries to authenticate the user, and if he is a member of <em>foo-organization</em> and <em>foo-team</em> and is allowed to access <em>foo-repository</em>, then he is granted access to the website’s content. In the example, Charles is no organization/team member and is denied access to the website.</li>
<li>Access rights to files and folders are specified inside the <em>foo-repository</em>’s <em>_config.yml</em> file.</li>
</ul>
<h2 id="installation-instructions">Installation instructions</h2>
<p>This is a step-by-step installation instruction.</p>
<hr />
<p><em>Prerequisites:</em> Before you begin, you will need a repository that contains your website’s content, that is all the HTML and CSS files, JavaScript code, your images etc. Make sure your local clone is up to date with your remote GitHub repository. I will call this your <em>website repository</em>.</p>
<hr />
<p><em>Step 1:</em> Make sure you have a <a href="http://www.heroku.com">Heroku account</a>. A free one will be sufficient for most needs.</p>
<hr />
<p><em>Step 2:</em> Make sure you have <a href="https://toolbelt.heroku.com/">Heroku Toolbelt</a> installed. You will probably need to use your Heroku login information. Further down, you will run <code>bundle install</code>. If you don’t have the Heroku Toolbelt installed but instead use the Heroku gem, the installation will not work properly and you will receive this warning message:</p>
<figure class="highlight"><pre><code class="language-console" data-lang="console"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="code"><pre><span class="go">Your bundle is complete!
Use `bundle show [gemname]` to see where a bundled gem is installed.
Post-install message from heroku:
! The `heroku` gem has been deprecated and replaced with the Heroku Toolbelt.
! Download and install from: https://toolbelt.heroku.com
! For API access, see: https://github.com/heroku/heroku.rb</span>
</pre></td></tr></tbody></table></code></pre></figure>
<p>If this shows up, you first need to uninstall the Heroku gem: <code>gem uninstall heroku</code>. The Heroku gem is deprecated and will interfere with your Heroku Toolbelt installation, so make sure you actually uninstalled it.</p>
<hr />
<p><em>Step 3:</em> If you’ve just installed Heroku Toolbelt, you will probably have to recreate SSH keys, otherwise your local Heroku Toolbelt will not be able to push files to the Heroku server. There is this <a href="https://devcenter.heroku.com/articles/keys">article at heroku.com on the use of SSH keys</a> if you want to know more about this. Create a key with <code>ssh-keygen -t rsa</code>, then add it to Heroku with <code>heroku keys:add</code>. Make sure that you <strong>do not mistakenly publish your private RSA key file</strong> together with the rest of the website!</p>
<hr />
<p><em>Step 4:</em> Use Heroku Toolbelt to <a href="https://devcenter.heroku.com/articles/creating-apps">create a new Heroku app</a>: <code>heroku create my-new-cool-heroku-app</code>. Your website will be available at <code>https://my-new-cool-heroku-app.herokuapp.com</code>, and it will have a Heroku git remote to push to called <code>git@heroku.com:my-new-cool-heroku-app.git</code>. Note that, once inside the repository and once you have added and committed files with <code>git commit -m "Blah"</code>, you can push either to the Heroku git remote using the command <code>git push heroku master</code> or to the GitHub remote repository using the command <code>git push</code>.</p>
<p>Important: Creating a Heroku app in this way will automatically deploy it to servers in the USA. <a href="https://devcenter.heroku.com/articles/app-migration">As is explained in this tutorial on devcenter.herokuapp.com</a>, in case you want the app to run on servers located in Europe, you must first create a new dummy app, create a fork with the corrected region and then delete the dummy app.</p>
<figure class="highlight"><pre><code class="language-shell-session" data-lang="shell-session"><span class="gp">#</span><span class="w"> </span>First create a dummy app running on US servers
<span class="go">heroku create my-dummy-app-in-us
</span><span class="gp">#</span><span class="w"> </span>The next line will create fork of the first app, but running on Heroku servers <span class="k">in </span>Europe
<span class="go">heroku fork -a my-dummy-app-in-us my-new-cool-heroku-app --region eu
</span><span class="gp">#</span><span class="w"> </span>Don<span class="s1">'t forget to delete the dummy app afterwards
</span><span class="go">heroku apps:destroy my-dummy-app-in-us
! WARNING: Potentially Destructive Action
! This command will destroy my-dummy-app-in-us (including all add-ons).
! To proceed, type "my-dummy-app-in-us" or re-run this command with --confirm my-dummy-app-in-us
</span><span class="gp">></span><span class="w"> </span>my-dummy-app-in-us
<span class="go">Destroying my-dummy-app-in-us (including all add-ons)... done</span></code></pre></figure>
<hr />
<p><em>Step 5:</em> The Heroku app will access the GitHub account to perform an authorization check for every user. If the user is registered with the corresponding GitHub account, she will also be allowed to access the Heroku app. Hence, the Heroku app must be registered with GitHub. Upon registration, you will receive an OAuth2 Client ID and Client Secret which will be needed in a later step.</p>
<p>Login to your organization’s GitHub account, i.e. something like <code>https://github.com/organizations/foo-organization/settings/applications</code>. Click on <em>Register new application</em>. Enter the following information:</p>
<ul>
<li><em>Application name:</em> Any meaningful name for this application.</li>
<li><em>Homepage URL:</em> The link to your Heroku app created in step 4, e.g. <code>https://my-new-cool-heroku-app.herokuapp.com</code>.</li>
<li><em>Application description:</em> A textual description.</li>
<li><em>Authorization callback URL:</em> Same as homepage url + <code>/auth/github/callback</code> appended, e.g. <code>https://my-new-cool-heroku-app.herokuapp.com/auth/github/callback</code></li>
</ul>
<p>Attention: The correct Heroku URL necessarily starts with <code>https://...</code> and not with <code>http://...</code>.</p>
<p>You will be given a Client ID and a Client Secret, that is, a shorter and a longer string of numbers and letters. We will need them later on, so you had better write them down. In case you want to know what they are useful for, here’s a short excerpt from <a href="https://developer.github.com/v3/oauth/">GitHub’s OAuth API description</a>:</p>
<blockquote>OAuth2 is a protocol that lets external apps request authorization to private details in a user's GitHub account without getting their password. This is preferred over Basic Authentication because tokens can be limited to specific types of data, and can be revoked by users at any time.
All developers need to register their application before getting started. A registered OAuth application is assigned a unique Client ID and Client Secret. The Client Secret should not be shared.</blockquote>
<hr />
<p><em>Step 6:</em> Make sure you have Ruby installed. Jekyll-Auth depends on Ruby (and other stuff).</p>
<hr />
<p><em>Step 7:</em> Make sure you have <a href="http://bundler.io/">Ruby’s bundler</a> installed. (To check if you have, simply call <code>bundle --version</code> in your shell.)</p>
<hr />
<p><em>Step 8:</em> Download the content of <a href="https://github.com/benbalter/jekyll-auth">the original Jekyll-Auth repository from GitHub</a>. Whether you create a local clone or download it as a zip file does not matter. Copy all the files from the downloaded Jekyll-Auth repo (except the hidden <em>.git</em> directory and everything contained in it) into your local website repository. The local website repository now contains both your website content and the content obtained from Jekyll-Auth. (If you created a local clone of the Jekyll-Auth repository, you can delete it now. It will no longer be used.)</p>
<hr />
<p><em>Step 9:</em> Navigate to your local website repository. There should be a <code>Gemfile</code> in your repository’s directory. Change this file so that it looks like this:</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="n">source</span> <span class="s2">"https://rubygems.org"</span>
<span class="n">gem</span> <span class="s1">'jekyll-auth'</span></code></pre></figure>
<p>If you receive an error message stating ‘certificate verify failed’ this refers to using <em>https</em> instead of <em>http</em> in your Gemfile. In case you don’t care about a secured connection, change the Gemfile like this:</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="n">source</span> <span class="s2">"http://rubygems.org"</span>
<span class="n">gem</span> <span class="s1">'jekyll-auth'</span></code></pre></figure>
<p>Another problem connected to this step suddenly occurred, and it took me quite some time to figure out what was wrong. At one point, pushing my local repository to Heroku with <code>git push heroku master</code> suddenly failed; the error message said something about a method <code>ssl?</code> not being found. After searching for a significant amount of time I found <a href="https://devcenter.heroku.com/articles/bundler-windows-gemfile">an article on how to deploy Ruby applications to Heroku from a Windows environment</a>. The problem is that if you are working on a Windows machine (as I do) then you <em>must</em> specify a version in your <em>Gemfile</em>. Your entry should thus look similar to this:</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="n">source</span> <span class="s2">"https://rubygems.org"</span>
<span class="n">gem</span> <span class="s1">'jekyll-auth'</span><span class="p">,</span> <span class="s1">'0.6.1'</span></code></pre></figure>
<p>Heroku is running a Linux environment, and some required libraries differ from the ones used in Windows. If you don’t specify a version, then <code>git push heroku master</code> will call <code>rake assets:precompile</code>, which again will call the <code>bundle exec jekyll-auth build</code> command. The Gemfile specified locally in your Windows environment will create a <em>Gemfile.lock</em> file, which contains a list of the exact versions used locally. But this file will not be submitted from your local Windows computer to Heroku’s Linux remote environment; only the Gemfile will be used. By default, the Gemfile simply downloads the most current version of Jekyll-Auth from rubygems.org, which can be a different version than the one you used locally. This behavior can lead to all sorts of strange compilation problems. Hence, you really should specify a Jekyll-Auth version. You can actually check on your local computer which versions of Jekyll-Auth are installed. On my machine I can find all the different versions under <code>C:\Ruby200-x64\lib\ruby\gems\2.0.0\gems</code>.</p>
<hr />
<p><em>Step 10:</em> Then run <code>bundle install</code>. You might see a warning that <code>DL is deprecated, please use Fiddle</code>, which you can safely ignore.</p>
<hr />
<p><em>Step 11a:</em> Still inside your local clone’s directory, you can now run <code>jekyll-auth new</code> to set up Jekyll-Auth and connect it to your Heroku app. Follow all of these steps.</p>
<hr />
<p><em>Step 11b:</em></p>
<figure class="highlight"><pre><code class="language-console" data-lang="console"><span class="c">...
</span><span class="go">Would you like to set up Heroku now? (Y/n)</span></code></pre></figure>
<p>Type <code>Y</code> and hit <code>Enter</code>.</p>
<hr />
<p><em>Step 11c:</em></p>
<figure class="highlight"><pre><code class="language-console" data-lang="console"><span class="go">If you already created an app, enter it's name
otherwise, hit enter, and we'll get you set up with one.
Heroku App name?</span></code></pre></figure>
<p>We have already created a Heroku app in step 4 with the name <em>my-new-cool-heroku-app</em>. Type <code>my-new-cool-heroku-app</code> and hit <code>Enter</code>.</p>
<hr />
<p><em>Step 11d:</em></p>
<figure class="highlight"><pre><code class="language-console" data-lang="console"><span class="c">...
</span><span class="go">Git remote heroku added
Awesome. Let's teach Heroku about our GitHub app.
What's your GitHub Client ID?</span></code></pre></figure>
<p>Here we need to enter the GitHub OAuth Client ID. Copy and paste the Client ID you received in step 5.</p>
<hr />
<p><em>Step 11e:</em></p>
<figure class="highlight"><pre><code class="language-console" data-lang="console"><span class="c">...
</span><span class="go">What's your GitHub Client Secret?</span></code></pre></figure>
<p>Then enter the GitHub Client Secret. Copy and paste the Client Secret you received in step 5.</p>
<hr />
<p><em>Step 11f:</em></p>
<figure class="highlight"><pre><code class="language-console" data-lang="console"><span class="c">...
</span><span class="go">What's your GitHub Team ID?</span></code></pre></figure>
<p>Enter the GitHub Team ID. The team ID is an integer number consisting of roughly six or seven digits, like <code>1234567</code>. Using the team’s name will <em>not</em> work! I have created <a href="/2015/01/16/how-to-find-a-github-team-id/">an extra post on how to find your team ID</a>. Be aware that you <em>cannot</em> use a private (paid or unpaid) account’s username; it <em>must</em> be a team created with an organizational account.</p>
<hr />
<p><em>Step 12:</em>
We are not yet ready to push our local clone of Jekyll-Auth to the remote Heroku server. We still need to add the Gemfile.lock to the repository:</p>
<figure class="highlight"><pre><code class="language-console" data-lang="console"><span class="go">git add -f Gemfile.lock
git commit -m "Added Gemfile.lock"</span></code></pre></figure>
<p>Be aware that we use the <code>-f</code> parameter to force adding this file. If you do not provide the parameter, git might refuse to add the file to the repository because it is ignored in <code>.gitignore</code>. In that case you might end up with the following error message:</p>
<figure class="highlight"><pre><code class="language-shell-session" data-lang="shell-session"><span class="go">git push heroku master
Counting objects: 46, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (37/37), done.
Writing objects: 100% (46/46), 10.14 KiB, done.
Total 46 (delta 6), reused 0 (delta 0)
remote: Compressing source files... done.
remote: Building source:
remote:
</span><span class="gp">remote: -----></span><span class="w"> </span>Ruby app detected
<span class="gp">remote: -----></span><span class="w"> </span>Compiling Ruby/NoLockfile
<span class="go">remote: !
remote: ! Gemfile.lock required. Please check it in.
remote: !
remote:
remote: ! Push rejected, failed to compile Ruby app
remote:
remote: Verifying deploy...
remote:
remote: ! Push rejected to edb-website.
remote:
To https://git.heroku.com/my-new-cool-herokuapp.git
</span><span class="gp"> ! [remote rejected] master -></span><span class="w"> </span>master <span class="o">(</span>pre-receive hook declined<span class="o">)</span>
<span class="go">error: failed to push some refs to 'https://git.heroku.com/my-new-cool-herokuapp.git'</span></code></pre></figure>
<hr />
<p><em>Step 13:</em>
In the <em>_config.yml</em> file you can specify which directories and files are accessible without authentication and which are not. By default, all access requires authorization except for the <em>drafts</em> directory. You can actually use regular expressions to specify the files and directories. The following denies access to all parts of the site except for the <em>drafts</em> directory.</p>
<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">jekyll_auth</span><span class="pi">:</span>
<span class="na">whitelist</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">drafts?</span></code></pre></figure>
<p>Using regexes, you can also reverse the logic, allowing access to everything except <em>drafts</em>:</p>
<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">jekyll_auth</span><span class="pi">:</span>
<span class="na">whitelist</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s2">"</span><span class="s">^((?!draft).)*$"</span></code></pre></figure>
<hr />
<p><em>Step 14:</em>
Now we are finally ready to push everything to the remote Heroku server: <code>git push heroku master</code>.</p>
<hr />
<p><em>Step 15:</em>
Open a browser and navigate to the Heroku URL <code>https://my-new-cool-heroku-app.herokuapp.com</code>. You should be automatically redirected to a GitHub page asking for authorization: <code>Authorize application - my-new-cool-heroku-app by @foo-organization/foo-team - would like permission to access your account</code>. You can click on <code>Authorize application</code>.</p>
<hr />
<p>Hope this helped. I ran into a <a href="https://github.com/benbalter/jekyll-auth/issues/36">few problems which I discussed here</a>.</p>
Error Correction Model2014-11-10T00:00:00+01:00http://fabian-kostadinov.github.io/2014/11/10/error-correction-model<p>Here are two great introductory videos on error correction models.<span class="more"></span></p>
<p><em>Part 1:</em></p>
<iframe width="420" height="315" src="//www.youtube.com/embed/wYQ_v_0tk_c" frameborder="0" allowfullscreen=""></iframe>
<p><em>Part 2:</em></p>
<iframe width="420" height="315" src="//www.youtube.com/embed/xVIkb-QeZ40" frameborder="0" allowfullscreen=""></iframe>
Testing Time Series for Cointegration2014-11-04T00:00:00+01:00http://fabian-kostadinov.github.io/2014/11/04/testing-timeseries-for-cointegration<p>Cointegration is an important concept when dealing with time series data. Here’s the corresponding <a href="http://en.wikipedia.org/wiki/Cointegration">definition on Wikipedia</a>:</p>
<blockquote>Cointegration is a statistical property of time series variables. Two or more time series are cointegrated if they share a common stochastic drift.</blockquote>
<p><span class="more"></span></p>
<p>In other (rather non-scientific) words, if both time series are non-stationary <em>and</em> they share a common trend (which can be explained through the existence of a common cause), then they are <em>cointegrated</em>.</p>
<p>Cointegration is not the same as correlation!</p>
<iframe width="560" height="315" src="//www.youtube.com/embed/vrryb49jbIo" frameborder="0" allowfullscreen=""></iframe>
<div class="message">
<em>Correlation</em> measures the co-movement between two time series, it answers the question: How much do they move together? It does not guarantee that the two measures stay close to each other in the long-run.<br />
<br />
<em>Cointegration</em> means that two time series will not deviate substantially from each other, yet when they do, the gap will be closed sooner or later again.</div>
<p>For this reason, it is certainly possible for two time series to be correlated but not cointegrated, cointegrated but not correlated, both or none.</p>
<p>Cointegration is an often encountered feature of economic or financial time series. A typical textbook example is a country’s consumption and income. The more people earn, the more they have left to consume - of course assuming stable prices. Both time series usually grow over time. We would therefore expect consumption and income to be cointegrated time series.</p>
<p>Or consider an investor who wants to build a portfolio. For cointegrated stocks, a significant deviation between the two stocks will soon close again. An example would probably be gold and silver prices. If the two deviate significantly from each other, an arbitrage opportunity exists. (Someone could buy the relatively cheaper metal and sell the relatively more expensive metal, waiting for the gap to close again.) There are some really interesting articles out there on this topic, see the references section.</p>
<p>What we need is a statistical test for cointegration. There are different such tests, but the most common one is probably the <a href="http://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test">Augmented Dickey-Fuller (ADF) test</a>. The ADF test statistic is typically negative; the more negative it is, the stronger the evidence against the null hypothesis - “There is no cointegration present in the compared time series.” Whereas the ADF test is available in nearly all statistics software, unfortunately there is no simple Excel formula for it. (There is however an <a href="http://www.web-reg.de/adf_addin.html">AddIn provided by Kurt Annen</a>.)</p>
<p>For the statistics software R, there is a great introductory article written by <a href="http://quanttrader.info/public/">Paul Teetor</a> available at <a href="http://quanttrader.info/public/testForCoint.html">http://quanttrader.info/public/testForCoint.html</a>. Besides explaining how to calculate an ADF test, it also shows all the steps how to import your data into R from a CSV file and how to prepare it for analysis.</p>
<p>This involves three steps.</p>
<p>First, we calculate a measure for the “co-movement” of both series. For this purpose, we use a simple linear regression formula between the two time series. It does not really matter which one is selected as the “dependent” and which one as the “independent” series, because we do not claim that there exists a “dependency relation” between the two. Be aware that we are not interested in the intercept, but only in the <em>beta (β)</em>, that is the regression coefficient. This <em>beta</em> tells us something about how strongly a change in one time series is accompanied by a corresponding change in the other time series. Therefore, we want to solve the following regression formula:</p>
<p><code>X<sub>time series 1</sub> = β * X<sub>time series 2</sub> + ε</code></p>
<p>In R, we can use the <code>lm</code> function to solve this regression formula, in Excel 2013 we can perform a regression analysis (under Data -> Data Analysis -> Regression).</p>
<p>Second, we can calculate a new time series of “spreads” or “differences” between values of the two original time series using the formula:</p>
<p><code>spread<sub>t</sub> = x<sub>t, time series 1</sub> - β * x<sub>t, time series 2</sub></code></p>
<p>Third, we apply the ADF test on the new time series of spreads. Our null hypothesis is: “The spread time series is non-stationary.” If we can reject this null hypothesis at, let’s say, the 95% level, then we can accept the alternative hypothesis: “The spread time series is indeed stationary.” In R, there is for instance the <code>adf.test</code> function.</p>
<p><code>adf.test(spreads, alternative="stationary", k=0)</code></p>
<p>The function returns a <em>Dickey Fuller</em> statistical value, and, thankfully, also a probability value <em>p-value</em> which can be interpreted more easily. If the <em>p-value</em> is < 0.05 (critical 5% threshold) then the spread is likely to be mean-reverting, which means that the two time series are likely to be cointegrated. Otherwise the spread is not mean-reverting, thus the two time series are unlikely to be cointegrated.</p>
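<p>Putting the three steps together, a minimal R sketch might look as follows. The file and column names are placeholders; <code>adf.test</code> comes from the <code>tseries</code> package:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(tseries)  # provides adf.test()

# Step 0: load the two price series (file and column names are placeholders)
y <- read.csv("series1.csv")$Close
x <- read.csv("series2.csv")$Close

# Step 1: regress one series on the other without an intercept to obtain beta
beta <- coef(lm(y ~ x + 0))[1]

# Step 2: build the spread series
spreads <- y - beta * x

# Step 3: ADF test on the spreads; p-value < 0.05 suggests a stationary,
# mean-reverting spread and hence cointegration
adf.test(spreads, alternative = "stationary", k = 0)</code></pre></figure>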
<hr />
<h1 id="references">References</h1>
<p>This is a very good description on what cointegration between time series is:</p>
<ul>
<li><a href="http://www.kva.se/globalassets/priser/nobel/2003/sciback_ek_en_03.pdf">http://www.kva.se/globalassets/priser/nobel/2003/sciback_ek_en_03.pdf</a></li>
</ul>
<p>Gekkoquant.com has some nice articles on ADF tests, cointegration, statistical arbitrage etc.:</p>
<ul>
<li><a href="http://gekkoquant.com/2012/12/17/statistical-arbitrage-testing-for-cointegration-augmented-dicky-fuller/">http://gekkoquant.com/2012/12/17/statistical-arbitrage-testing-for-cointegration-augmented-dicky-fuller/</a></li>
<li><a href="http://gekkoquant.com/2012/10/21/statistical-arbitrage-correlation-vs-cointegration/">http://gekkoquant.com/2012/10/21/statistical-arbitrage-correlation-vs-cointegration/</a></li>
</ul>
<p>Another excellent blog with various good articles on the topic is by <a href="http://epchan.blogspot.ch">Ernest Chan</a>. Search for “Cointegration” on the blog to find many more articles like these two:</p>
<ul>
<li><a href="http://epchan.blogspot.ch/2006/11/cointegration-is-not-same-as.html">http://epchan.blogspot.ch/2006/11/cointegration-is-not-same-as.html</a></li>
<li><a href="http://epchan.blogspot.ch/2013/11/cointegration-trading-with-log-prices.html">http://epchan.blogspot.ch/2013/11/cointegration-trading-with-log-prices.html</a></li>
</ul>
<p>Article on Pairs Trading at Godotfinance.com:</p>
<ul>
<li><a href="http://www.godotfinance.com/pdf/PairsTrading.pdf">http://www.godotfinance.com/pdf/PairsTrading.pdf</a></li>
</ul>
Evolving Trading Strategies With Genetic Programming - GP Parameters and Operators2014-11-01T00:00:00+01:00http://fabian-kostadinov.github.io/2014/11/01/evolving-trading-strategies-with-genetic-programming-gp-parameters-and-operators<h1 id="part-4">Part 4</h1>
<p>Genetic Programming at its core uses a set of operators (selection, mutation, crossover, elitism etc.) and parameters (number of generations, population size etc.). As there is a <a href="http://cswww.essex.ac.uk/staff/poli/gp-field-guide/">vast literature</a> on this subject, I will skip the basics and assume that the reader is already familiar with the topic.<span class="more"></span></p>
<p>The first thing to understand about GP parameters and operators is that they essentially make up a complex system. Changes in one parameter might or might not affect the way other parameters behave. Small changes in one parameter might lead to non-linear changes in the whole evolutionary outcome; at other times a big change in a parameter might not change the overall result at all. A good example is the population size. Beginners new to GP often think that the bigger the population size, the better. After all, the more individuals in a population, the higher the overall chance that a good search result will be found, right? Quite wrong. Often a small (but not too small!) population size is actually preferable to a bigger one. There is a tradeoff between having more individuals in a population and increasing the overall level of noise: the more individuals per population, the lower the chance that a single, relatively fit individual will be able to create offspring. Hence, the level of noise increases. Some commercially available GP software systems are advertised as being able to process thousands of individuals in a very short time. From my experience, if you need more than, let’s say, 500 individuals per population, you are probably doing something wrong. So, let’s take a closer look at some parameters and operators.</p>
<h2 id="population-size">Population Size</h2>
<p>I already mentioned the population size. Try to start with a small population of maybe 100 individuals. This will also decrease the computing time. As said above, increasing the population size will not necessarily lead to better results due to the increased level of noise.</p>
<h2 id="number-of-generations">Number of generations</h2>
<p>If your GP algorithm works, you will observe that in the early stages of the evolutionary process fitter individuals can be found relatively easily, although sometimes a few generations might pass without any improvement before another big improvement happens. After a number of generations, improvements occur less frequently. Try starting with maybe 10 to 15 generations only. If there is only marginal improvement from the first to the last generation, then you are probably doing something wrong. It is also useful to introduce a <em>stall generations counter</em> and stop the evolutionary process if after a certain number of generations no further improvement was observed.</p>
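<p>Here is a minimal R sketch of such a stall counter. The functions <code>evolve_one_generation</code> and <code>fitness</code> are hypothetical placeholders for your own GP implementation, and the stall limit is an assumption:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">max_generations <- 15
max_stall <- 5        # assumed stall limit
best_fitness <- -Inf
stall <- 0
# 'population' is assumed to be initialized beforehand
for (gen in 1:max_generations) {
  population <- evolve_one_generation(population)  # hypothetical user-defined step
  gen_best <- max(sapply(population, fitness))     # 'fitness' is a placeholder
  if (gen_best > best_fitness) {
    best_fitness <- gen_best
    stall <- 0
  } else {
    stall <- stall + 1
  }
  if (stall >= max_stall) break  # stop once improvement has stalled
}</code></pre></figure>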
<h2 id="number-of-decision-trees-and-tree-size">Number of decision trees and tree size</h2>
<p>I have already written on <a href="/2014/09/03/evolving-trading-strategies-with-genetic-programming-encoding-trading-strategies/">how to encode decision trees</a>. Once more there is a tradeoff. The more complicated your overall decision rules, both in terms of tree depth/size and the number of subtrees used, the higher the chance of overfitting. Less complex decision rules are nearly always preferable, yet it will probably still make sense to use dedicated decision trees for long and short rules. I will focus on the subject of punishing complexity by means of parsimony pressure in a later article.</p>
<h2 id="genetic-programming-algorithm">Genetic Programming Algorithm</h2>
<p>Although the <a href="http://www.geneticprogramming.com/Tutorial/index.html">basic GP algorithm</a> is relatively simple, there are various alternatives to it, for example <a href="http://en.wikipedia.org/wiki/Linear_genetic_programming">Linear Genetic Programming</a>, <a href="http://researchcommons.waikato.ac.nz/bitstream/handle/10289/6838/Cartesian%20genetic%20programming%20for%20trading.pdf?sequence=1">Cartesian Genetic Programming</a>, <a href="http://www.cleveralgorithms.com/nature-inspired/evolution/nsga.html">Non-dominated Sorting Genetic Algorithm (NSGA)</a>, <a href="http://www.cleveralgorithms.com/nature-inspired/evolution/spea.html">Strength Pareto Evolutionary Algorithm (SPEA)</a> and many others. The choice of a particular alternative often has implications on how the individuals are encoded, on the fitness function, and also on the implementation of the selection, mutation and crossover operators. Traditionally, the mutation, crossover and elitism operators are used mutually exclusively and not in combination. That is, one of these operators is selected by an operator selection strategy, and only this operator is then applied. For instance, two children produced as offspring are not additionally subject to mutation, although this would of course be possible.</p>
<h2 id="selection">Selection</h2>
<p>A selection operator is a strategy for selecting one or several individuals from a pool for a specific purpose such as mutation or crossover. There are many selection strategies, and it really depends on the context which one to use. Every selection strategy relies on a certain criterion to compare individuals. In most cases the best (in terms of fitness) individual is wanted, but sometimes it can also be the worst individual. In case of multi-objective fitness functions, the selection process might become quite complicated. Often, a selection operator must select two individuals and yet ensure not to select the same individual twice. Selecting only the best individuals for reproduction is not wise, because this could lead to a premature convergence to a local instead of global optimum in the search space. A balanced approach is necessary which gives better individuals a higher chance to be selected for reproduction while still continuing to also select weaker individuals. This keeps a balance between survival pressure and leaving room for new solutions to appear and be explored.<br />
There are <a href="http://geneticprogramming.us/Selection.html">various selection methods</a>, but probably the most popular selection strategy is <a href="http://en.wikipedia.org/wiki/Tournament_selection"><em>tournament selection</em></a>. Tournament selection is a two-step process. First, a few (e.g. 7) individuals are selected randomly - the “tournament”. Second, one or two individuals are selected from the tournament according to their fitness. A larger tournament size gives weaker individuals an overall lower chance to be selected.</p>
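<p>A minimal R sketch of tournament selection might look like this, assuming each individual is a list carrying a numeric <code>fitness</code> field (a hypothetical representation):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">tournament_select <- function(population, tournament_size = 7) {
  # draw the tournament at random, without replacement
  idx <- sample(seq_along(population), tournament_size)
  fits <- sapply(population[idx], function(ind) ind$fitness)
  population[[idx[which.max(fits)]]]  # the fittest contestant wins
}</code></pre></figure>
<p>Calling <code>tournament_select(population)</code> twice (excluding the first winner from the second draw) would yield the two parents needed for crossover.</p>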
<h2 id="elitism">Elitism</h2>
<p>Sometimes, it might be a good idea to simply allow the best one or two individuals in a population to be copied over to the next generation, because otherwise they might be lost. The problem with elitism is generally that without further measures taken the later generations might be filled up with many copies of identical individuals. Comparing individuals for equality might however be a computationally expensive operation. I personally prefer not to use elitism for this reason.</p>
<h2 id="mutation">Mutation</h2>
<p>Mutation should only occur with a low probability, e.g. in 0% - 2% of cases. Some GP implementations vary the probability of mutations during the evolutionary process. For the different versions of mutation (e.g. point mutation or subtree mutation), consult <a href="http://cswww.essex.ac.uk/staff/poli/gp-field-guide/52GPMutation.html">one of the many literature sources</a> on the subject. Be aware that mutation must comply with the <a href="/2014/09/03/evolving-trading-strategies-with-genetic-programming-encoding-trading-strategies/">typing of nodes in a decision tree</a>.</p>
<h2 id="crossover">Crossover</h2>
<p>Crossover is the GP operator with the highest probability (e.g. 80% - 100%) of being chosen to produce offspring. Using typed nodes ensures that only valid and meaningful offspring is created by the crossover operator. One problem is that during the evolutionary process this operator has a tendency to increase the average size of the decision trees and thus lead to code bloat. Often, the fittest individuals are the ones actually overfitting the historical time series. They are also the ones with the biggest decision rules. The selection operator picks them with the highest probability, and therefore the crossover operator subsequently produces offspring with increased decision tree size. The proper counter-measure is parsimony pressure, which I will write about in a later post.</p>
<h2 id="the-problem-of-evolving-structurally-similar-individuals">The problem of evolving structurally similar individuals</h2>
<p>A common problem is that the evolutionary process often has a tendency to evolve many nearly identical trading strategies over the course of time. The population fills up with a few “basic types” of individuals which are structurally very similar to each other and hence have very similar fitness values. To get the picture, imagine the same trading strategy for multiple individuals, but each one having a very slight change in a single parameter only. There are different reasons why this phenomenon occurs. One is that the evolution of fitness of the population’s best individual is discontinuous. If there is a significant “jump” in the best fitness from one generation to the next, chances are high that the best individual will stay the fittest for the next few generations - until another such jump occurs. This is similarly true for the second and the third best individuals in the population. Therefore, for several generations in a row the few fittest individuals will have the highest chances of creating offspring, and naturally the offspring will be structurally similar to its parents. As an effect, more and more structurally similar individuals fill up the first positions in the population, thus effectively reducing the genetic variety in the whole population.</p>
<p>It is often quite challenging to find a satisfying solution for this problem. One possibility is to keep the mutation parameter at a high level. Another solution is to introduce adaptive GP parameters, for example varying the mutation and/or crossover probabilities throughout the evolutionary process, or making dynamic changes to the selection operators. Unfortunately, these alternatives are often not very effective and can be quite tricky to implement or calibrate properly. A computationally very expensive solution is to compare trading strategies in each generation and disallow the population to contain structurally equal individuals. This requires comparing each individual with every other when building the next generation’s population, which will slow down the evolutionary process to a great extent.<br />
Interestingly, from my experience this problem is significantly reduced when switching from a single-objective to a powerful multi-objective fitness function/evolutionary process such as the <a href="http://www.cleveralgorithms.com/nature-inspired/evolution/nsga.html">non-dominated sorting genetic algorithm</a> (NSGA) and the <a href="http://www.cleveralgorithms.com/nature-inspired/evolution/spea.html">strength pareto evolutionary algorithm</a> (SPEA). These algorithms inherently define a different approach to selecting individuals for reproduction, one that reduces the desired variety in the gene pool much less than single-objective fitness functions do. I will write more on these two algorithms in a later article.</p>
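<p>Regarding the computationally expensive duplicate check mentioned above, one cheap approximation is to compare canonical serializations of the decision trees instead of the trees themselves. A rough R sketch, assuming each individual stores its tree as an R object in a <code>tree</code> field (a hypothetical representation):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># serialize each tree into a single canonical string
serialize_tree <- function(ind) paste(deparse(ind$tree), collapse = "")

# keep only the first individual of every structural variant
keys <- sapply(population, serialize_tree)
population <- population[!duplicated(keys)]</code></pre></figure>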
<hr />
<h2 id="other-articles">Other articles</h2>
<h3 id="previous">Previous</h3>
<ul>
<li>Part 3: <a href="/2014/09/22/evolving-trading-strategies-with-genetic-programming-data/">Evolving Trading Strategies With Genetic Programming - Data</a></li>
</ul>
<h3 id="next">Next</h3>
<ul>
<li>Part 5: <a href="/2014/12/22/evolving-trading-strategies-with-genetic-programming-fitness-functions/">Evolving Trading Strategies With Genetic Programming - Fitness Functions</a></li>
</ul>
Significance Testing of Pearson Correlations in Excel2014-10-30T00:00:00+01:00http://fabian-kostadinov.github.io/2014/10/30/significance-testing-of-pearson-correlations-in-excel<p>Yesterday, I wanted to calculate the significance of Pearson correlation coefficients between two series of data. I knew that I could use a Student’s t-test for this purpose, but I did not know how to do this in Excel 2013. And, to be honest, I did not really understand the documentation of Excel’s <code>T.TEST</code> formula. So, here is what I did.<span class="more"></span></p>
<h2 id="pearson-correlation-coefficient">Pearson correlation coefficient</h2>
<p>First, I had to calculate the corresponding Pearson correlation coefficients according to this formula:</p>
<p><img src="/public/img/20141030-pearson-correlation-coefficient-formula.png" alt="Pearson correlation coefficient formula" title="Pearson correlation coefficient formula" /></p>
<p>where <em>r<sub>xy</sub></em> is the Pearson correlation coefficient, <em>n</em> the number of observations in one data series, <em><span style="text-decoration: overline;">x</span></em> the arithmetic mean of all <em>x<sub>i</sub></em>, <em><span style="text-decoration: overline;">y</span></em> the arithmetic mean of all <em>y<sub>i</sub></em>, <em>s<sub>x</sub></em> the standard deviation for all <em>x<sub>i</sub></em>, and s<sub>y</sub> the standard deviation for all <em>y<sub>i</sub></em>.</p>
<p>Let’s assume, the data series to be correlated are stored in arrays <code>A1:A100</code> and <code>B1:B100</code>, thus <em>n = 100</em>:</p>
<p><code>=PEARSON(A1:A100;B1:B100)</code></p>
<p>Alternatively, you could also use the Correl function, which returns the same result:</p>
<p><code>=CORREL(A1:A100;B1:B100)</code></p>
<p>(I am using a Swiss German localization, therefore Excel’s delimiter for formula arguments is a semicolon <code>;</code> rather than a comma <code>,</code> in my case.)</p>
<p>Naturally, the returned correlation value is in the range of -1.0 to +1.0. This value is often referred to as <em>Pearson r</em> or r<sub>xy</sub> in our case.</p>
<h2 id="t-value">t-value</h2>
<p>Next, I calculated the corresponding t-values according to this formula:</p>
<p><img src="/public/img/20141030-t-value-formula.png" alt="t-value formula" title="t-value formula" /></p>
<p>where <em>t</em> is the t-value, which can be positive for positive correlations or negative for negative correlations, <em>r<sub>xy</sub></em> is the already calculated Pearson correlation coefficient, and <em>n</em> is the number of observations again (here <em>n = 100</em>).</p>
<h2 id="significance-testing">Significance testing</h2>
<p>Finally, I needed to decide whether the computed t-values were actually significant or not. For this purpose, we need to compare them to pre-calculated t-values available in a <a href="http://en.wikipedia.org/wiki/Student%27s_t-distribution#Table_of_selected_values">t-value table</a>. To do so, it is necessary to decide upon two things:</p>
<ol>
<li>What is the desired significance or confidence level, e.g. 95% or 99%?</li>
<li>Do I want to use a one-tailed or a two-tailed t-test?</li>
</ol>
<p>In my case, I decided to use a 95% significance level (which is a very common choice for data that are not highly critical). What confused me for some time is that some sources like the already linked <a href="http://en.wikipedia.org/wiki/Student%27s_t-distribution#Table_of_selected_values">Wikipedia article</a> express confidence <strong>positively</strong> by stating the degree of certainty (e.g. 0.95 or 95%), whereas others <a href="http://www.socr.ucla.edu/applets.dir/t-table.html">like this one</a> express confidence <strong>negatively</strong> by stating the probability of being wrong (e.g. 0.05 or 5%). Yet other sources <a href="http://www.medcalc.org/manual/t-distribution.php">like this one</a> actually state both in a single table. <em>It is essential that you know which of the two the source you are using refers to!</em> You can always carefully compare the t-values contained in different t-tables with each other to get an understanding of which perspective a source takes. Just check the t-table’s header line containing the probability values. Usually, the t-values are in ascending order from left to right, and in descending order from top to bottom. Hence:</p>
<div class="message">If you want to test the significance of a positive correlation, then you must check whether your t-value <i>t</i> is greater than a certain critical positive t-value <i>t<sub>crit</sub></i> at the right side of the t-distribution: Is t > t<sub>crit</sub>?<br />
<br />
If you want to test the significance of a negative correlation, then you must check whether your t-value <i>t</i> is below a certain critical negative t-value <i>t<sub>crit</sub></i> at the left side of the t-distribution: Is t < t<sub>crit</sub>?</div>
<p>Usually, a t-table only includes positive t-values, but not negative ones. How can we then test for the significance of negative correlations? It’s really easy. t-distributions are symmetric with a median of 0, thus their left and right tails look exactly the same. For this reason, we can change our test from “Is t < t<sub>crit</sub>?” to “Is ABS(t) > t<sub>crit</sub>?”. Instead of checking whether our t-value is below a critical negative threshold, we actually check whether its absolute value is greater than a critical positive threshold.</p>
<p>In case you do not have a very clear idea whether the expected correlation is actually positive or negative, because it could be both, it is better to use a two-tailed t-test. You can use a one-tailed t-test if you are only interested in one direction of the correlation but not in the other (e.g. only positive but not negative, or only negative but not positive). Always remember:</p>
<div class="message"><center><em>The probability value of a two-tailed t-value is 2x the probability value of a one-tailed t-value:</em><br /><br />
<strong>p(t) two-tailed = 2 * p(t) one-tailed</strong></center></div>
<p>The t-table contains in the first column the <em>degrees of freedom</em>. This is usually the number of observations <em>n</em> (i.e. 100) minus some value depending on the context. When computing significances for Pearson correlation coefficients, this value is 2: <code>degrees of freedom = n - 2</code>.</p>
<p>We now have all information needed to perform the significance test. i) We have decided upon a confidence level of 95%, ii) we have decided to use a two-tailed t-test, iii) we have calculated the degrees of freedom to be 98 = 100 - 2. Looking up the critical threshold <em>t<sub>crit</sub></em> in the t-table we find that it is 1.984. Therefore:</p>
<ul>
<li>If r<sub>xy</sub> > 0 AND t > 1.984 then the Pearson correlation coefficient is significantly positive. This would be reported more compactly as: r<sub>xy</sub>(98) = <em>value</em>, p < 0.05 (two-tailed), where <em>value</em> is of course the calculated Pearson correlation coefficient.</li>
<li>If r<sub>xy</sub> < 0 AND ABS(t) > 1.984 then the Pearson correlation coefficient is significantly negative.</li>
<li>In all other cases the result is not significant.</li>
</ul>
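<p>For reference, the whole procedure condenses to a few lines of R (a sketch; <code>x</code> and <code>y</code> are assumed to be numeric vectors of length 100):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">n <- 100
r <- cor(x, y)                         # Pearson r, same as Excel's PEARSON/CORREL
t <- r * sqrt(n - 2) / sqrt(1 - r^2)   # t-value from the formula above
t_crit <- qt(1 - 0.05/2, df = n - 2)   # two-tailed critical value, ~1.9845
significant <- abs(t) > t_crit         # covers positive and negative correlations
p <- 2 * pt(-abs(t), df = n - 2)       # two-tailed p-value</code></pre></figure>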
<hr />
<h2 id="using-excel-for-t-tests">Using Excel for t-tests</h2>
<h3 id="tinv-tinv2t-tdist-and-tdist2t">TINV, T.INV.2T, TDIST, and T.DIST.2T</h3>
<p>Of course, calculating critical t-values can be done in Excel too. Before Excel 2010, there were only the TINV and TDIST formulas; now there are additionally the T.INV.2T and the T.DIST.2T formulas. All these formulas express confidence <em>negatively</em>, that is the probability value <em>p</em> represents the probability of being wrong.</p>
<table>
<tr>
<td><code>=TINV(p;df)</code></td>
<td>returns a t-value <i>t</i> for the given probability <i>p</i> and degrees of freedom <i>df</i>, assuming a two-tailed test. <code>=TINV(p;df)</code> is equivalent to <code>=T.INV.2T(p;df)</code>.</td>
</tr>
<tr>
<td><code>=T.INV.2T(p;df)</code></td>
<td>returns a t-value <i>t</i> for the given probability <i>p</i> and degrees of freedom <i>df</i>, assuming a two-tailed test. <code>=T.INV.2T(p;df)</code> is equivalent to <code>=TINV(p;df)</code>.</td>
</tr>
<tr>
<td><code>=TDIST(t;df;num-tails)</code></td>
<td>returns a probability value <i>p</i> for the given t-value <i>t</i>, the degrees of freedom <i>df</i>, and the number of tails <i>num-tails</i> (either 1 or 2). <code>=TDIST(t;df;2)</code> is equivalent to <code>=T.DIST.2T(t;df)</code>. Furthermore, <code>=2*TDIST(t;df;1)</code> is equivalent to <code>=T.DIST.2T(t;df)</code>.</td>
</tr>
<tr>
<td><code>=T.DIST.2T(t;df)</code></td>
<td>returns a probability value <i>p</i> for the given t-value <i>t</i>, the degrees of freedom <i>df</i>, and a two-tailed t-test. <code>=T.DIST.2T(t;df)</code> is equivalent to <code>=TDIST(t;df;2)</code>.</td>
</tr>
</table>
<p>TINV is the inverse of the TDIST formula and vice versa.</p>
<p><em>Examples:</em> Let <em>p = 0.05</em> (5% probability of being wrong), <em>n = 100</em> (therefore <em>df = 100 - 2 = 98</em>). We assume a two-tailed t-test.</p>
<ul>
<li>Calculating <em>t</em> for a given <em>p</em>: <code>=TINV(0.05; 100-2)</code> = <code>=T.INV.2T(0.05; 100-2)</code> = 1.9844675.</li>
<li>Calculating <em>p</em> for a given <em>t</em>: <code>=TDIST(1.9844675;100-2;2)</code> = <code>=T.DIST.2T(1.9844675;100-2)</code> = <code>=2*TDIST(1.9844675;100-2;1)</code> = 0.05.</li>
</ul>
<h3 id="tinv-and-tdist">T.INV and T.DIST</h3>
<p>Since Excel 2010, there is also a T.INV and a T.DIST formula. <em>Confusingly, they actually work quite differently from TINV and TDIST!</em> First, unlike TINV and TDIST, T.INV and T.DIST by default are <em>one-tailed</em>. Second, unlike TINV and TDIST, T.INV and T.DIST actually express confidence <em>positively</em>, that is the probability value <em>p</em> represents the degree of certainty.</p>
<table>
<tr>
<td><code>=T.INV(p;df)</code></td>
<td>returns a t-value <i>t</i> for a given probability <i>p</i> and degrees of freedom <i>df</i>, assuming a one-tailed test. <code>=T.INV(p;df)</code> is equivalent to <code>=TINV(2*(1-p);df)</code>.</td>
</tr>
<tr>
<td><code>=T.DIST(t;df;TRUE)</code></td>
<td>returns a probability value <i>p</i> for the given t-value <i>t</i> and degrees of freedom <i>df</i>, assuming a one-tailed t-test. <code>=2*(1-T.DIST(t;df;TRUE))</code> is equivalent to <code>=TDIST(t;df;2)</code>. Furthermore, <code>=1-T.DIST(t;df;TRUE)</code> is equivalent to <code>=TDIST(t;df;1)</code>.</td>
</tr>
</table>
<p><em>Examples:</em> Let <em>p = 0.05</em> (5% probability of being wrong), <em>n = 100</em> (therefore <em>df = 100 - 2 = 98</em>). We assume a two-tailed t-test.</p>
<ul>
<li>Calculating <em>t</em> for a given <em>p</em>: <code>=T.INV(1-0.05/2;98)</code> = <code>=TINV(0.05;98)</code> = 1.9844675.</li>
<li>Calculating <em>p</em> for a given <em>t</em>: <code>=2*(1-T.DIST(1.9844675;98;TRUE))</code> = <code>=TDIST(1.9844675;98;2)</code> = 0.05.</li>
</ul>
<p>By the way, there is a nice <a href="http://www.real-statistics.com/students-t-distribution/t-distribution-basic-concepts/">blog post about t-tests in Excel</a>.</p>
Validation in JSONForm2014-09-22T00:00:00+02:00http://fabian-kostadinov.github.io/2014/09/22/validation-in-jsonform<p>Recently, I had created a user form with <a href="https://github.com/joshfire/jsonform">JSONForm</a>. However, the form was embedded in another site with its own <code>save</code> button. JSONForm usually adds its own <code>submit</code> button to the site, but <a href="https://github.com/joshfire/jsonform/wiki#fields-submit">you easily can remove it</a>. One of the problems left was how to trigger validation for the form manually once the site’s <code>save</code> button was clicked. Be aware that this implies triggering validation from outside the HTML <code>form</code> element. Whereas JSONForm’s <code>submit</code> button resides inside the <code>form</code> element as an <code><input type="button"></code>, my site’s <code>save</code> button does not. Unfortunately, the description in the JSONForm’s Wiki pages was not really understandable to me. It took me a long time but I finally succeeded to trigger JSONForm validation manually. Here’s what I did.<span class="more"></span></p>
<p>JSONForm’s validation relies on <a href="https://github.com/joshfire/jsonform/blob/master/deps/opt/jsv.js">its own adapted version of JSV.js</a>. You have to make sure that this library is accessible to your JSONForm code. In the simplest case you can include the Javascript in your HTML site before the jsonform library. (I renamed the file to <code>jsonform-jsv.js</code> to make clear that this is a modified version of JSV.js.)</p>
<figure class="highlight"><pre><code class="language-html" data-lang="html"><span class="nt"><html></span>
<span class="nt"><head></head></span>
<span class="nt"><body></span>
<span class="nt"><script </span><span class="na">type=</span><span class="s">"text/javascript"</span> <span class="na">src=</span><span class="s">"deps/jquery.min.js"</span><span class="nt">></script></span>
<span class="nt"><script </span><span class="na">type=</span><span class="s">"text/javascript"</span> <span class="na">src=</span><span class="s">"deps/underscore.js"</span><span class="nt">></script></span>
<span class="nt"><script </span><span class="na">type=</span><span class="s">"text/javascript"</span> <span class="na">src=</span><span class="s">"deps/opt/jsonform-jsv.js"</span><span class="nt">></script></span>
<span class="nt"><script </span><span class="na">type=</span><span class="s">"text/javascript"</span> <span class="na">src=</span><span class="s">"lib/jsonform.js"</span><span class="nt">></script></span>
<span class="nt"></body></span></code></pre></figure>
<p>In my case though all Javascript files were pre-compiled into a single, big, compressed file with <a href="http://browserify.org/">Browserify</a> and only this file was referenced from my HTML file. All Javascript libraries followed the <a href="http://requirejs.org/">Requirejs</a> standard. Hence, the - theoretically - correct way to add another library such as JSV.js would be to use a <code>require('JSV');</code> statement. In practice however this would not work, because doing so would actually imply loading the official version of JSV.js instead of JSONForm’s own adapted version. I finally settled on the solution of adding JSONForm’s version inside the HTML together with the pre-compiled Javascript file:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js"><span class="o"><</span><span class="nx">html</span><span class="o">></span>
<span class="o"><</span><span class="nx">head</span><span class="o">><</span><span class="sr">/head</span><span class="err">>
</span><span class="o"><</span><span class="nx">body</span><span class="o">></span>
<span class="o"><</span><span class="nx">script</span> <span class="nx">type</span><span class="o">=</span><span class="dl">"</span><span class="s2">text/javascript</span><span class="dl">"</span> <span class="nx">src</span><span class="o">=</span><span class="dl">"</span><span class="s2">jsonform-jsv.js</span><span class="dl">"</span><span class="o">><</span><span class="sr">/script</span><span class="err">>
</span> <span class="o"><</span><span class="nx">script</span> <span class="nx">type</span><span class="o">=</span><span class="dl">"</span><span class="s2">text/javascript</span><span class="dl">"</span> <span class="nx">src</span><span class="o">=</span><span class="dl">"</span><span class="s2">my-precompiled-js-lib.js</span><span class="dl">"</span><span class="o">><</span><span class="sr">/script</span><span class="err">>
</span><span class="o"><</span><span class="sr">/body></span></code></pre></figure>
<p>Maybe not the most elegant solution, but it worked. It is important to understand that if you do not provide a JSV library, then clicking JSONForm’s <code>submit</code> button will actually trigger the HTML5 compliant browser-internal form validation. This is not really what you want to happen, as there are some important differences between JSONForm’s own validation procedure and the one provided internally by most modern browsers. For instance, JSONForm renders input elements with <code>type="number"</code> actually as text inputs but validates them as number inputs, whereas the HTML5 compliant browser-internal validation validates them as text input.</p>
<p>So, I still had to trigger the validation manually. This involved several steps. The problem is that there exists code for validation inside the JSONForm Javascript library, but unfortunately it is tightly bound to the <code>submit</code> button click. I did not want to change any code inside the JSONForm library just to expose the validation function to the outside world. There are however two other exposed functions that I could rely upon: <code>myFormEl.jsonFormValue()</code> and <code>myFormEl.jsonFormErrors(errors, options)</code>. The first method returns all entered form values; the second one highlights invalid form input elements.</p>
<p>Thus, in my site’s <code>save</code> function, I added the following code:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js"><span class="kd">var</span> <span class="nx">formEl</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="dl">'</span><span class="s1">#myForm</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">env</span> <span class="o">=</span> <span class="nx">JSONFormValidator</span><span class="p">.</span><span class="nx">createEnvironment</span><span class="p">(</span><span class="dl">"</span><span class="s2">json-schema-draft-03</span><span class="dl">"</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">schema</span> <span class="o">=</span> <span class="p">{</span> <span class="dl">'</span><span class="s1">properties</span><span class="dl">'</span><span class="p">:</span> <span class="nx">myJsonForm</span><span class="p">.</span><span class="nx">schema</span> <span class="p">};</span>
<span class="kd">var</span> <span class="nx">report</span> <span class="o">=</span> <span class="nx">env</span><span class="p">.</span><span class="nx">validate</span><span class="p">(</span><span class="nx">formEl</span><span class="p">.</span><span class="nx">jsonFormValue</span><span class="p">(),</span> <span class="nx">schema</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">options</span> <span class="o">=</span> <span class="p">{};</span>
<span class="nx">formEl</span><span class="p">.</span><span class="nx">jsonFormErrors</span><span class="p">(</span><span class="nx">report</span><span class="p">.</span><span class="nx">errors</span><span class="p">,</span> <span class="nx">options</span><span class="p">);</span></code></pre></figure>
<p>There’s quite a lot contained in this code snippet, so let’s break it down.</p>
<ol>
<li>
<p><code>var formEl = $('#myForm');</code><br />
This line uses JQuery to get the HTML form element with the id of <code>#myForm</code>. You could achieve this without JQuery (e.g. <code>document.getElementById('myForm')</code>), but since JSONForm requires JQuery anyway, why not use it?</p>
</li>
<li>
<p><code>var env = JSONFormValidator.createEnvironment("json-schema-draft-03");</code><br />
Here we create a new JSV validator environment. This assumes that the variable <code>JSONFormValidator</code> is globally accessible. Alternatively, <code>JSV.createEnvironment("json-schema-draft-03")</code> would achieve the same.</p>
</li>
<li>
<p><code>var schema = { 'properties': myJsonForm.schema };</code><br />
This one took me a long time to figure out. <code>myJsonForm.schema</code> refers to the JSONForm <code>schema</code> property you had to define for creating the HTML form. Be aware how I wrapped the schema inside another object as the value of a property called <code>properties</code>. This is important! If you simply use <code>myJsonForm.schema</code> directly, the validation will never indicate any errors.</p>
</li>
<li>
<p><code>var report = env.validate(formEl.jsonFormValue(), schema);</code><br />
This code actually triggers the validation. We retrieve the user input with <code>formEl.jsonFormValue()</code> and use it together with the schema as input to the validation function. The report will be an object containing an <code>errors</code> property; this is what we need for further processing. If there are no errors, then <code>report.errors.length</code> will be 0. If it has length > 0 then there are errors. I used this information to abort the saving procedure.</p>
</li>
<li>
<p><code>var options = {};
formEl.jsonFormErrors(report.errors, options);</code><br />
<code>options</code> is simply an empty object. In the current JSONForm version, it does not serve any purpose, but it is still needed as a function parameter. Calling <code>formEl.jsonFormErrors(...)</code> will highlight invalid input elements on the page and also show a short error message to the user.</p>
</li>
</ol>
<p>Some remarks. First, JSONForm’s logic for rendering certain specified input types to input elements is not always intuitive. For example, the schema specification <code>"foo": { "type": "number" }</code> results in an input element of type text: <code>&lt;input type="text"&gt;</code>. <a href="https://github.com/joshfire/jsonform/wiki#default-mapping">JSONForm on the default mapping</a>:</p>
<blockquote><ul>
<li>A <code>number</code> property generates a text input, i.e. an <code>&lt;input type="text"&gt;</code> element.</li>
<li>An <code>integer</code> property generates a text input as well, i.e. an <code>&lt;input type="text"&gt;</code> element.</li></ul></blockquote>
<p>What is even stranger is that specifying <code>type="number"</code> in the form section actually renders <code>&lt;input type="number"&gt;</code> elements.</p>
<p>Second, having an input field set to <code>readonly</code> resulted in validation errors if I had specified this inside the schema section. It worked though if specified inside the form section.</p>
<table>
<tr>
<th>Did not work</th>
<th>Worked</th>
</tr>
<tr>
<td>
<figure class="highlight"><pre><code class="language-js" data-lang="js">{
  "schema": {
    "foo": {
      "type": "string",
      "readonly": "readonly"
    }
  },
  "form": ["foo"],
  "onSubmitValid": function(values) {}
}</code></pre></figure>
</td>
<td>
<figure class="highlight"><pre><code class="language-js" data-lang="js">{
  "schema": {
    "foo": {
      "type": "string"
    }
  },
  "form": [{
    "key": "foo",
    "readonly": "readonly"
  }],
  "onSubmitValid": function(values) {}
}</code></pre></figure>
</td>
</tr>
</table>
<p>Third, when I was using <a href="https://github.com/joshfire/jsonform/wiki#fields-arrays">JSONForm’s arrays of objects</a>, setting an object property to <code>required</code> in the schema section resulted in validation errors. If however specified in the form section, this did not seem to have any effect whatsoever; that is, apparently no check was performed as to whether the required input fields were filled out or not.</p>
Evolving Trading Strategies With Genetic Programming - Data2014-09-22T00:00:00+02:00http://fabian-kostadinov.github.io/2014/09/22/evolving-trading-strategies-with-genetic-programming-data<h1 id="part-3">Part 3</h1>
<p>Genetic programming (GP) heavily relies on existing time series data. In this post I am going to look into different requirements and problems related to data.</p>
<p>First, we need to get the data from somewhere. There are different commercial or free data providers. Here is a list of free data providers.<span class="more"></span></p>
<ul>
<li><a href="http://finance.yahoo.com">Yahoo Finance</a>: Provides historical daily open/high/low/close/volume (OHLCV) quotes for many stocks as comma separated value (CSV) or Excel files. Data quality does not always reach a desirable level.</li>
<li><a href="http://www.finanzen.ch/">Finanzen.ch</a>: Provides stock data for download similar to Yahoo Finance.</li>
<li><a href="http://www.dukascopy.com/swiss/english/marketwatch/historical/">Dukascopy</a>: Free FX currency pairs data. For many FX pairs data is also available at a minute bar frequency.</li>
<li><a href="http://www.eoddata.com/">EODDATA</a> has options for both free and paid membership to obtain data in different formats for various trading platforms.</li>
</ul>
<p><small>(If you know other recommendable free data providers - especially for options data - let me know and I will add them to this list.)</small></p>
<p>Many technical indicators require OHLC(V) data. Should only close data be available, these indicators cannot be computed.</p>
<p>Once the data is obtained, it needs to be preprocessed first. Typical tasks include:</p>
<ul>
<li>Changing data formats, e.g. from Excel to CSV.</li>
<li>Handling missing data points, e.g. by looking them up at another data provider.</li>
<li>Checking for invalid data points, i.e. violations of <code>high<sub>t</sub> ≥ open<sub>t</sub>|close<sub>t</sub> ≥ low<sub>t</sub></code> (a small validity check is sketched right after this list).</li>
<li>Correcting for price drops of single stocks due to dividend payments. Not doing so will repeatedly and unjustifiably trigger sell rules during the training phase.</li>
<li>Handling stock splits. Yahoo Finance for instance sometimes provides an additional column <em>Adjusted Close</em>, which is adjusted for stock splits and dividends.</li>
<li>Handling weekends and holidays.</li>
<li>Mathematical transformations. In some cases it might make sense to logarithmize input data before using it.</li>
</ul>
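<p>As an example of such a preprocessing step, here is a minimal R sketch of the validity check from the list above, assuming a data frame <code>bars</code> with <code>open</code>, <code>high</code>, <code>low</code> and <code>close</code> columns (an assumed layout):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># a bar is invalid if its high is below open/close or its low is above them
invalid <- with(bars, high < pmax(open, close) | low > pmin(open, close))
bars[invalid, ]  # inspect the offending rows before correcting or dropping them</code></pre></figure>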
<p>Because stock indices are replicated by exchange traded funds (ETFs) or derivative products such as contracts for difference (CFDs), trading them actually implies trading the replica products. The replica products sometimes deviate from the stock index for one reason or another. Therefore, it is recommended to directly obtain historical data series of the replica product.</p>
<h2 id="noise-in-the-data">Noise in the data</h2>
<p>Theoretically, data obtained from different data providers should be the same for the same tradeable. This is however not necessarily the case, as this <a href="http://mechanicalforex.com/2014/09/machine-learning-in-forex-data-quality-broker-dependency-and-system-generation.html">blog entry by Daniel Fernandez</a> demonstrates clearly. When the same trading strategy with an hourly trading frequency is applied in the EUR/USD market to data from different providers, the achieved results differ significantly. Fernandez observes that</p>
<ol>
<li>the further back in time he goes, the larger the difference in achieved results between the data sets,</li>
<li>this seems to be less of a problem if the trading frequency is lower (e.g. on a daily basis).</li>
</ol>
<p>One of his suggestions is to deliberately introduce a certain level of random noise into the data, so that the trading strategy can only pick up the fundamental market movements rather than provider-specific artifacts. This is repeated many times to come to a conclusion on whether the strategy is still profitable or not.</p>
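<p>Such a perturbation is simple to implement. A hedged one-line sketch in R, where the 0.1% noise level and the <code>close</code> vector are purely assumptions:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># perturb each close by roughly 0.1% multiplicative Gaussian noise
noisy_close <- close * (1 + rnorm(length(close), mean = 0, sd = 0.001))</code></pre></figure>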
<p>Some authors also believe that a successful trading strategy must be profitable not only for one tradeable but for several similar ones. Also, Monte Carlo simulations can be helpful to determine a strategy’s historical performance.</p>
<h2 id="lookback-period">Lookback period</h2>
<p>Many technical indicators have a <em>lookback period</em>. The lookback period is the number of bars that must be available to calculate the indicator. For example, a simple moving average with a window size of 10 needs at least 10 bars of data before it can be computed for the first time. In GP it might be interesting to allow technical indicators calculated on derived time series, such as a moving average calculated on a “highest high of the last <em>n</em> bars” indicator. Because the “highest high of the last <em>n</em> bars” also requires a lookback period, the lookback periods compound: the full lookback period equals the sum of the two individual lookback periods minus one, as the following example shows.</p>
<p><em>Example:</em></p>
<ol>
<li>Indicator 1 is “highest high of last 5 bars”. Starting at bar 1, the first time the indicator is available is after close of bar 5.</li>
<li>Indicator 2 is “simple moving average of last 3 bars applied on previous indicator”. Starting at bar 5 the first time the indicator is available is after the close of bar 7.</li>
</ol>
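<p>The same example expressed as a rough R sketch, with both indicators implemented as naive rolling windows and <code>high</code> assumed to be a numeric vector of bar highs:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># naive rolling helpers; they return NA until the window is filled
roll_max <- function(x, n) sapply(seq_along(x), function(i) if (i < n) NA else max(x[(i - n + 1):i]))
roll_mean <- function(x, n) sapply(seq_along(x), function(i) if (i < n) NA else mean(x[(i - n + 1):i]))

hh5 <- roll_max(high, 5)   # first defined value at bar 5
ind <- roll_mean(hh5, 3)   # nesting pushes the first defined value to bar 7
which(!is.na(ind))[1]      # 7, the aggregated lookback period</code></pre></figure>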
<p>In case we allow long lookback periods (such as 150 days backwards for certain slow moving averages), the aggregated lookback periods can become very long. The final length also depends on the maximum allowed rule tree depth, because the deeper rules can be nested, the longer the aggregated lookback periods become. Should the lookback periods of two trading strategies be markedly different, the one with the shorter lookback period has an advantage over the other, as it has more opportunities for trading. (This might be desired though, because it favors trading strategies with shorter lookback periods.) A simple solution is to always use the maximum of all lookback periods as a common warm-up period before trading starts.</p>
<h2 id="survivorship-bias">Survivorship bias</h2>
<p>A complicated issue is the <a href="http://en.wikipedia.org/wiki/Survivorship_bias"><em>survivorship bias</em></a> inherent in financial stock data. To quote <a href="http://en.wikipedia.org/wiki/Survivorship_bias">Wikipedia</a>:</p>
<blockquote>In finance, survivorship bias is the tendency for failed companies to be excluded from performance studies because they no longer exist. It often causes the results of studies to skew higher because only companies which were successful enough to survive until the end of the period are included.</blockquote>
<p>Unfortunately, stock indices like the S&P 500 or the DAX are also not free from this bias. These indices are updated regularly, changing their constituents.<br />
At the same time some authors point out that there might also exist a <em>reverse survivorship bias</em> in the hedge fund world, where very successful hedge funds at some point close their doors to the public and stop reporting further success measures.<br />
Finally, to complicate things even more, mergers & acquisitions are common phenomena in most developed economies.</p>
<p>Survivorship bias is a complicated topic and difficult to account for properly. It is thus good to keep in mind that future performance might well be below historical performance.</p>
<h2 id="how-much-data-is-needed">How much data is needed?</h2>
<p>In statistics a general rule of thumb says that at least 10x more observations or data points are needed than variable model parameters. If less data is available, the whole model-building process is not really trustworthy.<br />
A concept closely related is the <em>degrees of freedom</em>. The degrees of freedom (df) is equal to the number of observations in the data set minus the number of model parameters that may vary independently:<br />
<code>df = number of data points - number of independent model parameters</code><br />
Example: In case you have 5000 data points and your model has 30 independent parameters, you are left with 4970 degrees of freedom. According to <a href="http://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)#Residuals">Wikipedia</a>:</p>
<blockquote>A common way to think of degrees of freedom is as the number of independent pieces of information available to estimate another piece of information.</blockquote>
<p>A higher number of data points increases, and a higher number of model parameters decreases the degrees of freedom. The degrees of freedom are sometimes used as an input for certain statistical tests, which can be useful for building fitness functions. However, estimating the exact number of independent model parameters for a single trading strategy in GP is often more an art than a science. A simple guideline is to count the number of nodes in all sub-trees per trading strategy and possibly give complicated nodes (e.g. technical indicator nodes) a higher weight.</p>
<p>A fundamental problem persists though. GP tests thousands of different trading strategies with dozens of rule tree nodes on the same data set. Assuming we restrict our evolutionary process to 10 generations with 100 individuals only, with an average individual having 15 nodes, this already results in 10 x 100 x 15 = 15’000 variable parameters being tested. Assuming we have in-sample data for 8 years of daily trading data with roughly 260 OHLCV bars per year, this results in 8 x 260 = 2080 data points, which is <em>much</em> less than the required minimum number of data points, no matter how we count nodes. We would actually need per-minute data to ensure a sufficient number of data points. But even if we had data at a per-minute frequency, we could not simply switch our trading frequency from daily to minute trading, as we would automatically enter the world of high-frequency trading, which might be quite different from a medium-term (daily) trading frequency. The only, still unsatisfactory, solution left is to restrict ourselves wherever we can, that is, restrict the number of generations, the population size and the average number of nodes per individual.</p>
<h2 id="in-sample-vs-out-of-sample-data">In-sample vs. out-of-sample data</h2>
<p>The obtained data is split into an in-sample (IS) period for training and an out-of-sample (OOS) control period. The IS period should contain a variety of different market situations, i.e. bullish, bearish and trendless markets. It should be the larger portion of the data. The only use of the OOS data is to ensure that there is no significant difference between the behavior of the trading strategy in the IS and the OOS data. Because IS and OOS data are both historical data, a new trading strategy must first be tested in a walk-forward setting on real-market data for a certain time period. Only if the strategy’s performance persists continuously should it be considered trustworthy.</p>
<h2 id="multiple-tests-bias">Multiple tests bias</h2>
<p>Rarely is the evolutionary process run only once. A far more common work procedure is to let it run once, look at the IS and OOS results, change some settings and run the evolutionary process again. This work procedure is inherently dangerous as it carries a <em>multiple tests bias</em>. The more often the GP process is run, the higher the chance that an apparently promising trading strategy is finally found both according to the IS and OOS performance. In statistics it is common to use confidence levels to express the degree of confidence: “This trading strategy is profitable at a 95% confidence level.” In other words, there is a 1 in 20 chance that this strategy only looks profitable but in actuality is not. Another interpretation of a confidence level of 95% is that out of 20 tested hypotheses 1 will mistakenly show up as valid, although it is not. Rerunning the GP procedure with adapted settings increases this chance continuously. To account for repeated, multiple tests the so called <a href="http://www.aaos.org/news/aaosnow/apr12/research7.asp"><em>Bonferroni adjustment</em> or <em>Bonferroni correction</em></a> has been proposed. The Bonferroni adjustment asks us to count the number of tests performed and divide the significance level by this number. For example, if the significance level is set at 5% (95% confidence) and we conduct 5 tests, then each individual test must pass at 5% / 5 = 1%. Be aware that “conducting a test” is actually not well defined in this context. If it means simply re-running the GP process, then this is still easy to compute. However, as many trading strategies are tested throughout each run it might be more natural to actually count every single trading strategy tested in all runs. Of course this would decrease the per-test significance level to such a low value that the whole data mining approach would be rendered useless. Authors like <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2308659">Bailey et al. (2014)</a> [1] are highly critical of backtested trading strategies, and their critique should not be taken lightly.</p>
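<p>In R the adjustment is a one-liner; <code>p.adjust</code> implements it directly. The p-values below are made up for illustration:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">alpha <- 0.05
m <- 5        # number of tests conducted
alpha / m     # per-test significance threshold: 0.01

# equivalently, inflate the raw p-values and compare against the original alpha
p_values <- c(0.004, 0.03, 0.2)
p.adjust(p_values, method = "bonferroni", n = m)  # 0.02 0.15 1.00</code></pre></figure>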
<h2 id="data-snoopinglookahead-bias">Data snooping/lookahead bias</h2>
<p><em>Data snooping</em> or the <em>lookahead bias</em> refers to using data during backtesting at a point in time where this data would actually not have been available. Data snooping can either be an effect of programming errors or of invalid reasoning. A typical lookahead bias would be to calculate a technical indicator based on OHLC data at bar <em>t</em> and then still allow a trade at the same bar. Either the trade must be made at the open of the next bar <em>t+1</em>, or the current bar’s high, low and close are not yet available to calculate the indicator.</p>
<h2 id="predictability-tests">Predictability tests</h2>
<p>Some authors dispute the predictability of many financial time series altogether, often referring to the efficient market hypothesis. Biondo et al. (2013) [2] compare different random trading strategies with others based on technical indicators for indices such as the FTSE, DAX and the S&P 500. Not only do they come to the conclusion that there is little use in technical trading rules, but also - based on a Hurst exponent analysis - that none of these time series is likely to be predictable at all.<br />
Other authors take a less critical stance. Chen and Navet (2007) [3] (who by the way have both published numerous papers on GP for trading strategy development) for instance believe that some markets might indeed be efficient and thus inherently unpredictable, but others might not. Furthermore, the same market might actually be efficient/unpredictable at some times and inefficient/predictable at others. They suggest using statistical pre-tests to examine the situation.</p>
<ul>
<li><strong>Equivalent Intensity Random Search</strong>: This is one test mentioned in Chen and Navet’s paper. The idea is to compare the evolutionary procedure to a random search of equal intensity. The null-hypothesis to be rejected is that the evolutionary process is not better than the random search. If the evolutionary procedure is able to find trading strategies significantly better than the best randomly found strategies, then we have to conclude that indeed something can be learned from the time series by applying GP, and therefore reject the null-hypothesis.</li>
<li><strong>Lottery Trading</strong>: Another test mentioned in Chen and Navet’s paper. Theoretically, the best GP evolved trading strategy should significantly beat a trading strategy using random buy and sell signals with a comparable trading intensity.</li>
<li><strong>Runs Test</strong>: A description of the <em>Runs Test</em> can be found in a book by Vanstone and Hahn (2010) [4] or in this <a href="https://onlinecourses.science.psu.edu/stat414/node/233">online course of Penn State University</a>. The basic idea is to measure whether a “movement” in one direction is followed by another movement in the same or in the opposite direction. The Runs Test thus is a measure of the persistence of a trend: the more persistent trends are, the easier it is to trade them. This principle can be applied equally to the distribution of winning and losing trades or to price movements. (A minimal code sketch follows after this list.)</li>
<li><strong>Serial Correlation Test</strong>: This test is described in a paper by Escanciano and Lobato (2008) [5]. The basic idea is not very different from the Runs Test. A time series following a random walk has no auto-correlation (that is, it has no correlation between one time step and the next). If in a time series a significant auto-correlation can be found, then it must be concluded that this time series does not follow a random walk. Fernandez, whom I mentioned further above, also has a <a href="http://mechanicalforex.com/2014/07/using-r-in-algorithmic-trading-testing-whether-an-instrument-follows-a-random-walk.html">short article on how to apply this test</a> on time series data using <a href="http://www.r-project.org/">R</a>.</li>
<li><strong>Hurst Exponent Test</strong>: This test relies on research conducted by Benoit Mandelbrot. Mandelbrot found that <a href="http://arxiv.org/abs/1110.5197">financial time series have a <em>memory effect</em></a>. The Hurst exponent tries to measure the strength of this memory effect. There are different methods for calculating the Hurst exponent. Some introductory material can be found in this short <a href="http://www.dukascopy.com/fxcomm/fx-article-contest/?The-Hurst-Exponent-Background-Methodologies&action=read&id=1950">article on Dukascopy</a> and in <a href="http://www.analytics-magazine.org/july-august-2012/624-the-hurst-exponent-predictability-of-time-series">another one on analytics-magazine.org</a>.</li>
</ul>
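<p>Here is the minimal sketch of the Runs Test announced above. It uses the usual normal approximation of the Wald-Wolfowitz test and is meant for illustration, not for production use:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js">// Minimal sketch of a runs test (Wald-Wolfowitz) on the signs of price
// changes. Under the null hypothesis of randomness the number of runs R
// is approximately normally distributed with the mean and variance below.
function runsTest(prices) {
  var signs = [];
  for (var i = 1; i < prices.length; i++) {
    if (prices[i] !== prices[i - 1]) {
      signs.push(prices[i] > prices[i - 1] ? 1 : -1);
    }
  }
  var n1 = 0, n2 = 0, runs = signs.length > 0 ? 1 : 0;
  for (var j = 0; j < signs.length; j++) {
    if (signs[j] === 1) { n1++; } else { n2++; }
    if (j > 0 && signs[j] !== signs[j - 1]) { runs++; }
  }
  var n = n1 + n2;
  var mean = (2 * n1 * n2) / n + 1;
  var variance = (2 * n1 * n2 * (2 * n1 * n2 - n)) / (n * n * (n - 1));
  var z = (runs - mean) / Math.sqrt(variance);
  // |z| > 1.96 rejects randomness at the 5% level. Significantly fewer
  // runs than expected indicate persistent trends, significantly more
  // indicate mean reversion.
  return { runs: runs, expectedRuns: mean, z: z };
}</code></pre></figure>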
<hr />
<h2 id="other-articles">Other articles</h2>
<h3 id="previous">Previous</h3>
<ul>
<li>Part 2: <a href="/2014/09/03/evolving-trading-strategies-with-genetic-programming-encoding-trading-strategies/">Evolving Trading Strategies With Genetic Programming - Encoding Trading Strategies</a></li>
</ul>
<h3 id="next">Next</h3>
<ul>
<li>Part 4: <a href="/2014/11/01/evolving-trading-strategies-with-genetic-programming-gp-parameters-and-operators/">Evolving Trading Strategies With Genetic Programming - GP Parameters And Operators</a></li>
</ul>
<hr />
<h1 id="references">References</h1>
<p>[1] Bailey D. H., Borwein J. M., Lopez de Prado M., Zhu Q. J. (2014): <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2308659">Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance</a>. Notices of the American Mathematical Society. No. 61(5). p. 458-471</p>
<p>[2] Biondo A. E., Pluchino A., Rapisarda A., Helbing D. (2013): <a href="http://arxiv.org/abs/1303.4351">Are Random Trading Strategies More Successful than Technical Ones?</a> Available at arXiv.org at <a href="http://arxiv.org/abs/1303.4351">http://arxiv.org/abs/1303.4351</a>.</p>
<p>[3] Chen S.-H., Navet N. (2007): <a href="http://nicolas.navet.eu/publi/SHC_NN_Springer2007.pdf">Failure of Genetic-Programming Induced Trading Strategies: Distinguishing between Efficient Markets and Inefficient Algorithms</a>. In: Chen S.-H., Wang P. P., Kuo T.-W. (editors; 2007): Computational Intelligence in Economics and Finance. Springer-Verlag Berlin. p. 169 - 182</p>
<p>[4] Vanstone B., Hahn T. (2010): <a href="http://www.harriman-house.com/book/view/150/trading/bruce-vanstone-and-tobias-hahn/designing-stock-market-trading-systems/">Designing Stock Market Trading Systems - With and Without Soft Computing</a>. Harriman House Ltd, Petersfield UK.</p>
<p>[5] Escanciano J. C., Lobato I. N. (2008): <a href="http://www.eea-esem.com/files/papers/EEA-ESEM/2008/210/joejan2008.pdf">An Automatic Portmanteau Test for Serial Correlation</a>. Journal of Econometrics. Vol. 151, No. 2. p. 140 - 149</p>
Wordcloud of G. Deleuze's and F. Guattari's "A Thousand Plateaus"2014-09-17T00:00:00+02:00http://fabian-kostadinov.github.io/2014/09/17/wordcloud-of-deleuzes-and-guattaris-a-thousand-plateaus<p>This is a wordcloud of G. Deleuze’s and F. Guattari’s <em>A Thousand Plateaus: Capitalism and Schizophrenia</em> created with <a href="http://timc.idv.tw/wordcloud/">this nice tool</a>. I removed all words with less than a hundred occurrences, some abbreviations and otherwise not very expressive words such as <em>also</em>, <em>thus</em>, <em>and</em> etc. This is where I borrowed the term <em>rhizome</em> from. Enjoy!<span class="more"></span></p>
<p><img src="/public/img/wordcloud-a-thousand-plateaus.png" alt="Wordcloud of A Thousand Plateaus" title="Wordcloud of A Thousand Plateaus" /></p>
<p>By the way, I am definitely not sure if I would recommend the book. Read through <a href="http://www.amazon.com/A-Thousand-Plateaus-Capitalism-Schizophrenia/dp/0816614024">Amazon’s customer reviews</a> section, you’ll get the taste.</p>
This Page Is Intentionally Left Blank2014-09-16T00:00:00+02:00http://fabian-kostadinov.github.io/2014/09/16/this-page-is-intentionally-left-blank<p>I am always subtly amused when opening a book and one of the first pages encountered states that <em>this page is intentionally left blank</em>. Because, of course, it isn’t. There’s a statement printed on it. The situation reminds me of a first-time meditator deliberately trying to empty his or her mind of all thoughts - because, that’s how meditation is supposed to work, isn’t it. At least according to the first-time meditator’s belief. Yet, the more we try to empty our mind, the more we notice how distracted we actually are.<span class="more"></span> Thoughts arise as a continuous stream, and there is little we can do against it. So, after a few minutes of trying hard, our first-time meditator finally gives up. What the meditator usually does not understand at this point is that the intention alone does not leave the mind blank; the intention alone is sufficient to actually create content. The closer we inspect our mind, the more we realize that intention and content in fact arise together. The situation with the supposedly empty book page is just as absurd. The very statement nullifies itself. This page is not left blank, and contrary to the meditator’s assumption our mind’s task is not to stay in a state of no-thought.</p>
<p>By the way, the true reason why sometimes the above statement is printed on otherwise empty pages is to assure the reader that no page is missing from a document such as a book or legal proceedings.</p>
Comparing Rhizomes to Other Technologies2014-09-16T00:00:00+02:00http://fabian-kostadinov.github.io/2014/09/16/comparing-rhizomes-to-other-technologies<p>At first glance, rhizomes may have a lot in common with existing technologies. Yet, when taking a closer look, there are important differences, and it is not possible to simply reduce a rhizome to one or another existing technology. In this post I will quickly compare rhizomes to a variety of different mathematical and computational concepts and data structures.<span class="more"></span></p>
<h2 id="binary-decision-diagrams">Binary Decision Diagrams</h2>
<p>Here is <a href="http://en.wikipedia.org/wiki/Binary_decision_diagram">Wikipedia’s definition of binary decision diagrams (BDD)</a>:</p>
<blockquote>
In computer science, a binary decision diagram (BDD) or branching program [...] is a data structure that is used to represent a Boolean function. On a more abstract level, BDDs can be considered as a compressed representation of sets of relations. Unlike other compressed representations, operations are performed directly on the compressed representation, i.e. without decompression.
</blockquote>
<p>A BDD is a variant of a binary tree. That is, in a BDD every node (except leaves) is connected to two children. Every node encodes a (sub-) function which returns either <code>1</code> (i.e. true) or <code>0</code> (i.e. false). Depending on the function output, the tree’s traversal is continued at the corresponding child until a leaf is reached. In this way, a path is chosen through the tree, and the traversal eventually yields a bit string. BDDs have been used extensively in CAD software and also for formal verification purposes. More recently, alternative usages have also been proposed. (There’s an <a href="http://myvideos.stanford.edu/player/slplayer.aspx?coll=ea60314a-53b3-4be2-8552-dcf190ca0c0b&co=18bcd3a8-965a-4a63-a516-a1ad74af1119&o=true">online lecture of Donald Knuth - the man himself! - on BDDs at Stanford University</a>.)</p>
<p>Rhizomes are structurally similar to BDDs. The most important difference though is that BDDs are basically graphs whereas rhizomes are not. For example in BDDs it is not possible (and does not make much sense…) to connect a node to an edge or two edges with each other due to the underlying node/edge dichotomy. Rhizomes do not know such restrictions, because the only existing building block are relations. This means that every BDD can be implemented as a rhizome, but not every rhizome can be implemented as a BDD.</p>
<h2 id="set-theory">Set theory</h2>
<p>Set theory has been very fundamental to the design of (so called) relational databases. Set theory prescribes a hierarchy of sets and elements. The <em>inclusion</em> operator ∈ defines that an element <em>e</em> belongs to or is included in or is an element of a set <em>S</em>. The reverse is usually not allowed: <em>S</em> ∈ <em>e</em> is false or invalid in most real-world applications. For example, in a relational database a row cannot relate to its own table. (We could add the table name as an attribute to a row, but this would not be safe, because the table could be renamed without the row taking notice.)</p>
<p>Again, in rhizomes there are no such restrictions. As was <a href="http://www.google.com/patents/US20060155755">stated in a patent (later on withdrawn) by Erez Elul/Pile Systems</a>, inclusion is only a special form of connection. Every inclusion is a form of connection, but not every connection is an inclusion. Therefore rhizomes operate on more powerful concepts than set theory.</p>
<h2 id="object-oriented-programming">Object-oriented programming</h2>
<p>Similar arguments as for set theory also apply to object-oriented programming (OOP). To give an example: Inheritance is only one form of connection, but connection is a broader concept than inheritance. The same can be said about specification, usage and other types of relations between objects. Rhizomes allow all of these relations and arbitrary others besides; OOP allows only the predefined kinds.</p>
<h2 id="lispclojure">Lisp/Clojure</h2>
<p>Lisp and Clojure are sometimes said to be <em>homoiconic</em>. <a href="http://en.wikipedia.org/wiki/Homoiconicity">Homoiconicity</a> refers to the fact that in these programming languages all data structures can be either interpreted as lists or as algorithms because the language’s syntax for both is the same. For example, the statement <code>(+ a 3)</code> can indicate a list of three elements as well as the mathematical expression “apply the + operator on the variable a and the integer 3”. Rhizomes are also homoiconic, because every relation <em>r<sub>k</sub> <= (r<sub>i</sub>, r<sub>j</sub>)</em> both expresses a relational fact (data) as well as an algorithm that either reduces the relation to its z-pairing value <em>r<sub>k</sub></em> or expands the relata <em>r<sub>i</sub></em> and <em>r<sub>j</sub></em> further to ordered pairs. Of course, Lisp and Clojure are both complete programming languages, whereas rhizomes are - at least currently - not.</p>
<p>I am not an expert on this, but at least Clojure treats lists internally as immutable data structures. If we remove an element and add a new one, effectively a new list is created, derived from the original one. This implies that elements in a list are never really replaced or deleted. If a list with other elements is produced at runtime, this is in fact “a different” list. This is comparable to rhizomes, where deletions are unnecessary. If a relatum <em>r<sub>m</sub></em> is paired with another relatum <em>r<sub>n</sub></em>, and then it is re-paired with <em>r<sub>o</sub></em>, then <em>(r<sub>m</sub>, r<sub>n</sub>)</em> and <em>(r<sub>m</sub>, r<sub>o</sub>)</em> are in fact different relations. Their corresponding z-pairing values are not the same.</p>
Storing HTML in a Rhizome2014-09-15T00:00:00+02:00http://fabian-kostadinov.github.io/2014/09/15/storing-html-in-a-rhizome<p>This post demonstrates how it is possible to use rhizomes to store simple HTML.</p>
<p>Consider the following HTML.</p>
<figure class="highlight"><pre><code class="language-html" data-lang="html"><span class="nt"><html></span>
<span class="nt"><head></head></span>
<span class="nt"><body></body></span>
<span class="nt"></html></span></code></pre></figure>
<p>How could we store this in a rhizome? First of all, it would make sense to treat every HTML tag as an atomic symbol. There are three such symbols in the sample: <em>html</em>, <em>head</em> and <em>body</em>.<span class="more"></span> Let us assume that, unless qualified otherwise, the direction of a relation indicates a parent-child relationship. Thus, <em>(x, y)</em> must be read as “x <em>is parent of</em> y”. The <em>html</em> tag has two children and they are ordered. <em>html</em> is both parent of <em>head</em> and of <em>body</em>, but <em>head</em> is the first child and <em>body</em> is the second. How could we express this fact? At this time, it is useful to introduce <em>qualifiers</em>.</p>
<p><strong>Definition:</strong></p>
<div class="message">A <em>qualifier</em> <em>q</em> is a relation which, when paired with another relation <em>r<sub>i</sub></em>, indicates how that relation <em>r<sub>i</sub></em> should be processed.</div>
<p>We will use square brackets <code>[]</code> to denote a relation as a qualifier.</p>
<p>In the next table terminal relations were assigned to each HTML tag. Additionally, two qualifiers were introduced, one for child relations and a second one for sibling relations.</p>
<table>
<tr>
<th>Symbol</th>
<th>Terminal Relation</th>
</tr>
<tr>
<td>[child]</td>
<td>(0, 0)</td>
</tr>
<tr>
<td>[sibling]</td>
<td>(1, 1)</td>
</tr>
<tr>
<td>html</td>
<td>(2, 2)</td>
</tr>
<tr>
<td>head</td>
<td>(3, 3)</td>
</tr>
<tr>
<td>body</td>
<td>(4, 4)</td>
</tr>
</table>
<p>It is now possible to express the HTML above as two different, subsequent relations:</p>
<ol>
<li><code>((html, [child]), head) = (((2, 2), (0, 0)), (3, 3))</code></li>
<li><code>((html, [child]), body) = (((2, 2), (0, 0)), (4, 4))</code></li>
</ol>
<p>Or expressed tree-like:</p>
<table>
<tr>
<td>
<figure class="highlight"><pre><code class="language-js" data-lang="js"><span class="p">(</span>
<span class="p">(</span>
<span class="nx">html</span><span class="p">,</span>
<span class="p">[</span><span class="nx">child</span><span class="p">]</span>
<span class="p">),</span>
<span class="nx">head</span>
<span class="p">)</span>
<span class="p">(</span>
<span class="p">(</span>
<span class="nx">html</span><span class="p">,</span>
<span class="p">[</span><span class="nx">child</span><span class="p">]</span>
<span class="p">),</span>
<span class="nx">body</span>
<span class="p">)</span></code></pre></figure>
</td>
<td>
<figure class="highlight"><pre><code class="language-js" data-lang="js"><span class="p">(</span>
<span class="p">(</span>
<span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
<span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="p">),</span>
<span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="p">)</span>
<span class="p">(</span>
<span class="p">(</span>
<span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
<span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="p">),</span>
<span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
<span class="p">)</span></code></pre></figure>
</td>
</tr>
</table>
<p>Expressed graphically:</p>
<p><img src="/public/img/html-stored-as-rhizome-1.jpg" alt="HTML Stored as Rhizome Disjunct" title="HTML Stored as Rhizome Disjunct" /></p>
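<p>As a cross-check, the two relations above can already be reduced to single z-pairing values. A minimal sketch, using the Szudzik pairing function described in my earlier post on implementing rhizomes:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js">// Minimal sketch: reduce the two relations above to single z-pairing
// values with the Szudzik pairing function.
function pair(x, y) {
  return x >= y ? x * x + x + y : y * y + x;
}

var CHILD = pair(0, 0); // [child] = (0, 0) -> 0
var HTML  = pair(2, 2); // html    = (2, 2) -> 8
var HEAD  = pair(3, 3); // head    = (3, 3) -> 15
var BODY  = pair(4, 4); // body    = (4, 4) -> 24

var htmlChild = pair(HTML, CHILD); // (html, [child])         -> 72
pair(htmlChild, HEAD);             // ((html, [child]), head) -> 5271
pair(htmlChild, BODY);             // ((html, [child]), body) -> 5280</code></pre></figure>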
<p>Note that we did not add a comma <code>,</code> between the head and the body relation. The two sibling tags <em>head</em> and <em>body</em> are not directly related to each other. Adding a [sibling] qualifier, we pair the two relations and end up with:</p>
<ol>
<li><code>((((html, [child]), head), [sibling]), ((html, [child]), body)) = (((((2, 2), (0, 0)), (3, 3)), (1, 1)), (((2, 2), (0, 0)), (4, 4)))</code></li>
</ol>
<p>Or expressed as a tree:</p>
<table>
<tr>
<td>
<figure class="highlight"><pre><code class="language-js" data-lang="js"><span class="p">(</span>
<span class="p">(</span>
<span class="p">(</span>
<span class="p">(</span>
<span class="nx">html</span><span class="p">,</span>
<span class="p">[</span><span class="nx">child</span><span class="p">]</span>
<span class="p">),</span>
<span class="nx">head</span>
<span class="p">),</span>
<span class="p">[</span><span class="nx">sibling</span><span class="p">]</span>
<span class="p">),</span>
<span class="p">(</span>
<span class="p">(</span>
<span class="nx">html</span><span class="p">,</span>
<span class="p">[</span><span class="nx">child</span><span class="p">]</span>
<span class="p">),</span>
<span class="nx">body</span>
<span class="p">)</span>
<span class="p">)</span></code></pre></figure>
</td>
<td>
<figure class="highlight"><pre><code class="language-js" data-lang="js"><span class="p">(</span>
<span class="p">(</span>
<span class="p">(</span>
<span class="p">(</span>
<span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
<span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="p">),</span>
<span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="p">),</span>
<span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="p">),</span>
<span class="p">(</span>
<span class="p">(</span>
<span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
<span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="p">),</span>
<span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
<span class="p">)</span>
<span class="p">)</span></code></pre></figure>
</td>
</tr>
</table>
<p>This is a graphical representation of the same tree.</p>
<p><img src="/public/img/html-stored-as-rhizome-2.jpg" alt="HTML Stored as Rhizome Conjunct" title="HTML Stored as Rhizome Conjunct" /></p>
<p>Using only the simple rules above, we can now go on adding tags. Each tag is either a child or a sibling in relation to another one.</p>
<p>Imagine that two clients store the same atomic symbol table and use the same encoding algorithm. It is now possible to send a single (possibly very long) integer number over a network, and the receiver can fully re-compute the complete HTML tree. Of course, a possibility for actually storing content is still missing. For this purpose, we could introduce a further qualifier [value]. Once the atomic symbol table covers all valid HTML tags, we start adding one numbered variable for each piece of plain textual content. A relation <code>(s<sub>i</sub>, [value])</code> would indicate textual content stored in variable <em>s<sub>i</sub></em>. Of course it would be necessary to also submit the plain textual content of every variable over the network. The result would look something like this:</p>
<figure class="highlight"><pre><code class="language-html" data-lang="html"><span class="nt"><html></span>
<span class="nt"><head></span>
<span class="nt"><title></span>Hello World!<span class="nt"></title></span>
<span class="nt"></head></span>
<span class="nt"><body></span>
<span class="nt"><h1></span>My First Heading<span class="nt"></h1></span>
<span class="nt"></body></span>
<span class="nt"></html></span></code></pre></figure>
<table>
<tr>
<th>Symbol</th>
<th>Terminal Relation</th>
</tr>
<tr>
<td>[child]</td>
<td>(0, 0)</td>
</tr>
<tr>
<td>[sibling]</td>
<td>(1, 1)</td>
</tr>
<tr>
<td>html</td>
<td>(2, 2)</td>
</tr>
<tr>
<td>head</td>
<td>(3, 3)</td>
</tr>
<tr>
<td>body</td>
<td>(4, 4)</td>
</tr>
<tr>
<td>title</td>
<td>(5, 5)</td>
</tr>
<tr>
<td>h1</td>
<td>(6, 6)</td>
</tr>
<tr>
<td>[value]</td>
<td>(7, 7)</td>
</tr>
<tr>
<td>s<sub>1</sub></td>
<td>(8, 8)</td>
</tr>
<tr>
<td>s<sub>2</sub></td>
<td>(9, 9)</td>
</tr>
</table>
<p>with <code>s<sub>1</sub> = "Hello World!", s<sub>2</sub> = "My First Heading"</code>.</p>
<p>The same procedure would be applicable to tag attributes. First, introduce a new qualifier [attribute]. Then, add all valid HTML attributes (such as id, name, href etc.) to the atomic symbol table. (Be aware that this does not prohibit us from creating meaningless combinations of tags and attributes such as <code><table href=""></code>.) The textual content variables are then stored after the tag attributes in the atomic symbol table.</p>
Searching in Rhizomes2014-09-15T00:00:00+02:00http://fabian-kostadinov.github.io/2014/09/15/searching-in-rhizomes<p>In this post I will look into how it is possible to search in rhizomes. I will assume that the rhizome implementation relies on pairing functions to store relations. I described the basic necessary data structures <a href="/2014/09/09/implementation-of-rhizomes/">in a previous post</a>. The term “searching in a rhizome” is not defined precisely. Searching implies that some contextual order is given and that the search is conducted in relation to this order. For rhizomes (similar to graphs) it is not entirely clear what constitutes this contextual order; the definition of contextual order might depend upon one’s situation.<span class="more"></span></p>
<p>Let us begin with a simple example and gradually continue to more complicated ones. First, let us assume we want to know whether a relation between the atomic symbols <em>A</em> and <em>B</em> exists. We can create the two relations <code>(A, B)</code> and <code>(B, A)</code>, look up the terminal relations for A and B and reduce them to their final z-pairing values.</p>
<ol>
<li><code>r<sub>(A, B)</sub> <= ((0, 0), (1, 1)) = (0, 3) = 9</code></li>
<li><code>r<sub>(B, A)</sub> <= ((1, 1), (0, 0)) = (3, 0) = 12</code></li>
</ol>
<p>Next, we simply look up the z-pairing values <code>9</code> and <code>12</code> in the hashmap of pairings. If such a value exists, then the corresponding relation has been stored. One should note that the search can easily be parallelized.</p>
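<p>A minimal sketch of this lookup; a plain set of stored z-pairing values stands in for the hashmap of pairings:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js">// Minimal sketch: check whether the relations (A, B) and (B, A) have
// been established, using the Szudzik pairing function.
function pair(x, y) {
  return x >= y ? x * x + x + y : y * y + x;
}

var A = pair(0, 0); // terminal (0, 0) -> 0
var B = pair(1, 1); // terminal (1, 1) -> 3
var stored = new Set([9]); // assume only (A, B) was ever stored

stored.has(pair(A, B)); // pair(0, 3) = 9  -> true:  (A, B) exists
stored.has(pair(B, A)); // pair(3, 0) = 12 -> false: (B, A) does not</code></pre></figure>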
<p>Another example. This time we will look for a string of three atomic symbols, <em>A</em>, <em>B</em> and <em>C</em>. For the sake of the example, let us assume we know the ordering of the symbols from left to right (<em>A</em> followed by <em>B</em> followed by <em>C</em>), but not their nesting. In this situation, two different possible pairings exist: <code>((A, B), C)</code> and <code>(A, (B, C))</code>. We compute all possible nestings, replace the atomic symbols with terminal relations, reduce them to their z-pairing values and look them up in the hashmap. We can then determine which nestings actually exist. If none does then the ordered string <em>ABC</em> has not been stored. If we were only interested in the question whether at least one nesting exists, without caring which one it is, then we could immediately stop computation after finding the first one. Again, the whole process can be parallelized easily.</p>
<p>The situation becomes more complicated if we cannot guarantee the order between <em>A</em>, <em>B</em> and <em>C</em>. Basically, we need to compute all possible orders and nestings and then look them up individually. These are all possible orderings and nestings:</p>
<ul>
<li><code>((A, B), C)</code>, <code>(A, (B, C))</code>,</li>
<li><code>((A, C), B)</code>, <code>(A, (C, B))</code>,</li>
<li><code>((B, A), C)</code>, <code>(B, (A, C))</code>,</li>
<li><code>((B, C), A)</code>, <code>(B, (C, A))</code>,</li>
<li><code>((C, A), B)</code>, <code>(C, (A, B))</code>,</li>
<li><code>((C, B), A)</code>, <code>(C, (B, A))</code></li>
</ul>
<p>Fortunately, it is possible to memoize intermediate results. Some relations are shared, for example <code>(A, B)</code> is part of both <code>((A, B), C)</code> and <code>(C, (A, B))</code>. If <code>(A, B)</code> does not exist, we can immediately stop computation for both cases and conclude that neither of these two more complex relations exists. Only in the worst case, when none of the listed relations exists, do we need to compute all z-pairing values.</p>
<p>What if we do not only want to know <em>if</em> a relation actually exists but also <em>where</em> it is stored? As I mentioned further above, as long as no clearly defined contextual order is given, this is impossible to answer. A rhizome has no inherent order or indexing scheme. The term “where” is not defined. What is possible to answer though is questions of the type: <em>Does the relation represented by z-pairing value <code>z</code> contain the relation <code>(A, B)</code>?</em> We simply need to expand the relator <code>r<sub>z</sub></code> to its relata and continue down the rhizome tree recursively. If the relation <code>(A, B)</code> is found anywhere in the chain, the answer is yes.</p>
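<p>A minimal sketch of this recursive expansion, again assuming the Szudzik pairing function and the example terminals used throughout these posts:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js">// Minimal sketch of the recursive containment check: expand a relator
// into its relata and search for the target relation's z-pairing value.
function pair(x, y) {
  return x >= y ? x * x + x + y : y * y + x;
}
function unpair(z) {
  var q = Math.floor(Math.sqrt(z));
  var r = z - q * q;
  return r < q ? [r, q] : [q, r - q];
}
function contains(z, target) {
  if (z === target) { return true; }
  var xy = unpair(z);
  if (xy[0] === xy[1]) { return false; } // terminal relation, stop here
  return contains(xy[0], target) || contains(xy[1], target);
}

// Does (A, (B, C)) contain (B, C)? With A = (0,0), B = (1,1), C = (2,2):
var bc  = pair(pair(1, 1), pair(2, 2)); // (3, 8)  -> 67
var abc = pair(pair(0, 0), bc);         // (0, 67) -> 4489
contains(abc, bc);                      // true</code></pre></figure>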
<hr />
<h2 id="see-also">See also:</h2>
<ul>
<li><a href="/2014/09/09/implementation-of-rhizomes/">Implementation of Rhizomes</a></li>
<li><a href="/2014/09/08/a-short-introducton-to-rhizomes/">A Short Introduction to Rhizomes</a></li>
</ul>
Pile or Rhizome? A Short Pamphlet.2014-09-15T00:00:00+02:00http://fabian-kostadinov.github.io/2014/09/15/pile-or-rhizome-a-short-pamphlet<p><a href="http://namzezam.wikidot.com/">Erez Elul</a>, whose honor it is to have “discovered” or “invented” the data structures I deliberately continue to call <em>rhizomes</em>, originally named his invention <em>pile</em> (or <em>pile system</em>). The relatively few authors (for instance <a href="http://en.wikipedia.org/wiki/Peter_Krieg">Peter Krieg</a>, <a href="http://blog.ralfw.de/">Ralf Westphal</a>, <a href="http://ralfbarkow.wordpress.com/">Ralf Barkow</a>, Miriam Bedoni as well as others besides them), who commented and contributed on the invention, accepted this naming. I must say I never found the term particularly apt, for several reasons.<span class="more"></span></p>
<ul>
<li>In my opinion the most significant characteristic of piles/rhizomes is universal connectivity. Everything can be connected to everything else. Quite differently, Erez Elul’s reasoning apparently focused much more on the fact that relations are “piled up onto each other” - hence the choice of term. In my eyes, this is however not the distinguishing criterion. For example, also in binary trees nodes are “piled” onto each other in a sense. Yet, unlike piles/rhizomes, binary trees follow the traditional node/edge dichotomy. They lack the feature of universal connectivity.</li>
<li>The term <em>pile</em> is quite close to <em>heap</em>, and both could be confused easily. Runtime environments for programming languages such as the <em>Java Virtual Machine</em> (JVM) internally use a data structure called <em>heap</em> to hold objects in memory.</li>
<li>When I first by chance stumbled upon the term <em>rhizome</em> in Gilles Deleuze’s and Félix Guattari’s thinking, which they had borrowed from <a href="http://en.wikipedia.org/wiki/Rhizome">botanics</a>, I immediately found the term very fitting.</li>
</ul>
<blockquote>
Deleuze and Guattari use the term rhizome throughout their work, especially in their discussion of thought in "<a href="http://www.amazon.com/gp/product/0816614024">A Thousand Plateaus</a>." They argue that traditional thought is tree-like, in that it follows a linear pattern, branching off at various points. Rhizomes, taken from a kind of <a href="http://en.wikipedia.org/wiki/Rhizome">root system</a> found in nature, are non-linear, and non-hierarchical.
</blockquote>
<p>(<a href="http://www.critical-theory.com/rhizome/">Source</a>)<br />
Rhizomes are thus a philosophically well described concept, and a corresponding theory in computer science could possibly profit from earlier work done by these post-structuralists.</p>
<p>Of course there is no ultimate truth in either name choice, and there are good reasons to stick to the customary term <em>pile</em>.</p>
Stock Market Informational Websites I Like2014-09-12T00:00:00+02:00http://fabian-kostadinov.github.io/2014/09/12/stock-market-informational-websites-i-like<p>This is a non-exhaustive list of websites on stock market investing (and trading) I like for various reasons.<span class="more"></span></p>
<h2 id="yahoo-finance">Yahoo Finance</h2>
<p>Link: <a href="http://finance.yahoo.com/">http://finance.yahoo.com/</a><br />
Everyone knows this one. Still one of the best websites on stocks on the net. I particularly enjoy the news section on individual stocks. The message board can be informative sometimes, too. Same thing with the stock’s key statistics. There even is information on stock options available. Oh, and did I mention the historical data download? One of my most favorite features above all.<br />
What is probably less known is that Yahoo Finance also has a section with multiples by sectors and industries available at <a href="http://biz.yahoo.com/p/s_conameu.html">http://biz.yahoo.com/p/s_conameu.html</a>. This is really helpful to compare multiples like P/E, P/B or ROE of a single company to the rest of the sector or industry.<br />
One drawback is that data quality at Yahoo Finance is not always the best. So you have to be careful with the information you receive.</p>
<h2 id="google-finance">Google Finance</h2>
<p>Link: <a href="http://www.google.com/finance">http://www.google.com/finance/</a><br />
Not much to say here either. It’s a classic. There’s a stock screener, which is nice but does not keep me awake at night.</p>
<h2 id="seekingalphacom">SeekingAlpha.com</h2>
<p>Link: <a href="http://seekingalpha.com/">http://seekingalpha.com/</a><br />
Currently my favorite site on the net when it comes to stock investing. Even if you (like me…) don’t make it to the selected inner circle of <em>PRO members</em> there is still plenty of high quality articles with more investment ideas than you could ever care for. The best ones can be found in the <em>Top Ideas</em> section. A big plus is that you can comment on the articles and engage in a discussion with the authors. Free basic membership.</p>
<h2 id="4-traderscom">4-traders.com</h2>
<p>Link: <a href="http://www.4-traders.com/">http://www.4-traders.com/</a><br />
One particularly nice feature of this site is that it displays analyst estimates and revisions graphically over the last few months. Enter a stock short code and then click on <em>Consensus</em> or <em>Revisions</em>. On your right-hand side there are some boxes containing the current stock price distance to the analysts’ consensus targets. The rest of the site is not much different from <a href="http://www.reuters.com/">Reuters</a> or <a href="http://www.bloomberg.com/">Bloomberg</a>.</p>
<h2 id="finvizcom">Finviz.com</h2>
<p>Link: <a href="http://www.finviz.com/">http://www.finviz.com/</a><br />
Finviz’s lean design is just to my taste. The front page for every stock features the most important fundamentals on a single page. In contrast to many other stock market sites the website is not cluttered with blinking banners and ads. There is also a stock screener with similar capabilities as the one provided by Google Finance.<br />
When it comes to doing research on sectors and industries, Finviz beats even Yahoo Finance. You can adjust the stock screener to display only a certain sector and industry and then screen through all different sorts of multiples for all the stocks displayed.</p>
<h2 id="stock-spinoffs">Stock Spinoffs</h2>
<p>Link: <a href="http://www.stockspinoffs.com/">http://www.stockspinoffs.com/</a><br />
Want to invest in spinoffs? Then this is your site. Although the website does not cover all US spinoffs, there are regular updates about many different ones with the most important information provided to you for free. Once you like an idea, you can still start doing your own research on the upcoming spinoff, but the site is definitely a great starting point.</p>
<h2 id="spinoff-monitor">Spinoff Monitor</h2>
<p>Link: <a href="http://spinoffmonitor.com/">http://spinoffmonitor.com/</a><br />
A neat site if you are interested in special situation stocks including restructurings, spinoffs and bankruptcy. All this website does is list different stocks currently in a special situation. You can subscribe with your email to get a weekly digest. A great place to get a first idea, which you can then research further.</p>
<h2 id="investopedia">Investopedia</h2>
<p>Link: <a href="http://www.investopedia.com/">http://www.investopedia.com/</a><br />
The best website on the internet when it comes to educating yourself in stock investing and economics. Lots of articles on ratios, fundamentals, technical indicators, micro- and macroeconomic topics and so on.</p>
<h2 id="stockchartscom">StockCharts.com</h2>
<p>Link: <a href="http://stockcharts.com/school/doku.php?id=chart_school">http://stockcharts.com/school/doku.php?id=chart_school</a><br />
Very informative website on technical indicators. You can find them all: moving averages, MACD, stochastic oscillators, directional movement index and many more.</p>
<h2 id="what-i-miss">What I miss</h2>
<p>There are also a few things I miss on the internet.</p>
<ul>
<li>A free-to-use stock screener which actually can combine several criteria with boolean operators <em>and</em> can compare not only ratios to values but ratios to ratios. Example: “Find me all European stocks with P/E < 10 OR dividends per share > 10% of EPS.”</li>
<li>It is my impression that high quality information on US stocks is much easier to find than on European stocks. Maybe I am looking at the wrong websites or don’t understand the language. (I have not much experience with investing in stocks outside North-America and Europe.)</li>
<li>Free historical options data per stock. I’m sure there are many other people out there who miss such a feature.</li>
</ul>
Applications of Rhizomes2014-09-10T00:00:00+02:00http://fabian-kostadinov.github.io/2014/09/10/applications-of-rhizomes<p>The intention of this post is to shed a little light on possible applications of rhizomes. I am personally still struggling to figure out what the best use cases for rhizomes might be, but here is a list including some ideas nevertheless. The basic problem I am confronted with is my own tendency to interpret rhizomes in terms of existing data structures and algorithms instead of focusing on what is genuinely new. I have recently come to the (maybe preliminary) conclusion that rhizomes are at their best when it comes to <em>post-structuralist</em> applications. That is, applications where, up to now, a certain data structure has been silently accepted as the best way to do things.<span class="more"></span> I call rhizomes post-structuralist because they violate the common paradigm of storing data on the one hand and linking it on the other. Rhizomes, compared to graphs, do not consist of nodes and edges, but only of relations. The main characteristic of rhizomes then, in my eyes, is that they allow unrestricted connectivity. Everywhere. There is no element anywhere that cannot be connected to any other because of some underlying paradigm. Therefore, there is no separation between data and structure but only “order” in the most abstract and fundamental way.</p>
<h2 id="hierarchical-file-systems">Hierarchical file systems</h2>
<p>My laptop’s file system is organized as a hierarchical tree. The file system contains two basic elements, directories (folders) and files. Each directory and file has a unique identity in the whole file system, defined through its <em>path</em>. A directory can contain any number of other directories and files, whereas a file cannot contain other directories or files. (This is not entirely true, as sometimes modern files such as packed .zip files are internally organized similar to a file system themselves.) Some file systems additionally allow the creation of symbolic links to alleviate the restrictions of the file system’s hierarchy a bit. Yet, these links can break if the link target is removed without updating or deleting the symbolic link.</p>
<p>One of the most severe restrictions is that a file cannot be placed inside two different directories at the same time. The only way around this is to create a copy of the file, and therefore redundancy. Imagine the following situation. A user organizes her blog around categories “food” and “animals”. For this purpose, she has created two directories “food” and “animals”. One day she has the idea to write about her dog’s favorite food. The only question is where to place the corresponding post file. Her computer’s file system prevents her from placing it in both directories, but that’s where it would actually belong.</p>
<p>In a rhizome, the file system’s hierarchy can be broken up. Files can be placed in several directories; it would even be possible to place two directories into each other, or a directory into itself (perhaps with some modifications to the basic algorithm described in <a href="/2014/09/09/implementation-of-rhizomes/">my last post</a>).</p>
<h2 id="ontologies">Ontologies</h2>
<p>Ontologies are another example where an implied structure can sometimes be painful. Most ontologies demand that a thing either belongs in one or another category, but not in two different ones at the same time. I still remember a biology class in high school, where under the teacher’s guidance we tried to figure out whether a virus should rightfully be categorized as a mere molecule or a living being. Most computational ontologies demand that we make an exclusive decision, whereas in most cases our perception of the world is sensitive to contexts and to interpretation according to circumstances. Whether a virus is considered to be a molecule or a living being, or whether a person must correctly be categorized as a father, lover, brother, chess player or firefighter is actually dependent on our concerns and on the questions we ask. Rhizomes have no problem in putting a thing into several categories at the same time. With certain modifications the concept might be expanded to support contexts. Inside a given context, contradictory categorizations are not allowed. But a meta-algorithm - running on the same relations! - might be allowed to return contradictory search results and let the user decide which of the possible categorizations she prefers. The crucial point is that a rhizome would not require the creation of two different ontologies, but only to contextualize the stored relations.</p>
<p>Furthermore, in computational ontologies it is common to have a certain meta-architecture, which is not part of the ontology itself, for example data structures such as subject-predicate-object triples. In this triple the predicate <em>is</em> the relation between subject and object, but the relation between the subject and the predicate or the object and the predicate has no identity. Thus, unlike in traditional ontologies, a rhizome does not prohibit a developer to address these subject-predicate or object-predicate relations, or even predicate-predicate relations of two different triples and so on.</p>
<h2 id="object-oriented-programming">Object-oriented programming</h2>
<p>Similar thinking as for ontologies can be applied to object graphs in OOP languages. Whereas Java allows only single-parent inheritance, C++ allows multi-parent inheritance with all its dangers. Some languages like C++ allow pointers to be stored in variables and manipulated, whereas others like Java do not offer such functionality. Yet also in C++ a pointer is rather a “second class citizen” besides full-blown objects. It cannot call a function, it cannot inherit from other objects and so on. The underlying paradigm is still that of nodes and edges. A rhizome, however, has no preference among its elements; relations are all there is.</p>
<p>Also, most modern OOP languages introduced <em>reflection</em> as a feature. Although in practice rarely used, in some cases reflection leads to highly elegant solutions, for example when creating a debugger application. Reflection offers a way to dynamically establish connections, for instance calling a method not through a direct pointer but by its name. Rhizomes could take this approach even further, as they would allow literally any element to be related to any other at runtime. It might be interesting, for example, to entirely switch an object’s class (or its methods or instance variables) at runtime.</p>
<p>An example. Assume that you wrote a class with a <code>logException</code> method. The objects of this class could be serialized and sent back and forth between the client and the server. On the server, the method should silently log exceptions to a log file, whereas on the client a popup should appear to the user. How would you implement this? Probably you would use an OOP <a href="http://en.wikipedia.org/wiki/Strategy_pattern">strategy pattern</a>. The purpose of this pattern is to decouple an object from its method so that a certain behavior can be selected at runtime. The problem with this pattern is that something that is inherently a <em>behavior</em> (i.e. logging) is modeled as an <em>object</em> (<code>class ClientLoggingBehavior</code> and <code>class ServerLoggingBehavior</code>), and not only as <em>one</em> but instead as <em>two different</em> objects, because we need to introduce a new class per logging behavior. This is counter-intuitive. We human beings are very used to doing things differently according to the context. In our reasoning we perform the “same” action (logging) in both contexts - just adapted to the needs of the current context. If the object and the method were coupled through a rhizome, we could decide at runtime which method is currently attached to the object. Then there would be no need to write different classes and instantiate a client- or a server-behavior object before calling the log method.</p>
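<p>As an aside, JavaScript - where methods are just properties holding functions - can at least mimic this kind of runtime re-coupling. A toy sketch with invented names:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js">// Toy sketch: one logger object whose logException behavior is chosen
// and attached at runtime, instead of one strategy class per behavior.
// All names are invented for this example.
function clientBehavior(e) { alert("Error: " + e.message); }
function serverBehavior(e) { console.error("[log] " + e.message); }

var logger = {};
// Decide once at runtime where we are running; in a browser "window"
// exists, on a server (e.g. Node.js) it usually does not.
logger.logException = (typeof window === "undefined") ? serverBehavior : clientBehavior;

logger.logException(new Error("something failed"));</code></pre></figure>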
<h2 id="uniform-resource-identifier-uris">Uniform Resource Identifier (URIs)</h2>
<p>The internet is built upon the node/edge paradigm. Websites (nodes) can contain links (edges) to other websites. What is not possible is to link from one link to another link. Whereas a <em>Uniform Resource Identifier (URI)</em>, as its name says, identifies a resource on the world wide web, the URI itself has no identity. It cannot be identified. Being able to create links to links could be a very useful feature, as it would allow us to attach information to these URIs. For example, if a trustworthy source links to a certain other website, this might be much more relevant to me than if the link is from any other website. Or two users might disagree in some way on a certain link. This would however require us to somehow add the information on trustworthiness to the URI. And this is only possible if the URI is addressable itself, in other words, if I could create a URI to a URI.<br />
It would also be a nice feature if a link’s origin could either point to several targets or if a target could have several sources.</p>
<p>Rhizomes might support such architectures in one way or another. In a rhizome, not only the website’s URI is stored as a relation, but also the URI itself could be stored as a relation, therefore making the relation between website and URI addressable.</p>
<h2 id="other-examples">Other examples</h2>
<p>These are just a few examples presented here. Other suggested applications (using the term “pile systems”) can be found in several places:</p>
<ol>
<li>Krieg P. (2005): <em>Die paranoide Maschine - Computer zwischen Wahn und Sinn</em>. Telepolis/Heise Zeitschriften Verlag. Hannover.<br />
The book seems to me in some parts rather enthusiastic, and I am skeptical concerning some of the more elevated claims in it. Nevertheless there are some valid and interesting points, and there is a section on possible applications. (Here is another, somewhat critical <a href="http://www.vordenker.de/vgo/anmerkungen_paranoide-maschine.pdf">comment on this book by Eberhard von Goldammer</a>.)</li>
<li>Westphal R. (2006): <a href="http://www.lawsofform.de/wp-content/uploads/2006/05/RalfWestphalBlog.pdf"><em>A Journey Into The Pile Universe</em></a>. Blog entry available at: http://www.lawsofform.de/wp-content/uploads/2006/05/RalfWestphalBlog.pdf. The author shows how a “pile system” (rhizome) could be used for text storage and retrieval.</li>
</ol>
Implementation of Rhizomes2014-09-09T00:00:00+02:00http://fabian-kostadinov.github.io/2014/09/09/implementation-of-rhizomes<p>There are at least three different ways how to implement a rhizome on existing hard- and software platforms: as an object tree, relying on a programming language pointer arithmetic, or using mathematical pairing functions.<span class="more"></span></p>
<h2 id="implementation-as-object-tree">Implementation as object tree</h2>
<p>The most obvious option is to re-interpret a rhizome as an object tree and use an object-oriented approach to implementation. This is probably the easiest way to quickly come to a working solution. It works for a relatively small number of relations, but it does not scale to large numbers.</p>
<h2 id="implementation-based-on-pointer-arithmetic">Implementation based on pointer arithmetic</h2>
<p>A more sophisticated option is to use a programming language with pointer arithmetic capabilities such as C/C++ or C#. This was indeed Erez Elul’s and Miriam Bedoni’s choice when implementing the <a href="http://sourceforge.net/projects/pileworks/">first version of their <em>Pile Engine</em></a>. (I personally prefer to use the term <em>rhizome</em> for <em>pile</em> as I deem it more expressive.) There is also <a href="https://code.google.com/p/pile/">another version by Ralf Westphal (?)</a>, which I think is different from the first. Unfortunately, it is hard to find more information other than a few only partially functional links:</p>
<ul>
<li><a href="http://weblogs.asp.net/ralfw/441384">Ralf Westphal’s blog entry</a></li>
<li><a href="http://ralfbarkow.wordpress.com/2006/04/23/what-is-a-pile_object/">Ralf Barkow’s blog entry</a></li>
</ul>
<p>Although this is a highly viable option, it can become quite tricky to get the implementation details right, for example when parallelizing the whole implementation.</p>
<h2 id="implementation-based-on-pairing-functions">Implementation based on pairing functions</h2>
<p>There is however a third, highly interesting alternative, described in a short <a href="http://ralfbarkow.wordpress.com/2006/06/21/the-cauchycantor-diagonal-method/">blog post by Ralf Barkow</a>. The idea is to use a mathematical <em>pairing function</em> for our implementation. A <a href="http://en.wikipedia.org/wiki/Pairing_function">pairing function</a> is a mathematical function taking two numbers as arguments and returning a third number that uniquely identifies the pair of input arguments. It is always possible to re-compute the pair of arguments from the output value. Two pairing functions are currently known to me.</p>
<p><strong>Cantor pairing function</strong>:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js"><span class="nx">pair</span><span class="p">(</span><span class="nx">x</span><span class="p">,</span> <span class="nx">y</span><span class="p">)</span> <span class="o">=</span> <span class="nx">z</span> <span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="nx">x</span><span class="o">^</span><span class="mi">2</span> <span class="o">+</span> <span class="mi">3</span><span class="nx">x</span> <span class="o">+</span> <span class="mi">2</span><span class="nx">xy</span> <span class="o">+</span> <span class="nx">y</span> <span class="o">+</span> <span class="nx">y</span><span class="o">^</span><span class="mi">2</span><span class="p">)</span>
<span class="nx">unpair</span><span class="p">(</span><span class="nx">z</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="nx">x</span><span class="p">,</span> <span class="nx">y</span><span class="p">)</span> <span class="o">=</span> <span class="p">{</span>
<span class="nx">x</span> <span class="o">=</span> <span class="nx">z</span> <span class="o">-</span> <span class="p">(</span><span class="nx">q</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="nx">q</span><span class="p">))</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span>
<span class="nx">y</span> <span class="o">=</span> <span class="p">(</span><span class="nx">q</span> <span class="o">*</span> <span class="p">(</span><span class="mi">3</span> <span class="o">+</span> <span class="nx">q</span><span class="p">))</span><span class="o">/</span><span class="mi">2</span> <span class="o">-</span> <span class="nx">z</span> <span class="p">},</span>
<span class="kd">with</span> <span class="nx">q</span> <span class="o">=</span> <span class="nx">floor</span><span class="p">((</span><span class="o">-</span><span class="mi">1</span> <span class="o">+</span> <span class="nx">sqrt</span><span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="mi">8</span><span class="nx">z</span><span class="p">))</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span></code></pre></figure>
<p><strong>Szudzik pairing function</strong>:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js"><span class="nx">pair</span><span class="p">(</span><span class="nx">x</span><span class="p">,</span> <span class="nx">y</span><span class="p">)</span> <span class="o">=</span> <span class="nx">z</span> <span class="o">=</span> <span class="p">{</span>
<span class="nx">x</span><span class="o">^</span><span class="mi">2</span> <span class="o">+</span> <span class="nx">x</span> <span class="o">+</span> <span class="nx">y</span><span class="p">,</span> <span class="k">if</span> <span class="nx">x</span> <span class="o">=</span> <span class="nx">max</span><span class="p">(</span><span class="nx">x</span><span class="p">,</span> <span class="nx">y</span><span class="p">)</span>
<span class="nx">y</span><span class="o">^</span><span class="mi">2</span> <span class="o">+</span> <span class="nx">x</span><span class="p">,</span> <span class="nx">otherwise</span>
<span class="p">}</span>
<span class="nx">unpair</span><span class="p">(</span><span class="nx">z</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="nx">x</span><span class="p">,</span> <span class="nx">y</span><span class="p">)</span> <span class="o">=</span> <span class="p">{</span>
<span class="nx">x</span> <span class="o">=</span> <span class="nx">z</span> <span class="o">-</span> <span class="nx">floor</span><span class="p">(</span><span class="nx">sqrt</span><span class="p">(</span><span class="nx">z</span><span class="p">))</span><span class="o">^</span><span class="mi">2</span><span class="p">,</span> <span class="nx">y</span> <span class="o">=</span> <span class="nx">floor</span><span class="p">(</span><span class="nx">sqrt</span><span class="p">(</span><span class="nx">z</span><span class="p">)),</span> <span class="k">if</span> <span class="nx">z</span> <span class="o">-</span> <span class="nx">floor</span><span class="p">(</span><span class="nx">sqrt</span><span class="p">(</span><span class="nx">z</span><span class="p">))</span><span class="o">^</span><span class="mi">2</span> <span class="o"><</span> <span class="nx">floor</span><span class="p">(</span><span class="nx">sqrt</span><span class="p">(</span><span class="nx">z</span><span class="p">))</span>
<span class="nx">x</span> <span class="o">=</span> <span class="nx">floor</span><span class="p">(</span><span class="nx">sqrt</span><span class="p">(</span><span class="nx">z</span><span class="p">)),</span> <span class="nx">y</span> <span class="o">=</span> <span class="nx">z</span> <span class="o">-</span> <span class="nx">floor</span><span class="p">(</span><span class="nx">sqrt</span><span class="p">(</span><span class="nx">z</span><span class="p">))</span><span class="o">^</span><span class="mi">2</span> <span class="o">-</span> <span class="nx">floor</span><span class="p">(</span><span class="nx">sqrt</span><span class="p">(</span><span class="nx">z</span><span class="p">)),</span> <span class="nx">otherwise</span>
<span class="p">}</span></code></pre></figure>
<p>Using one of the above pairing functions, we can now assign non-negative integer values to relators and relata. The relator <em>r<sub>k</sub></em> is assigned the pairing value <code>z</code>, and the relata <em>(r<sub>i</sub>, r<sub>j</sub>)</em> the ordered pair of values <code>(x, y)</code>. According to our definition, terminal relations are defined by “pointing to themselves”, in other words both relata being the same: <em>r<sub>k</sub> <= (r<sub>i</sub>, r<sub>j</sub>)</em> with <em>r<sub>i</sub> = r<sub>j</sub></em>. Thus, it is strictly guaranteed that terminal relations are the only ones where <code>x = y</code>.</p>
<p>This figure shows an example for the Szudzik pairing function.</p>
<p><img src="/public/img/szudzik-pairing-func.jpg" alt="Szudzik pairing function example" title="Szudzik pairing function example" /></p>
<p>There are seven terminal relations <code>(0, 0)</code> to <code>(6, 6)</code> representing the characters <em>A</em> to <em>G</em>. The grid shows the <code>z</code> pairing values for each ordered pair <code>(x, y)</code> according to the Szudzik pairing function. Let us assume we wanted to encode the String <em>ABC</em> as two nested ordered pairs <code>(A, (B, C))</code>.</p>
<ol>
<li>Replace all characters with their terminal ordered pair of values: <code>((0, 0), ((1, 1), (2, 2)))</code>.</li>
<li>Repeat replacing all ordered pairs with their z pairing value until there is only a single z value left: 1) <code>(0, (3, 8))</code>, 2) <code>(0, 67)</code>, 3) <code>4489</code>.</li>
</ol>
<p>The relator <em>r<sub>4489</sub></em> encodes the nested ordered pairs <code>(A, (B, C))</code>. Nesting and order do matter; the result would be different for <code>((A, B), C)</code> or <code>(A, (C, B))</code>.</p>
<p>Of course, the process also works in reverse:</p>
<ol>
<li>Repeatedly expand all <code>z</code> pairing values by replacing them with an ordered pair of <code>(x, y)</code> values until you meet a pair where <code>x = y</code>: 1) <code>4489</code>, 2) <code>(0, 67)</code>, 3) <code>((0, 0), (3, 8))</code>, 4) <code>((0, 0), ((1, 1), (2, 2)))</code>.</li>
<li>Replace all ordered pairs by looking up their data items in the dictionary: <code>(A, (B, C))</code>.</li>
</ol>
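<p>Both procedures can be written as a pair of recursive functions. This is only a sketch, reusing <code>pair()</code> and <code>unpair()</code> from the sketch above and assuming the symbol dictionary is a plain Python dict:</p>
<pre><code class="language-python">symbols = {c: i for i, c in enumerate("ABCDEFG")}   # A..G -> 0..6
lookup = {i: c for c, i in symbols.items()}

def encode(node):
    # Replace symbols by their terminal pair (x = y), then collapse
    # nested ordered pairs bottom-up into a single z value.
    if isinstance(node, tuple):
        return pair(encode(node[0]), encode(node[1]))
    return pair(symbols[node], symbols[node])   # terminal relation

def decode(z):
    # Expand z until a pair with x = y (a terminal) is met.
    x, y = unpair(z)
    if x == y:
        return lookup[x]
    return (decode(x), decode(y))

z = encode(("A", ("B", "C")))
print(z)           # 4489
print(decode(z))   # ('A', ('B', 'C'))
</code></pre>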
<p>We can now traverse the rhizome upwards and downwards (from relata to relator and vice versa) without the need to store any data. The grid shown above merely illustrates the pairing algorithm; it is not stored anywhere as actual data. The only place so far where actual data is stored is the mapping of atomic symbols to terminal relations. This is very attractive because the symbol/terminal dictionary grows only linearly with an increasing number of terminals.</p>
<p>So how do we keep track of which non-terminal relations have already been established and which have not? We know, for example, how to compute the z pairing value for the pair of symbols <em>(B, C)</em>, but how can we know whether such a pairing has actually been established? For this purpose we need another data structure whose only task is to keep track of the established pairings.</p>
<p><strong>Edit 18<sup>th</sup> August 2015: I have found a probably even more efficient way of storing pairings than described below. See my blog entry <a href="http://fabian-kostadinov.github.io/2015/08/18/implementation-of-rhizomes-2/">Implementation of Rhizomes - Part 2</a>.</strong></p>
<p>The simplest approach is to use a two-dimensional bit array equivalent to the logical grid above. If a bit is set in the 2d-array, this indicates that a connection exists between its indices <code>x</code> and <code>y</code>. This will result in a matrix which is quite densely populated for low <code>x</code> and <code>y</code> values but sparsely populated for high ones. Unfortunately, the matrix grows in size at O(n<sup>2</sup>). Even if we only store bits, this can quickly eat up all our resources. At least for the sparsely populated matrix area we should try to find a leaner solution.</p>
<p>As an alternative, I suggest using one more hashmap (or possibly list) structure. The hashmap contains each relatum appearing in a relation as a key, and two lists as its value. One list contains all associative relata for this (normative) relatum, the other contains all normative relata for this (associative) relatum. The lists themselves are stored as packed bit strings. The best bit string packing algorithms do not even require the bit string to be unpacked to check whether a certain bit is set; unpacking and repacking is only needed for setting or clearing a bit. (Daniel Lemire’s blog contains some articles on <a href="http://lemire.me/blog/archives/2012/03/06/how-fast-is-bit-packing/">bit packing in C++</a> and <a href="http://lemire.me/blog/archives/2013/07/08/fast-integer-compression-in-java/">Java</a>.) Each list should be able to quickly return the relata this relatum is in a relation with. An algorithm I have used personally is the <em>CONCISE algorithm</em> (<a href="http://ricerca.mat.uniroma3.it/users/colanton/publications.html">Colantonio</a> and Di Pietro [1]), which seems to beat even <em>Word Aligned Hybrid (WAH)</em> bitmap compression [4], on which for example the <a href="https://sdm.lbl.gov/fastbit/">FastBit library</a> [5], [6] is based. Colantonio wrote a highly efficient open-source Java implementation of the CONCISE algorithm called <em>Extendedset</em>, which is available in two GitHub repositories [2], [3].</p>
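<p>The following is only a toy sketch of such a bookkeeping structure. A serious implementation would store the two index lists as compressed bitmaps (e.g. CONCISE or WAH, as discussed above); plain Python sets are used here merely to keep the idea visible:</p>
<pre><code class="language-python">class PairingRegistry:
    # Tracks which ordered pairings (x, y) have been established.
    def __init__(self):
        self.associative = {}   # normative relatum x -> set of y values
        self.normative = {}     # associative relatum y -> set of x values

    def establish(self, x, y):
        self.associative.setdefault(x, set()).add(y)
        self.normative.setdefault(y, set()).add(x)

    def exists(self, x, y):
        return y in self.associative.get(x, set())

registry = PairingRegistry()
registry.establish(3, 8)       # record the pairing (B, C)
print(registry.exists(3, 8))   # True
print(registry.exists(8, 3))   # False - pairs are ordered
</code></pre>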
<h2 id="some-remarks">Some remarks</h2>
<p>Some general remarks. First, using pairing functions only works if all terminals are assigned ordered pairs where <em>r<sub>i</sub> = r<sub>j</sub></em>. In the grid, all terminals are stored along the grid’s diagonal. The reason is that a stop criterion is required when traversing the rhizome in reverse, from relator to relata. Imagine we allowed terminals to be stored outside the diagonal as well. Then, for example, the character <em>C</em> could be stored as <code>(0, 2)</code>. As both <code>0</code> and <code>2</code> also appear as z-values (relators) in the grid, we could not decide whether the ordered pair <code>(0, 2)</code> already represents a terminal or must be unpaired even further to <code>((0, 0), (1, 0))</code>.</p>
<p>Second, the z pairing values can become quite large. With linear growth of the x and y values, the z values grow quadratically. Yet the expressive power, i.e. the information content, of the z values also grows quadratically. At the same time, the need for ever larger z values continuously decreases, as complicated combinations of terminals rarely occur.</p>
<p>Third, this approach requires deciding on a predefined set of terminals. It is impossible to decide a posteriori to split a terminal into several sub-relations. While this sounds like a severe restriction, I believe that in many cases it is not. In most real-world examples there exists a naturally restricted vocabulary or set of atomic symbols, which we silently accept.</p>
<ul>
<li>For example in digital computing, the atomic symbols are the binary digits <code>0</code> and <code>1</code>. Or, taking a byte as an atomic symbol, the set of possible bytes contains only 256 different atoms.</li>
<li>In language, character encodings such as ASCII or UTF-8 provide a large set of atomic letter and number symbols that are expressive enough for the vast majority of use cases.</li>
<li>Natural language uses a limited - albeit large - number of words. Applying rhizomes to natural language processing, we could create a combined set of the most common words, enhanced with the atomic letter and number symbols for composing less common words.</li>
<li>The periodic table consists of only around a hundred chemical elements.</li>
<li>The human genome consists of only four fundamental bases: guanine (G), adenine (A), thymine (T) and cytosine (C).</li>
</ul>
<p>Fourth, because rhizome trees do not store data items but only relations, multiple dictionaries can be applied to the same rhizome tree at the same time. In other words, the interpretation of any z-value or ordered pair (x, y) is entirely the responsibility of the user and the context. The situation is analogous to processing binary data. The interpretation of a bit string like <code>00001010</code> is entirely left to the computational context, which not only defines whether this bit string is to be read from left to right or from right to left, but also whether it designates a character, number, color, screen position, i/o address or something else. The minimum requirement for each context is a dictionary mapping atomic symbols to terminal relations and a second map or list keeping track of the actual pairings.</p>
<hr />
<h1 id="references">References</h1>
<p>[1] Colantonio A., Di Pietro R. (2010): <a href="http://arxiv.org/abs/1004.0403">CONCISE: Compressed ‘n’ Composable Integer Set</a>. Paper submitted to arXiv.org. No. 1004.0403.</p>
<p>[2] <a href="https://github.com/tuplejump/extendedset/">Extendedset: An open source Java implementation of the CONCISE algorithm authored by Alessandro Colantonio</a>. Version 1.</p>
<p>[3] <a href="https://github.com/metamx/extendedset">Extendedset: An open source Java implementation of the CONCISE algorithm authored by Alessandro Colantonio</a>. Version 2.</p>
<p>[4] Wu K., Otoo E. J., Shoshani A. (2001): <a href="https://sdm.lbl.gov/~kewu/ps/LBNL-48975.pdf">A Performance Comparison of Bitmap Indexes</a>. Proceedings of the Tenth International Conference on Information and Knowledge Management (CIKM ‘01). pp. 559-561.</p>
<p>[5] <a href="https://codeforge.lbl.gov/projects/fastbit/">FastBit library on Codeforge</a>.</p>
<p>[6] <a href="https://github.com/gingi/fastbit">FastBit library on GitHub</a>.</p>
A Short Introduction to Rhizomes2014-09-08T00:00:00+02:00http://fabian-kostadinov.github.io/2014/09/08/a-short-introducton-to-rhizomes<p>Let <code>r<sub>k</sub> <= (r<sub>i</sub>, r<sub>j</sub>)</code> be a <em>directed relation</em> <em>r<sub>k</sub></em> between the ordered pair of relata <em>r<sub>i</sub></em> and <em>r<sub>j</sub></em>, with <code>r<sub>k</sub> ≠ r<sub>i</sub></code> and <code>r<sub>k</sub> ≠ r<sub>j</sub></code>. We will call a relation a <em>terminal</em> iff <code>r<sub>i</sub> = r<sub>j</sub></code>. Furthermore, we call <em>r<sub>i</sub></em> (the left-hand element of the ordered pair) the <em>normative relatum</em> and <em>r<sub>j</sub></em> (the right-hand element of the ordered pair) the <em>associative relatum</em>.</p>
<p><em><strong>Definition</strong></em>:</p>
<div class="message">A _rhizome_ is recursively defined as a directed relation between two relata where both relata are directed relations themselves.</div>
<p><span class="more"></span>
To be really useful, a rhizome must be complemented by a (bidirectional) dictionary (or a similar data structure) that maps between terminals and data items. The rhizome itself does not store any data in a traditional sense, but only relations.<br />
Here is a picture of a sample rhizome:</p>
<p><img src="/public/img/simplerhizome.jpg" alt="A simple rhizome" title="A simple rhizome" /></p>
<p>This rhizome consists of four terminal and several non-terminal relations:</p>
<table>
<tr>
<th>
Terminals
</th>
<th>
Non-Terminals
</th>
</tr>
<tr>
<td>
<ul>
<li>r5 <= (r1, r1)</li>
<li>r6 <= (r2, r2)</li>
<li>r7 <= (r3, r3)</li>
<li>r8 <= (r4, r4)</li>
</ul>
</td>
<td>
<ul>
<li>r9 <= (r8, r5)</li>
<li>r10 <= (r6, r7)</li>
<li>r11 <= (r5, r8)</li>
<li>r12 <= (r7, r5)</li>
<li>r13 <= (r12, r10)</li>
<li>r14 <= (r15, r12)</li>
<li>r15 <= (r8, r10)</li>
<li>r16 <= (r13, r15)</li>
</ul>
</td>
</tr>
</table>
<p>r1 to r4 are relata but not relators. Furthermore, r9, r11, r14 and r16 are only relators but not relata.</p>
<p>On the right-hand-side of the rhizome, we added a sample dictionary. The four characters <code>A</code>, <code>B</code>, <code>C</code> and <code>D</code> are stored in the dictionary and they were assigned a terminal relation each. The rhizome tree can be traversed from top to bottom replacing each relator through its pair of relata until all relations are resolved to terminals. Finally, the terminals are replaced with values stored in the dictionary. This is a list of relators and the character strings they represent:</p>
<ul>
<li>r5 <= (r1, r1) = A</li>
<li>r6 <= (r2, r2) = B</li>
<li>r7 <= (r3, r3) = C</li>
<li>r8 <= (r4, r4) = D</li>
<li>r9 <= (r8, r5) = ((r4, r4), (r1, r1)) = DA</li>
<li>r10 <= (r6, r7) = ((r2, r2), (r3, r3)) = BC</li>
<li>r11 <= (r5, r8) = ((r1, r1), (r4, r4)) = AD</li>
<li>r12 <= (r7, r5) = ((r3, r3), (r1, r1)) = CA</li>
<li>r13 <= (r12, r10) = (((r3, r3), (r1, r1)), ((r2, r2), (r3, r3))) = CABC</li>
<li>r14 <= (r15, r12) = … = DBCCA</li>
<li>r15 <= (r8, r10) = … = DBC</li>
<li>r16 <= (r13, r15) = … = CABCDBC</li>
</ul>
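<p>For illustration, the resolution process can be sketched in a few lines of Python. The relation table mirrors the figure; the function and variable names are my own:</p>
<pre><code class="language-python">relations = {
    5: (1, 1), 6: (2, 2), 7: (3, 3), 8: (4, 4),   # terminals
    9: (8, 5), 10: (6, 7), 11: (5, 8), 12: (7, 5),
    13: (12, 10), 14: (15, 12), 15: (8, 10), 16: (13, 15),
}
dictionary = {5: "A", 6: "B", 7: "C", 8: "D"}   # terminal -> data item

def resolve(relator):
    # Replace each relator by its pair of relata until all relations
    # are resolved to terminals, then look the terminals up.
    left, right = relations[relator]
    if left == right:               # terminal: both relata identical
        return dictionary[relator]
    return resolve(left) + resolve(right)

print(resolve(16))   # CABCDBC
</code></pre>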
<p>It should be obvious from this simple example that relators higher up in the rhizome’s hierarchy encode longer character strings. A rhizome is not a data storage container in a traditional sense. Whereas in (unpacked) data containers a character string like <code>DADADA</code> contains the same characters <code>D</code>, <code>A</code> and the combination <code>DA</code> thereof multiple times, a rhizome is always free of redundancy. On the other hand, data is never simply read and retrieved from the storage system but must be regenerated on every single access.</p>
<p>Working with rhizomes requires the user to define a vocabulary of atomic symbols (such as the letters <code>A</code> to <code>D</code> in the example). All more complex data types (e.g. <code>DDAC</code>) will be combinations of these atomic symbols. As the vocabulary’s atomic data symbols are de facto immutable, and because there always exists exactly one relator per ordered pair of relata, updating or deleting existing relations is - at least theoretically - not necessary.</p>
<h2 id="a-comparison-with-graphs">A comparison with graphs</h2>
<p>In computer science, a vast number of data structures are commonly represented as graphs. According to most definitions, a <em>graph</em> is a set of nodes (vertices) and edges (links, arcs, arrows) between the nodes. Edges can be directed or undirected. Accepting such a definition for a graph, it should be noted that a rhizome is <em>not</em> a graph, as the <em>only</em> elements it consists of are relations (or directed edges). In graphs it is impossible to draw a link from one edge to another because edges are not “addressable” or “linkable”. The only way to link one edge to another is to make it a node itself. A rhizome is therefore more general or fundamental than a graph.</p>
<p>A further important distinction is the storage requirement. The storage needed for a graph with a high degree of connectivity grows quadratically with an increasing number of nodes. In contrast, the storage required for a rhizome grows only logarithmically with an increasing number of encoded relations. This is because the expressive power of higher-order relations in a rhizome grows exponentially, whereas it remains flat for added nodes in a graph.</p>
Evolving Trading Strategies With Genetic Programming - Encoding Trading Strategies2014-09-03T00:00:00+02:00http://fabian-kostadinov.github.io/2014/09/03/evolving-trading-strategies-with-genetic-programming-encoding-trading-strategies<h1 id="part-2">Part 2</h1>
<p>As I have shown in a <a href="/2014/09/01/evolving-trading-strategies-with-genetic-programming-an-overview/">previous post</a>, in GP entry and exit decision rules are encoded in tree form. The decision rule tree returns a boolean value for every processed bar, which is interpreted as an entry (root node returning <code>true</code>) or exit (root node returning <code>false</code>) signal. Some authors suggest using a single rule tree for entry and exit signals, but I personally prefer evolving dedicated rule trees for entry and exit rules, as I believe them to produce better signals.<span class="more"></span> During strategy evaluation the program alternately activates one or the other tree and ignores the signals produced by the inactive tree. This works for long-only or short-only trading. For long/short strategies this approach leads to four different decision rule trees being evolved, which obviously also increases the degrees of freedom of our trading strategies and therefore the danger of overfitting.</p>
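<p>A minimal sketch of this alternation, with trivial stand-in rules (in a real system both trees would be evolved GP trees):</p>
<pre><code class="language-python"># Stand-in rule trees; real ones would be evolved decision rule trees.
entry_tree = lambda bar: bar["roc"] > 0.0
exit_tree = lambda bar: bar["roc"] < -0.5

in_trade = False
for bar in [{"roc": 0.3}, {"roc": 0.1}, {"roc": -0.7}]:
    if not in_trade:
        if entry_tree(bar):    # exit tree is ignored while flat
            in_trade = True    # enter long
    elif exit_tree(bar):       # entry tree is ignored while in a trade
        in_trade = False       # exit long
</code></pre>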
<h2 id="typed-nodes">Typed nodes</h2>
<p>I will not explain here in detail how the GP mutation and crossover operators work on decision rule trees, as descriptions can be found in various books on GP (see for example Poli et al. 2008 [1]). In very basic GP implementations, nodes in the decision trees are untyped. This is problematic because mutation and crossover then tend to create invalid decision trees with high probability. Let us imagine that a moving average rule requires a lookback period size in the form of an integer value and a price series to operate on. A naive implementation of the mutation operator might simply exchange the integer node at random with another technical indicator node. Example: <code>EMA(Close, 12)</code> becomes <code>EMA(Close, ROC)</code> through random mutation. The mutated moving average node now receives a price series and another technical indicator as its inputs, which results in invalid, meaningless code. The same problem also applies to the crossover operator.</p>
<p>Some people simply try to repeat running these operators until a valid solution is found by luck. However, as should be obvious, this is an extremely inefficient approach. A much better alternative is to introduce types for each node. By introducing input and output types for all nodes we make sure that the mutation and crossover operators can only choose from a predefined set of allowed alternatives, thus reducing inefficiently spent search time to an absolute minimum. This of course comes at a price. Implementation of typed nodes in GP is much more complicated than simply using untyped nodes. This is one more reason to rely on readily available GP libraries including this functionality instead of trying to implement everything from scratch.</p>
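<p>To make the idea tangible, here is a small sketch of type-constrained node selection. The type and primitive names are invented for illustration; GP libraries such as DEAP ship strongly typed GP out of the box:</p>
<pre><code class="language-python">from dataclasses import dataclass

@dataclass(frozen=True)
class Primitive:
    name: str
    input_types: tuple
    output_type: str

PRIMITIVES = [
    Primitive("AND", ("bool", "bool"), "bool"),
    Primitive(">", ("num", "num"), "bool"),
    Primitive("EMA", ("series", "num"), "num"),
    Primitive("ROC", ("series", "num"), "num"),
]

def mutation_candidates(required_output):
    # Mutation may only substitute a node with one whose output type
    # matches the input type expected by the parent node.
    return [p.name for p in PRIMITIVES if p.output_type == required_output]

print(mutation_candidates("num"))    # ['EMA', 'ROC'] - ROC may replace EMA
print(mutation_candidates("bool"))   # ['AND', '>']
</code></pre>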
<p>As a consequence, the architecture of decision trees naturally follows a certain logical order.</p>
<ol>
<li>In the upper section of the tree nodes typically have the same return types as the root node, i.e. boolean values or a <em>buy/sell/do_nothing</em> signal which is aggregated in the root node.</li>
<li>In the upper middle section of the tree we often find nodes that are able to “convert” from one input type to another output type, for example the <code>>, <, >=, <=</code> or <code>crosses above/below</code> nodes which all take numeric or price inputs and return boolean outputs.</li>
<li>In the lower middle section we find the technical indicator rules, for example moving averages, oscillators and trend indicators based on open/high/low/close/volume (OHLCV) bar data.</li>
<li>The bottom section of the tree, the leaf nodes, consists entirely of either ephemeral random constant (ERC) nodes or price series data nodes.</li>
</ol>
<p>Defining the return types of nodes requires careful reflection. It is my impression that many developers fail to understand this crucial point. Consider the following decision rule:</p>
<p><code>(high<sub>t-1</sub> - close<sub>t-1</sub>) > open<sub>t</sub></code></p>
<p>Whereas at first glance there is nothing wrong with this rule from a mathematical standpoint, a closer inspection exposes it as being quite meaningless. Why is this so? Because the comparison operator is applied to two completely different objects! On the left-hand side we have a <em>difference of</em> or a <em>distance between</em> two prices, whereas on the right-hand side we have a <em>price</em>. Both have a very different semantic meaning. For the sake of the example, let us assume that <em>high<sub>t-1</sub> = 12.00$</em>, <em>close<sub>t-1</sub> = 11.66$</em> and <em>open<sub>t</sub> = 11.78$</em>. We therefore effectively compare <code>0.34$ > 11.78$</code>. Obviously this rule will return false with near certainty, because the last bar’s difference between the high and the close prices practically never exceeds the current open price! (One of the few real-world exceptions might be when a stock plunges due to a bankruptcy announcement.) One must always be concerned with the question of whether two nodes can meaningfully be compared to each other. Sometimes finding an answer is not easy.</p>
<p>Giving input and output types to nodes not only prevents the evolution of meaningless decision rules due to comparing incompatible return types, but also significantly decreases the already huge search space.</p>
<h2 id="ephemeral-random-constants">Ephemeral random constants</h2>
<p>Handling ERCs properly is another issue. Many technical indicators rely on OHLC(V) data and one or several integer or float constants. For example, a moving average has a lookback window size, which is encoded as an integer ERC. From the context we already know that this lookback window must be a positive value, all values <= 0 are disallowed. We can also in most cases define a meaningful maximum size for the same value, e.g. 500 trading days for a trading strategy with a daily trading frequency - having an even longer lookback period just does not make sense. In setting meaningful minimum and maximum sizes for ERCs we further restrict the search space. For floating point numbers, we could even only allow numbers from a pre-computed array, e.g. <code>[0.20, 0.21, 0.22, ..., 10.00]</code>. Such a design silently implies that all random mutations to a float ERC beyond a certain precision are effectively meaningless.</p>
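<p>A sketch of a range-constrained integer ERC; the class name and the mutation step size are made up for illustration:</p>
<pre><code class="language-python">import random

class IntERC:
    # An integer ERC constrained to a meaningful range, so random
    # mutation can never produce an invalid value (e.g. a negative
    # lookback period).
    def __init__(self, low, high):
        self.low, self.high = low, high
        self.value = random.randint(low, high)

    def mutate(self):
        # Take a small random step, clamped to the allowed range.
        step = random.randint(-10, 10)
        self.value = max(self.low, min(self.high, self.value + step))

lookback = IntERC(1, 500)   # moving-average lookback in trading days
lookback.mutate()
assert 1 <= lookback.value <= 500
</code></pre>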
<h2 id="beyond-entry-and-exit-signals">Beyond entry and exit signals</h2>
<p>More advanced trading strategies might not only consist of decision rule trees for long/short entry/exit signals, but additionally of (usually simpler) decision trees for position sizing or stop loss and take profit rules. Here is an example of a stop loss rule for a long position, triggered if the price falls by more than 3% below the entry price:</p>
<p><code>current price <= ((1.0 - 0.03) * entry price)</code></p>
<p>The 3% value could of course be evolved using a float ERC. Common stop loss techniques include fixed price stops set at a certain distance from the entry price, percentage stops or average true range (ATR) stops. The same or similar techniques can be used for take profit rules.</p>
<h2 id="technical-indicators">Technical indicators</h2>
<p>Which technical indicator rules to select is largely a matter of taste. My assumption is that the most common indicators are also the best ones. That is, moving averages in general and MACD in particular, stochastic oscillators, A/D and Chaikin Oscillator, Directional Movement Index and so on are all promising candidates. There are many good sources of information about technical indicators and how to implement them, see for instance <a href="http://stockcharts.com/school/doku.php?id=chart_school:technical_indicators">stockcharts.com</a> or Colby’s <em>The Encyclopedia of Technical Market Indicators</em> [2].<br />
When it comes to encoding technical indicators, the procedure is as follows (a small illustrative sketch follows the list):</p>
<ol>
<li>First, create a dedicated node per technical indicator you want to use. For example, although a MACD de facto consists of two moving averages nevertheless create a dedicated MACD node. The reason is simple. The chance for GP to evolve a powerful and complicated to compute technical indicator by itself is very small. If you however provide a dedicated MACD node right from the start as a possible candidate node the evolutionary process only has to select it from a list of possible alternatives. Of course nothing prevents you from additionally creating dedicated (simple, exponential, weighted etc.) moving average node types.</li>
<li>Next, decide upon the node’s output type. With a MACD type you are probably interested in the faster moving average crossing above or below the slower moving average. The MACD node could therefore return a boolean value, indicating whether the fast moving average is currently above the slow one or not. Or you could consider returning an enumeration constant, indicating whether the fast moving average has just crossed above, below or not at all (e.g. <code>UP</code>, <code>DOWN</code>, <code>NONE</code>).</li>
<li>Decide upon the node’s inputs. A dedicated MACD node will certainly need i) an open or close price series to operate on, ii) a lookback period for the fast moving average, iii) a lookback period for the slow moving average. Maybe you also want to add an additional input node which decides upon which moving average type (e.g. simple vs. exponential) should be used internally of the MACD node.</li>
</ol>
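<p>Here is a sketch of such a dedicated MACD crossover node, assuming close prices arrive as a pandas series; the enumeration values follow the suggestion above, everything else is illustrative:</p>
<pre><code class="language-python">import pandas as pd

def macd_signal(close, fast=12, slow=26):
    # Dedicated MACD node: returns "UP"/"DOWN" if the fast EMA has
    # just crossed above/below the slow EMA on the latest bar.
    fast_ema = close.ewm(span=fast, adjust=False).mean()
    slow_ema = close.ewm(span=slow, adjust=False).mean()
    diff = fast_ema - slow_ema
    if diff.iloc[-1] > 0 >= diff.iloc[-2]:
        return "UP"
    if diff.iloc[-1] < 0 <= diff.iloc[-2]:
        return "DOWN"
    return "NONE"

close = pd.Series([10.0, 10.2, 10.1, 10.4, 10.8, 11.0, 10.9, 11.3])
print(macd_signal(close, fast=3, slow=6))
</code></pre>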
<p>Some traders apparently believe that GP or other data mining strategies have the power to miraculously “find” new technical indicators never heard of, which then magically create the most extraordinary returns. Sorry to disappoint you guys, but that just won’t happen.</p>
<p><a href="http://ta-lib.org/">TA-Lib</a> is a good open-source library containing implementations for many technical indicators. It is available for a variety of programming languages such as C/C++, Java, Perl, Python and .NET. The only real drawback at the moment of writing is its lack of documentation.</p>
<hr />
<h2 id="other-articles">Other articles</h2>
<h3 id="previous">Previous</h3>
<ul>
<li>Part 1: <a href="/2014/09/01/evolving-trading-strategies-with-genetic-programming-an-overview/">Evolving Trading Strategies With Genetic Programming - An Overview</a></li>
</ul>
<h3 id="next">Next</h3>
<ul>
<li>Part 3: <a href="/2014/09/22/evolving-trading-strategies-with-genetic-programming-data/">Evolving Trading Strategies With Genetic Programming - Data</a></li>
</ul>
<hr />
<h1 id="references">References</h1>
<p>[1] Poli R., Langdon W. B., McPhee N. F., Koza J. R. (2008): <a href="http://cswww.essex.ac.uk/staff/poli/gp-field-guide/">A Field Guide to Genetic Programming</a>. Available online at <a href="http://cswww.essex.ac.uk/staff/poli/gp-field-guide/">http://cswww.essex.ac.uk/staff/poli/gp-field-guide</a></p>
<p>[2] Colby R. W. (2002): The Encyclopedia of Technical Market Indicators. Second Edition. McGraw-Hill, New York.</p>
Evolving Trading Strategies With Genetic Programming - An Overview2014-09-01T00:00:00+02:00http://fabian-kostadinov.github.io/2014/09/01/evolving-trading-strategies-with-genetic-programming-an-overview<h1 id="part-1">Part 1</h1>
<p>Writing a software program that creates - or to be more exact, evolves - trading strategies with genetic programming (GP) requires a set of design decisions to be taken concerning different aspects. In this article I will presume that the goal is to evolve a trading strategy consisting of technical indicators only, which will return entry and exit signals. Some elements like stop and limit orders can be co-evolved with the basic entry and exit signals. Although it is also possible to use GP for stock selection purposes based on fundamental data, we will not look deeper into this possibility.<span class="more"></span></p>
<p>A complete trading strategy combines several of the following elements:</p>
<ul>
<li>Entry and exit indicators</li>
<li>Stop loss and take profit rules</li>
<li>Position sizing rules</li>
<li>Emergency shutdown strategy</li>
</ul>
<p>GP is not well-suited to cover all these elements equally. For example, having clear-cut rules for when to shut down a consistently failing strategy is a matter of monitoring one’s trading strategies and having a general money management system in place. Money management is probably the most important part of every trading strategy, even more important than having the right exit rules in place, which in turn is more important than having the right entry rules. Money management depends on account size, personal preferences, trading style and many other variables beyond the power of GP. GP is mainly a search and optimization function, whereas proper money management requires several strategic decisions. It is a common mistake to believe that GP (or other data mining strategies) can miraculously come up with complete trading strategies on their own.</p>
<p>A point to consider is that evolving trading strategies with GP may not work for all markets. Some markets seem to be too efficiently priced for trading them successfully. Chen and Navet [1] describe some pretests that can be applied to the available data to distinguish in advance whether the selected market is likely to be tradeable or not.</p>
<h2 id="encoding-trading-strategies-as-rule-trees">Encoding trading strategies as rule trees</h2>
<p>In GP trading strategies are encoded as <em>decision rule trees</em>. A decision rule tree is a tree-like data structure that encodes an executable algorithm. By traversing the tree from the root towards its leaves, the algorithm is executed and a result is returned from the root node. An example is given in the next figure:</p>
<p><img src="/public/img/20140901_decision_tree.png" alt="Decision Tree" title="Decision Tree" /></p>
<p>This tree encodes the following decision rule:</p>
<p><code>
(EMA(Close, 12) > EMA(Close, 26)) AND (0.0 < ROC(Close, 5))
</code></p>
<p>The root node of the rule tree returns either <code>true</code> or <code>false</code>. In the example, EMA means <em>exponential moving average</em>, ROC the <em>rate of change</em>, and Close the series of close prices. The rule would then be: <em>If currently not in a trade and the decision tree returns true, then enter a new long trade</em>.<br />
Two more remarks. First, in the figure each node has its own return type which matches an input type of its parent node. <em>Terminals</em> or <em>leaves</em> of course have no input nodes. This is an important measure to avoid repeatedly breeding invalid rule trees. Second, <em>ERC</em> is an abbreviation for <em>ephemeral random constant</em>. ERCs are leaves returning a single (often numeric) value such as an integer or a float. Handling ERCs correctly can decrease the size of the search space and thus improve search results.</p>
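<p>Expressed directly in Python, the encoded rule is simply the following expression; the indicator values are made up purely for illustration:</p>
<pre><code class="language-python"># The decision rule from the figure as a plain Python expression.
rule = lambda bar: (bar["ema12"] > bar["ema26"]) and (0.0 < bar["roc5"])

bar = {"ema12": 101.4, "ema26": 100.9, "roc5": 0.8}   # made-up values
print(rule(bar))   # True -> entry signal for the current bar
</code></pre>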
<h2 id="elements-of-genetic-programming">Elements of Genetic Programming</h2>
<p>Here is a list of the most important elements of every GP task:</p>
<ol>
<li>Data</li>
<li>GP parameters and operators</li>
<li>Fitness function</li>
<li>Parsimony pressure</li>
</ol>
<p>We will take a quick look into these points but leave the specifics for a series of later articles.</p>
<h3 id="1-data">1. Data</h3>
<p>GP heavily relies on existing time series data. These data must first be obtained from a data provider and be prepared manually for usage. Unfortunately, this is often a time-consuming and tedious task. Common problems are unsuitable data formats, missing data points or erroneous data. Single-stock data must be adjusted for dividend payments, whereas stock index data is subject to an inherent survivorship bias. A further problem is the lack of data. An often encountered rule of thumb is to have at least 10x more data points than degrees of freedom for all trading strategies tested. GP deals with multiple generations of populations of trading strategies, thus magnifying the problem of sufficient data points substantially. Although the data requirements can practically never be fully met, some measures can be taken to alleviate this problem.</p>
<h3 id="2-gp-parameters-and-operators">2. GP parameters and operators</h3>
<p>GP evolves generation after generation of populations of individuals, each individual representing one possible trading strategy. This evolutionary process relies on a set of traditional GP operators, namely:</p>
<table>
<tr>
<td>_Selection_</td>
<td>Selects one or several individuals from a pool for a specific purpose (e.g. reproduction) by a predefined protocol</td>
</tr>
<tr>
<td>_Mutation_</td>
<td>Creates a new individual from an existing one by creating a copy with a small, random change</td>
</tr>
<tr>
<td>_Crossover_</td>
<td>Creates offspring from parents according to a predefined protocol</td>
</tr>
<tr>
<td>_Elitism_</td>
<td>Allows the _n_ best individuals to directly copy themselves into the next generation</td>
</tr>
</table>
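<p>As a flavour of how such operators look in code, here is a sketch of tournament selection, one common way to implement the selection operator (the function name and toy fitness are mine):</p>
<pre><code class="language-python">import random

def tournament_select(population, fitness, k=3):
    # Draw k individuals at random; the fittest contender wins.
    contenders = random.sample(population, k)
    return max(contenders, key=fitness)

# Toy usage: "individuals" are plain numbers, fitness is the identity.
print(tournament_select(list(range(100)), fitness=lambda x: x, k=3))
</code></pre>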
<p>Various implementation techniques exist for these operators, and each GP operator has different parameters to be set. Basic parameters are for instance the number of generations, the population size or the mutation probability. Parameter choices often affect the outcome of the evolutionary process in a non-linear, discontinuous way which is sometimes counter-intuitive to inexperienced users. Whereas the basic GP algorithm is actually surprisingly simple, implementing more advanced and powerful GP operators can turn out to be difficult. This is especially true if parallelism is to be used to achieve better performance. Another common issue with more basic GP implementations is that later generations tend to fill up with identical individuals, which renders the crossover and elitism operators ineffective.<br />
A variety of open-source, general-purpose GP implementations exist. It is often advisable to at least do some preliminary research and study their code.</p>
<h3 id="3-fitness-function">3. Fitness function</h3>
<p>As GP is ultimately simply a search or an optimization function, it needs to know what to search for or what to optimize. The fitness function assigns every individual a fitness value. Individuals with a better fitness value have a higher probability of surviving and reproducing. In the case of creating trading strategies, an obvious fitness function is to measure the (relative or absolute) monetary performance of every evolved trading strategy over a historical time period. A good trading strategy increases profits and decreases the risks taken. Again, finding a good fitness function turns out to be a common stumbling block for many developers. Most beginners start with selecting either a maximized net profit or a maximized hit rate (= number of correctly predicted price movements) as their fitness function. Yet, these two measures are a poor choice for various reasons that will be explained later. In most cases, having <em>continuously</em> positive returns is more important than achieving the highest returns. Also, more advanced fitness functions both maximize and minimize several measures at the same time.</p>
<h3 id="4-parsimony-pressure">4. Parsimony pressure</h3>
<p>It is a well-known mathematical fact that sufficiently complicated functions can approximate literally any historical time series perfectly. (For the same reason, cunning pseudo-scientists repeatedly appear in the media claiming to have successfully revealed secret messages hidden in historical texts such as the Bible. Given a sufficiently complicated search function, we can easily “discover” the <a href="https://www.youtube.com/watch?v=qE0B5rYdy8I">Beatles song <em>Yellow Submarine</em></a> in <a href="http://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.01.0218:book=1:card=1">Homer’s <em>Odyssey</em></a>.) Yet, the predictive power of these functions is nil. Without proper countermeasures, GP has a tendency towards <em>code bloat</em>, i.e. towards extremely large and complicated trading strategies with a truly wondrous performance on the in-sample data and total failure on the out-of-sample data. Parsimony pressure is applied to punish overblown complexity. Parsimony pressure is probably the most important and yet least understood concept when creating trading strategies with GP. Basically, two options exist to implement parsimony pressure: first, selecting a fitness function that assigns a low fitness to more complex trading strategies; second, mechanically preventing the GP reproduction operators from building overly large offspring trading strategies. Since the first option alone is usually not enough to prevent code bloat, a combination of both approaches leads to an effective parsimony pressure technique.</p>
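<p>A sketch of the first option: a fitness function with an explicit size penalty. The penalty weight <code>alpha</code> is an invented knob, not a recommended value:</p>
<pre><code class="language-python">def fitness(net_profit, tree_size, alpha=0.01):
    # Punish complexity: every node in the rule tree costs alpha
    # fitness points. Too high a value collapses evolution to trivial
    # trees; too low a value lets code bloat return.
    return net_profit - alpha * tree_size

print(fitness(net_profit=1250.0, tree_size=40))   # 1249.6
</code></pre>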
<hr />
<h2 id="other-articles">Other articles</h2>
<h3 id="next">Next</h3>
<ul>
<li>Part 2: <a href="/2014/09/03/evolving-trading-strategies-with-genetic-programming-encoding-trading-strategies/">Evolving Trading Strategies With Genetic Programming - Encoding Trading Strategies</a></li>
</ul>
<hr />
<h2 id="references">References</h2>
<p>[1] Chen S.-H., Navet N. (2007): <a href="http://hal.archives-ouvertes.fr/docs/00/16/82/69/PDF/SHC_NN_Springer2007.pdf">Failure of Genetic-Programming Induced Trading Strategies: Distinguishing between Efficient Markets and Inefficient Algorithms</a>. In: Computational Intelligence in Economics and Finance, Volume 2. Springer Verlag, Berlin Heidelberg. pp 169-182.</p>