Skip to content
afiedler edited this page Jul 11, 2011 · 6 revisions

TSDB Tutorial

This document gives a brief introduction to the TSDB library and explains how to import a simple text file of intraday stock price bars using the TSDB utilities.

Our sample data

LIFFE FTSE-100 Futures data was downloaded from here. You can find the example file I'll be using here. When first loading data into TSDB, you should open up the raw data file in a text editor to determine the data format. Don't open the file in Excel! Excel will hide important formatting you will need when importing to TSDB, and Excel also might corrupt the timestamps.

I suggest using a programmer's editor, like Notepad++, that can show you control characters such as tabs and linefeeds. In Notepad++, you can see control characters by going to View > Show symbol > Show all characters. The first 8 lines of the example dataset are shown below (without the control characters you'd see in Notepad++).

<Type>,<Flags>,<Date>,<Time>,<Price/Bid/BidSize>,<Ask/AskSize>,<TradeSize>
BEST,X_._D,20040901,06:01:05,4475,4476
BEST,X_._D,20040901,06:57:48,4475,4426
BEST,X_._.,20040901,06:58:47,4526,4426
BEST,X_._.,20040901,07:49:29,4536,4426
TRADE,X_._.,20040901,08:00:09,4490.5,,694
BEST,._._.,20040901,08:00:09,4490.5,4491.5
TRADE,._._.,20040901,08:00:10,4491,,2

Note that fields are separated by a comma, and lines end with (you'd be able to see this with Notepad++!). Also note the first line is a header line (there is no footer). Finally, an important thing to observe is that the last field, TradeSize, is completely omitted for Type=BEST (i.e. there is no trailing comma when TradeSize is missing). Count the number of fields in lines 2-5 to check yourself.

Handling text files that are too large for Notepad++

Notepad++ has a limit of a few hundred megabytes when opening text files. If you just want to see the first few lines of a text files to observe its format, you can use the Unix utility "head". This can be downloaded for Windows here. For example, to show the first 10 lines of a file, run head -n 5 [filename] from the command prompt. To save this excerpt, run head -n 5 [filename] > excerpt.txt.

Creating a new TSDB file

We'll use the tsdbcreate program to create a new TSDB file that we will import this data into. In the TSDB file, I'm going to make two separate series, one for Type=TRADE and one for Type=BEST. At a command prompt, run the following

 tsdbcreate example.tsdb trades double price int32 size string(10) flags

This will create a new TSDB file called example.tsdb with a series called "trades" with four fields: price, a floating-point double field; size, a 32-bit integer field; flags, a ten-character string; and _TSDB_timestamp, an automatically-created timestamp field.

Now, let's create another series for the best prices (Type=BEST). Run the following:

 tsdbcreate example.tsdb best_prices double bid double ask string(10) flags

This will add another series to the example.tsdb file called "best_prices". It will have two double-precision floating-point fields, bid and ask, and one string field, flags.

A note on string fields

TSDB does not support variable-length records, so shorter strings will be padded the null characters up the the field length specified when creating the series. By forcing records to have a fixed-length, TSDB is able to increase query performance greatly. When you create a string field, this cannot be changed to accommodate larger strings in the future, so be sure to add a bit of padding to your string fields to accommodate future increases in size. Because TSDB files are compressed, extra padding will not contribute too much to file size. However, when the file is uncompressed in memory, excess padding will waste RAM. So, carefully balance future expandability with current memory considerations.

Creating XML Import Instructions

TSDB includes an optimized import program that can handle many different types of raw data files. To use the program, you create an XML file that describes the layout of the raw data file and how it maps to your TSDB file. Then, run the import program which will append the raw data onto the timeseries in your database. The XML import instruction files also are able to filter out unwanted rows of data, which will allow us to import the Type=TRADE and Type=BEST lines separately.

First, lets look at an import file for the Type=BEST lines.

<?xml version="1.0" encoding="UTF-8" ?>
<dataimport>
<delimparser field_delim=",">
    <tokenfilter tokens="0" comparison="NE" value="BEST" />
    <fieldparser name="_TSDB_timestamp" type="timestamp" tokens="2,3" format_string="%Y%m%d %H:%M:%S" />
    <fieldparser name="bid" type="double" tokens="4" />
    <fieldparser name="ask" type="double" tokens="5" />
    <fieldparser name="fields" type="string(10)" tokens="1" />
</delimparser>
</dataimport>

The delimparser block tells tsdbimport about the field delimiters and any escape characters. There are two different delimited text file parsers supported by tsdbimport: a very fast basic one, and a much slower extended version. The basic parser supports no escape characters, no quote characters, and only one possible field delimiter. For simple delimited text files, you should use the basic parser. This is the default, and specifying a delimparser block with just the field_delim parameter will get you the basic parser. Here is an example of a delimparser line that will get you the extended parser:

 <delimparser field_delim="," escape_chars="\" quote_chars="&#34;" parse_mode="extended">

This would switch the parser into extended mode, which would let you correctly parse a more complex CSV file like this:

 2001-01-01,05:32:45,"This is a \"quoted string\"","45.6",34

You must specify parse_mode="extended" to get the extended parser. Note that the extended parser is at least an order of magnitude slower, so when possible use the basic parser. Also note that you can use XML entities (such as &#9; for a tab or &#34; for double quotes).

Within the delimparser block, there are two possible XML elements: <tokenfilter> and <fieldparser>. A <tokenfilter> will throw out all lines of the incoming data file where one of the line's tokens (tokens meaning fields separated by a delimiter) does not match the given rule. In this example, all lines where the first token (tokens="0") are not equal to "BEST" are thrown out. This leaves just the lines where Type=BEST to be imported.

The <fieldparser> element tells the parser how to turn one or more tokens in the raw data file into a value to store in the TSDB database. For the _TSDB_timestamp field parser, we are parsing tokens 2 and 3 as a timestamp and saving the result in the _TSDB_timestamp field. Note that in both the token filters and field parsers, you can specify more than one token separated by commas (like tokens="2,3" here). Then tsdbimport will join these two tokens by a space before sending the result to either the token filter or field parser.

For field parsers, you must specify the type of the resulting field. This affects the parser routine that tsdbimport calls internally, and must match the field type used when creating the database. For timestamp fields, you also need to specify a timestamp format for parsing in the format_string. To see the different format specifiers you can use, check out the list here.

To import the example file, save the XML import instructions as say liffe.xml, and use the tsdbimport command:

 tsdbimport liffe.xml LZ4U_LEVEL1.csv example.tsdb best_prices

This command will import the data file into the database. If the series in the database already has data, then the new data will be appended to the end. Note: TSDB does not support mis-ordered data. Your import files must be sorted from oldest to newest. Also, if appending on to a series that already has data, all of the new data must come after the end of the existing data. The tsdbimport program will throw out incoming data that is before the last timestamp of the data already in the database (and tsdbimport will print an error in this case).

Here is an example XML file for importing Type=TRADE lines:

<?xml version="1.0" encoding="UTF-8" ?>
<dataimport>
<delimparser field_delim=",">
    <tokenfilter tokens="0" comparison="NE" value="TRADE" />
    <fieldparser name="_TSDB_timestamp" type="timestamp" tokens="2,3" format_string="%Y%m%d %H:%M:%S" />
    <fieldparser name="price" type="double" tokens="4" />
    <fieldparser name="trade_size" type="int32" tokens="6" missing_token_replacement="-1" />
    <fieldparser name="fields" type="string(10)" tokens="1" />
</delimparser>
</dataimport>

The one thing to note in this case is the missing_token_replacement="-1" line for the "trade_size" field parser. Without this line, if the 7th token (token number 6, since numbering starts at 0) is missing, tsdbimport will throw an error an quit.

A note about missing values in TSDB

TSDB uses the computer's internal representation for integers and floating-point numbers, and unlike other databases, there is no concept of NULL or missing values. However, floating-point numbers have a few special values that have come to mean "missing" (see the Wikipedia article on IEEE 754 Floating Point Special Values). In TSDB, if a token to be parsed into a "double" field is empty or just whitespace, it is saved as a NaN value to mean missing. However, for an integer field (int32 or int8), there is no concept of a "missing value" or NaN. If TSDB parses a blank token into an integer field, the number saved in the database will be zero. Therefore, if you expect a field to have missing values, you should use the "double" field type.

Clone this wiki locally