psteitz/pytink

Stock Price Prediction with Transformers

A transformer-based model that treats stock price movements as a sequence-to-sequence modeling problem, predicting future prices from previous sequences of prices.

Motivation

Just as transformer models learn to predict the next word in a sentence, this system learns to predict the next "word" of price movements across a portfolio of stocks. Each "word" encodes simultaneous price changes across stocks over a given time interval.

Stock prices don't move in isolation. The idea here is to test whether an attention-based approach can learn relationships between stock movements over time. Each model covers a vector of stocks; the vector length is configurable and defaults to 20 randomly selected high-volume stocks. At each (configurable) time increment, the price change of every stock in the vector is recorded. The changes are quantized into (configurable) bins (e.g., [-.01, -.005, -.001, 0, .001, .005, .01]), which are mapped to letters. The letters are concatenated to form tokens, and the transformer model is trained to predict the next token in the sequence.

How It Works

  1. Quantize price changes into discrete symbols (a-g representing -1% to +1%)
  2. Create tokens by concatenating symbols for all stocks at each time interval
  3. Build sequences of consecutive tokens
  4. Train a transformer to predict the next token given previous tokens

The model learns which price movement patterns tend to follow other patterns—essentially learning the "grammar" of market movements.
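Steps 3 and 4 amount to sliding a fixed-size window over the token stream. The following is a minimal sketch of that idea, not the project's code: `build_vocab` and `make_training_pairs` are hypothetical helpers, and assigning integer ids in order of first appearance is an assumption.

```python
from typing import Dict, List, Tuple


def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Assign each distinct token an integer id in order of first appearance (assumed scheme)."""
    vocab: Dict[str, int] = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def make_training_pairs(
    tokens: List[str], context_window_size: int
) -> List[Tuple[List[int], int]]:
    """Hypothetical helper: each window of ids is paired with the id of the token that follows it."""
    vocab = build_vocab(tokens)
    ids = [vocab[t] for t in tokens]
    pairs = []
    for i in range(len(ids) - context_window_size):
        window = ids[i:i + context_window_size]
        label = ids[i + context_window_size]
        pairs.append((window, label))
    return pairs


# Five two-stock tokens, context window of 2 -> three (input, label) pairs.
tokens = ["dd", "ce", "dd", "gf", "ce"]
pairs = make_training_pairs(tokens, context_window_size=2)
```

With N tokens and a window of size W, this produces N - W training pairs, so longer context windows leave fewer examples from the same history.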

Project Overview

This project trains a transformer model to predict the next sequence of stock price changes given a history of previous sequences. Stock price changes are encoded as letters representing different percentage change ranges.

Delta Encoding

Delta-encoding maps price changes to letters. Encodings are configurable. The default is:

Price change

  • a: -1.0% (-.01)
  • b: -0.5% (-.005)
  • c: -0.1% (-.001)
  • d: 0.0% (0)
  • e: +0.1% (+.001)
  • f: +0.5% (+.005)
  • g: +1.0% (+.01)

Actual price changes are mapped to the nearest neighbor from the list above.

Note that this makes all changes of magnitude more than 1% map to 'a' or 'g'.
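The nearest-neighbor rule can be sketched in a few lines. This is a standalone illustration using the default levels listed above, not the implementation in `processor.py`; the tie-breaking behavior (the lower level wins when a change is exactly halfway between two levels) is an assumption.

```python
import string
from typing import Sequence

# Default levels from the list above, as fractions (0.01 == 1%).
DEFAULT_DELTA_RANGES = [-0.01, -0.005, -0.001, 0.0, 0.001, 0.005, 0.01]


def nearest_symbol(delta: float, ranges: Sequence[float] = DEFAULT_DELTA_RANGES) -> str:
    """Hypothetical helper: return the letter of the level closest to `delta`.

    min() returns the first minimum, so ties resolve to the lower level.
    """
    nearest = min(range(len(ranges)), key=lambda i: abs(ranges[i] - delta))
    return string.ascii_lowercase[nearest]


small_move = nearest_symbol(0.0004)   # closer to 0.0 than to 0.001
big_gain = nearest_symbol(0.03)       # beyond +1%, clamps to the top level
big_loss = nearest_symbol(-0.25)      # beyond -1%, clamps to the bottom level
```

Because the outermost levels absorb every larger move, a 3% gain and a 25% gain produce the same symbol under the default encoding.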

Tokens

Tokens are formed by concatenating the change encodings for each of the stocks in a list of stocks included in a model.

Example

The token acgaeb means:

  • Stock 1: a (-1%)
  • Stock 2: c (-0.1%)
  • Stock 3: g (+1%)
  • Stock 4: a (-1%)
  • Stock 5: e (+0.1%)
  • Stock 6: b (-0.5%)
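The example above can be reproduced with a small lookup table. This is an illustrative sketch with hypothetical helper names (`encode_token`, `decode_token`), assuming the default encoding and that symbols appear in the same order as the model's stock list.

```python
from typing import Dict, List

# Default symbol levels, as fractions (0.01 == 1%).
SYMBOL_TO_DELTA: Dict[str, float] = {
    "a": -0.01, "b": -0.005, "c": -0.001, "d": 0.0,
    "e": 0.001, "f": 0.005, "g": 0.01,
}


def encode_token(symbols: List[str]) -> str:
    """Concatenate one symbol per stock, in stock-list order, into a token."""
    return "".join(symbols)


def decode_token(token: str) -> List[float]:
    """Map each letter of a token back to its representative price change."""
    return [SYMBOL_TO_DELTA[s] for s in token]


token = encode_token(["a", "c", "g", "a", "e", "b"])
deltas = decode_token(token)
```

Decoding is lossy: it recovers the representative level for each stock, not the original price change.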

Project Structure

pytink/
├── pyproject.toml           # Package metadata & entry points
├── requirements.txt         # Python dependencies
├── config_template.yaml     # Configuration template
├── test_installation.py     # Verify installation
├── src/
│   └── pytink/              # Main package
│       ├── database.py      # MySQL database interface
│       ├── processor.py     # Price processing and delta encoding
│       ├── model.py         # PyTorch model and dataset classes
│       ├── analysis.py      # Visualization utilities
│       ├── farming.py       # Automated model generation
│       ├── train_model.py   # Training CLI
│       └── inference.py     # Evaluate trained models
├── tests/                   # pytest test suite
│   ├── test_database.py     # Database tests
│   ├── test_processor.py    # Processor tests
│   ├── test_model.py        # Model tests
│   ├── test_integration.py  # Integration tests
│   ├── test_inference.py    # Inference tests
│   └── test_farming.py      # Farming tests
├── models/                  # Saved model files (git-ignored)
└── logs/                    # Training logs (git-ignored)

Requirements

  • Python 3.8+
  • MySQL 5.7+
  • See requirements.txt for Python packages

Database Setup

The project expects a local MySQL database with:

  • Database name: tinker
  • Port: 3306
  • User: tinker
  • Password: Provided via --db-password command-line argument

Tables:

  • stocks: Contains id (INT), ticker (VARCHAR), name (VARCHAR)
  • quotes: Contains price (VARCHAR), timestamp (DATETIME), stock (INT foreign key)

Installation

  1. Clone the repository:
git clone <repository-url>
cd pytink
  2. Create a virtual environment:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies and package:
pip install -r requirements.txt
pip install -e .
  4. Verify installation:
pytest tests/ -q

Quick Start

# See all options
pytink-train -h

# Run with defaults (20 stocks, 10 epochs)
pytink-train --db-password YOUR_PASSWORD

# Custom configuration
pytink-train --db-password YOUR_PASSWORD --num-stocks 10 --epochs 5 --interval 15

Usage

Training a Single Model

Run the model training script:

pytink-train --db-password YOUR_PASSWORD --num-stocks 10 --interval 15 --context-window-size 8 --batch-size 64

The script performs the following workflow:

  1. Connect to Database: Establish MySQL connection and verify tables
  2. Select Stocks: Randomly select specified number of stocks with sufficient data
  3. Fetch Historical Data: Retrieve price quotes for selected stocks
  4. Process Price Data: Convert price time series into delta sequences (market hours only)
  5. Generate Words: Encode deltas as letter sequences (a-g)
  6. Analyze Patterns: Display top 10 most common price movement patterns
  7. Prepare Dataset: Create PyTorch dataset with input-output pairs
  8. Train Model: Run training loop with AdamW optimizer
  9. Evaluate: Calculate loss, accuracy, and perplexity metrics
  10. Save Model: Save trained model to models/<TICKERS>_model.pt

Market Hours Awareness

The processor automatically:

  • Skips weekends (Saturday/Sunday)
  • Skips US market holidays
  • Only processes data during market hours (9:30 AM - 4:00 PM ET)
  • Resets price baselines at each market open to avoid cross-day artifacts
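A rough sketch of these rules follows. It is not the processor's actual code: it assumes timestamps are naive datetimes already expressed in US Eastern time, treats both session boundaries as inclusive, and takes the holiday calendar as a caller-supplied set. The names `is_market_open` and `intraday_deltas` are hypothetical.

```python
from datetime import date, datetime, time
from typing import FrozenSet, List, Tuple

MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)


def is_market_open(ts: datetime, holidays: FrozenSet[date] = frozenset()) -> bool:
    """True if `ts` (assumed naive, US Eastern) falls in a regular trading session."""
    if ts.weekday() >= 5:          # 5 = Saturday, 6 = Sunday
        return False
    if ts.date() in holidays:      # caller-supplied holiday calendar (assumption)
        return False
    return MARKET_OPEN <= ts.time() <= MARKET_CLOSE


def intraday_deltas(
    quotes: List[Tuple[datetime, float]], holidays: FrozenSet[date] = frozenset()
) -> List[float]:
    """Fractional changes between consecutive in-session quotes of the same day.

    The first quote of each day only sets the baseline, so no delta spans an overnight gap.
    """
    deltas = []
    prev_ts, prev_price = None, None
    for ts, price in quotes:
        if not is_market_open(ts, holidays):
            continue
        if prev_ts is not None and prev_ts.date() == ts.date():
            deltas.append((price - prev_price) / prev_price)
        prev_ts, prev_price = ts, price
    return deltas


quotes = [
    (datetime(2026, 1, 5, 10, 0), 100.0),   # Monday
    (datetime(2026, 1, 5, 10, 30), 101.0),
    (datetime(2026, 1, 6, 10, 0), 110.0),   # Tuesday: new baseline, overnight jump ignored
    (datetime(2026, 1, 6, 10, 30), 99.0),
]
deltas = intraday_deltas(quotes)
```

Without the per-day reset, the jump from 101.0 to 110.0 between sessions would be recorded as an intraday move.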

Configuration

Parameters can be adjusted via command line or config_template.yaml:

  • --db-password: Database password (required)
  • --config: Path to YAML configuration file
  • --num-stocks: Number of stocks to analyze (default: 20)
  • --interval: Price sampling interval in minutes (default: 30)
  • --context-window-size: Context window for model input (default: 16)
  • --batch-size: Training batch size (default: 64)
  • --learning-rate: AdamW optimizer learning rate (default: 1e-5)
  • --epochs: Number of training epochs (default: 10)
  • --save-model: Save trained model to disk (default: True)

Custom Delta Ranges

You can customize the delta encoding ranges in your config file:

delta_ranges:
  - -0.05    # a: -5%
  - -0.02    # b: -2%
  - -0.01    # c: -1%
  -  0.0     # d: 0%
  -  0.01    # e: +1%
  -  0.02    # f: +2%
  -  0.05    # g: +5%

Model Architecture

The transformer model uses:

  • Vocabulary Size: Number of unique words in the dataset
  • Hidden Size: 128 dimensions (default)
  • Layers: 4 transformer layers (default)
  • Attention Heads: 4 (default)
  • Position Embeddings: Up to 256 tokens

Module Documentation

pytink.database

StockDatabase class provides:

  • connect(): Establish MySQL connection
  • get_all_stocks(): Retrieve all stocks
  • get_random_stocks(count): Get random stock sample
  • get_quotes_for_stock(stock_id): Fetch quotes for one stock
  • get_quotes_for_stocks(stock_ids): Fetch quotes for multiple stocks

pytink.processor

PriceProcessor class provides:

  • parse_price(price_str): Convert price strings to floats
  • calculate_delta(old_price, new_price): Calculate percentage change
  • delta_to_symbol(delta): Map delta to letter
  • symbol_to_delta(symbol): Map letter back to delta
  • align_quotes_by_time(quotes_dict, stock_ids): Align quotes from multiple stocks
  • extract_words(quotes_dict, stock_ids): Generate words from price data
  • count_unique_words(words): Count vocabulary size

pytink.model

StockTransformerModel class provides:

  • forward(input_ids, labels): Forward pass with optional loss computation
  • predict(input_ids): Generate predictions
  • train() / eval(): Set model mode

StockWordDataset class:

  • PyTorch Dataset for word sequences
  • Returns (input_ids, label) pairs for training

pytink.inference

Evaluate trained models on recent data:

# Evaluate a trained model on the last 3 months of data
pytink-infer --db-password YOUR_PASSWORD --model-dir models/AAPL-GOOGL-MSFT/20260101_120000/

# Evaluate on the last 6 months
pytink-infer --db-password YOUR_PASSWORD --model-dir models/AAPL-GOOGL-MSFT/20260101_120000/ --months 6

The script:

  • Loads model configuration and weights from the specified directory
  • Fetches quotes for the model's tickers from the database
  • Evaluates on data from the last N months (default: 3)
  • Reports overall accuracy, loss, perplexity, and per-stock confusion matrices

Model Farming

Model farming automatically generates and evolves a pool of random stock prediction models, searching for high-accuracy combinations of stocks and hyperparameters.

# Run with defaults (100-model pool, 10 generations)
pytink-farm --db-password YOUR_PASSWORD

# Smaller, faster run for experimentation
pytink-farm --db-password YOUR_PASSWORD --num-models 20 --num-generations 3

# Custom parameters
pytink-farm --db-password YOUR_PASSWORD \
  --num-models 50 --num-generations 5 \
  --min-stocks 5 --max-stocks 10 \
  --epochs 10 --interval 15 --batch-size 32

The farming pipeline:

  1. Cold Start: Trains --num-models randomly configured models (random stock sets within the configured min/max range)
  2. Generational Cycles: Each generation keeps the top 25% of the pool and replaces the bottom 75% with newly trained random models
  3. Leaderboard: Prints a ranked table of the top 10 models by accuracy at the end
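The generational step can be sketched as a sort, a cut, and a refill. This is an illustration only: `run_generation` is a hypothetical function, models are represented as plain dictionaries with an `accuracy` key, and the training call is stubbed out with random numbers rather than real training.

```python
import random
from typing import Callable, Dict, List

Model = Dict[str, float]


def run_generation(
    pool: List[Model],
    train_random_model: Callable[[], Model],
    keep_fraction: float = 0.25,
) -> List[Model]:
    """Keep the top `keep_fraction` of the pool by accuracy and refill with new random models."""
    ranked = sorted(pool, key=lambda m: m["accuracy"], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    survivors = ranked[:n_keep]
    newcomers = [train_random_model() for _ in range(len(pool) - n_keep)]
    return survivors + newcomers


# Stub standing in for real training: returns a model with a random accuracy.
rng = random.Random(0)


def fake_train() -> Model:
    return {"accuracy": rng.random()}


pool = [fake_train() for _ in range(8)]      # cold start
best_before = max(m["accuracy"] for m in pool)
for _ in range(3):                           # three generational cycles
    pool = run_generation(pool, fake_train)
best_after = max(m["accuracy"] for m in pool)
```

Because the survivors always include the current best model, the best accuracy in the pool cannot decrease from one generation to the next.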

Each model trained by the farm is appended to models.parquet at the project root, recording tickers, accuracy, loss, perplexity, and all training parameters. This file accumulates across runs.

Farming CLI parameters

| Parameter | Default | Description |
|---|---|---|
| --db-password | (required) | Database password |
| --num-models | 100 | Pool size |
| --num-generations | 10 | Generational replacement cycles |
| --min-stocks | 5 | Minimum stocks per model |
| --max-stocks | 15 | Maximum stocks per model |
| --epochs | 5 | Training epochs per model |
| --interval | 30 | Price sampling interval in minutes |
| --batch-size | 64 | Training batch size |
| --learning-rate | 0.0003 | Adam optimizer learning rate |

pytink.farming

ModelFarm class provides:

  • cold_start(): Populate the pool with randomly generated models
  • run(): Full pipeline — cold start, generational cycles, leaderboard display
  • display_top_models(n): Print ranked leaderboard of top N models

Performance Metrics

The analysis script calculates:

  • Loss: Cross-entropy loss on the dataset
  • Accuracy: Percentage of correct predictions
  • Perplexity: exp(loss), a common NLP metric; lower is better
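To make the relationship between the three metrics concrete, here is a small pure-Python sketch. The helper names are hypothetical and the inputs are made up for illustration; the project itself presumably computes these from PyTorch tensors.

```python
import math
from typing import List


def cross_entropy(true_token_probs: List[float]) -> float:
    """Mean negative log-probability the model assigned to each correct token."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss)


def accuracy(predictions: List[int], labels: List[int]) -> float:
    """Fraction of positions where the predicted token id equals the label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# A model that always gives the correct token probability 0.5.
loss = cross_entropy([0.5, 0.5, 0.5])
ppl = perplexity(loss)
acc = accuracy([1, 2, 3], [1, 2, 0])
```

Perplexity can be read as an effective number of equally likely choices: a model guessing uniformly over a vocabulary of V tokens has perplexity V.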

Notes

  • With 10 stocks and 7 delta levels, the maximum possible vocabulary is 7^10 ≈ 282 million words, but actual data typically contains far fewer unique words
  • The model learns patterns in how stock prices change together
  • Database connectivity is required; ensure MySQL is running before starting
  • Models are saved to models/<TICKERS>_model.pt by default
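The vocabulary bound in the first note is simple arithmetic and can be checked directly. The helper name below is hypothetical.

```python
def max_vocabulary_size(num_stocks: int, num_levels: int = 7) -> int:
    """Upper bound on distinct tokens: one of `num_levels` symbols per stock."""
    return num_levels ** num_stocks


ten_stock_bound = max_vocabulary_size(10)      # 7**10
twenty_stock_bound = max_vocabulary_size(20)   # 7**20, the default 20-stock setup
```

The bound grows exponentially with the number of stocks, while the number of distinct tokens actually observed can never exceed the number of time intervals in the data.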

Testing

Run the test suite:

# All tests
pytest tests/ -v

# Specific module
pytest tests/test_processor.py -v

# With coverage
pytest tests/ --cov=src/pytink --cov-report=term-missing

Future Enhancements

  • Implement train/validation/test splits
  • Try different model architectures (increased layers, attention heads)
  • Add regularization techniques (dropout, layer normalization)
  • Generate longer sequences (multi-step ahead predictions)
  • Analyze prediction patterns for trading signals
  • Add support for other markets (extended hours, international exchanges)

Documentation

  • QUICKSTART.md: 5-minute getting started guide
  • EXAMPLES.md: Detailed usage examples
  • ALGORITHM_DETAILS.md: Technical deep-dive into the encoding scheme
  • PROJECT_SUMMARY.md: Architecture overview

License

This project is licensed under the Apache License, Version 2.0. See LICENSE for details.
