Stock Price Prediction with Transformers

A transformer-based model that treats stock price movements as a sequence-to-sequence modeling problem, predicting future prices from previous sequences of prices.

Motivation

Just as transformer models learn to predict the next word in a sentence, this system learns to predict the next "word" of price movements across a portfolio of stocks. Each "word" encodes simultaneous price changes across stocks over a given time interval.

Stock prices don't move in isolation. The idea here is to test whether an attention-based model can learn relationships between stock movements over time. Each model focuses on a vector of stocks whose length is configurable, defaulting to 20 randomly selected high-volume stocks. At each (configurable) time increment, the price change is recorded for every stock in the vector. The changes are quantized into (configurable) bins (e.g., [-.01, -.005, -.001, 0, .001, .005, .01]), which are mapped to letters. The letters are concatenated to form tokens, and the transformer model is trained to predict the next token in the sequence.

How It Works

Quantize price changes into discrete symbols (a-g representing -1% to +1%)
Create tokens by concatenating symbols for all stocks at each time interval
Build sequences of consecutive tokens
Train a transformer to predict the next token given previous tokens

The model learns which price movement patterns tend to follow other patterns—essentially learning the "grammar" of market movements.

Project Overview

This project trains a transformer model to predict the next sequence of stock price changes given a history of previous sequences. Stock price changes are encoded as letters representing different percentage change ranges.

Delta Encoding

Delta-encoding maps price changes to letters. Encodings are configurable. The default is:

Letter: price change (fractional value)

a: -1.0% (-.01)
b: -0.5% (-.005)
c: -0.1% (-.001)
d: 0.0% (0)
e: +0.1% (+.001)
f: +0.5% (+.005)
g: +1.0% (+.01)

Actual price changes are mapped to the nearest neighbor from the list above.

Note that this makes all changes of magnitude more than 1% map to 'a' or 'g'.
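
The nearest-neighbor mapping can be sketched in a few lines. This is a minimal illustration using the default levels listed above, not the code in processor.py: the function names mirror the PriceProcessor methods documented later in this README, but their bodies here, and the choice to break ties toward the lower level, are assumptions.

```python
# Default levels from the list above, expressed as fractional changes.
DELTA_LEVELS = [-0.01, -0.005, -0.001, 0.0, 0.001, 0.005, 0.01]
SYMBOLS = "abcdefg"


def calculate_delta(old_price, new_price):
    """Fractional price change between two consecutive samples."""
    return (new_price - old_price) / old_price


def delta_to_symbol(delta):
    """Return the letter whose level is closest to `delta`.

    Assumption: on an exact tie, the lower level wins (min() keeps the
    first index it sees). The real package may break ties differently.
    """
    nearest = min(
        range(len(DELTA_LEVELS)),
        key=lambda i: abs(DELTA_LEVELS[i] - delta),
    )
    return SYMBOLS[nearest]
```

Because the outermost levels are the nearest neighbors of every large move, a change of +3% and a change of +20% both come out as 'g'.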

Tokens

Tokens (also called "words" elsewhere in this README) are formed by concatenating the change encodings of every stock in the model's stock list, one letter per stock, always in the same stock order.

Example

The token acgaeb means:

Stock 1: a (-1%)
Stock 2: c (-0.1%)
Stock 3: g (+1%)
Stock 4: a (-1%)
Stock 5: e (+0.1%)
Stock 6: b (-0.5%)
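
The example above can be reproduced with a small encode/decode sketch. The helper names and the placeholder tickers (S1 to S6) are hypothetical and exist only for illustration; the letter values are the defaults from the Delta Encoding section.

```python
# Default letter values from the Delta Encoding section.
SYMBOL_TO_DELTA = {
    "a": -0.01, "b": -0.005, "c": -0.001, "d": 0.0,
    "e": 0.001, "f": 0.005, "g": 0.01,
}


def encode_token(symbols):
    """Join one letter per stock into a single token string."""
    return "".join(symbols)


def decode_token(token, tickers):
    """Map each letter of a token back to its stock and delta level.

    `tickers` must list the stocks in the same order used to build the token.
    """
    if len(token) != len(tickers):
        raise ValueError("token length must equal the number of stocks")
    return {ticker: SYMBOL_TO_DELTA[s] for ticker, s in zip(tickers, token)}


# Placeholder tickers, not real symbols from the database.
tickers = ["S1", "S2", "S3", "S4", "S5", "S6"]
decoded = decode_token("acgaeb", tickers)
```

Decoding returns the level each letter stands for, not the original price change, since quantization discards the exact value.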

Project Structure

pytink/
├── pyproject.toml           # Package metadata & entry points
├── requirements.txt         # Python dependencies
├── config_template.yaml     # Configuration template
├── test_installation.py     # Verify installation
├── src/
│   └── pytink/              # Main package
│       ├── database.py      # MySQL database interface
│       ├── processor.py     # Price processing and delta encoding
│       ├── model.py         # PyTorch model and dataset classes
│       ├── analysis.py      # Visualization utilities
│       ├── farming.py       # Automated model generation
│       ├── train_model.py   # Training CLI
│       └── inference.py     # Evaluate trained models
├── tests/                   # pytest test suite
│   ├── test_database.py     # Database tests
│   ├── test_processor.py    # Processor tests
│   ├── test_model.py        # Model tests
│   ├── test_integration.py  # Integration tests
│   ├── test_inference.py    # Inference tests
│   └── test_farming.py      # Farming tests
├── models/                  # Saved model files (git-ignored)
└── logs/                    # Training logs (git-ignored)

Requirements

Python 3.8+
MySQL 5.7+
See requirements.txt for Python packages

Database Setup

The project expects a local MySQL database with:

Database name: tinker
Port: 3306
User: tinker
Password: Provided via --db-password command-line argument

Tables:

stocks: Contains id (INT), ticker (VARCHAR), name (VARCHAR)
quotes: Contains price (VARCHAR), timestamp (DATETIME), stock (INT foreign key)

Installation

Clone the repository:

git clone <repository-url>
cd pytink

Create a virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install dependencies and package:

pip install -r requirements.txt
pip install -e .

Verify installation:

pytest tests/ -q

Quick Start

# See all options
pytink-train -h

# Run with defaults (20 stocks, 10 epochs)
pytink-train --db-password YOUR_PASSWORD

# Custom configuration
pytink-train --db-password YOUR_PASSWORD --num-stocks 10 --epochs 5 --interval 15

Usage

Training a Single Model

Run the model training script:

pytink-train --db-password YOUR_PASSWORD --num-stocks 10 --interval 15 --context-window-size 8 --batch-size 64

The script performs the following workflow:

Connect to Database: Establish MySQL connection and verify tables
Select Stocks: Randomly select specified number of stocks with sufficient data
Fetch Historical Data: Retrieve price quotes for selected stocks
Process Price Data: Convert price time series into delta sequences (market hours only)
Generate Words: Encode deltas as letter sequences (a-g)
Analyze Patterns: Display top 10 most common price movement patterns
Prepare Dataset: Create PyTorch dataset with input-output pairs
Train Model: Run training loop with AdamW optimizer
Evaluate: Calculate loss, accuracy, and perplexity metrics
Save Model: Save trained model to models/<TICKERS>_model.pt

Market Hours Awareness

The processor automatically:

Skips weekends (Saturday/Sunday)
Skips US market holidays
Only processes data during market hours (9:30 AM - 4:00 PM ET)
Resets price baselines at each market open to avoid cross-day artifacts
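
One way to picture these rules is the sketch below. It is an interpretation, not the processor's actual code: it assumes timestamps are naive datetimes already expressed in Eastern Time, that the 4:00 PM close is inclusive, and that the holiday calendar is supplied by the caller. The function names are hypothetical.

```python
from datetime import date, datetime, time

MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)  # assumption: a quote stamped exactly 4:00 PM counts


def is_market_time(ts, holidays=frozenset()):
    """True if `ts` (naive, assumed Eastern Time) is inside a trading session."""
    if ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    if ts.date() in holidays:
        return False
    return MARKET_OPEN <= ts.time() <= MARKET_CLOSE


def intraday_deltas(quotes, holidays=frozenset()):
    """Compute fractional changes between consecutive in-session quotes.

    `quotes` is a time-sorted list of (timestamp, price). The first quote of
    each trading day only sets the baseline, so no delta spans the overnight gap.
    """
    deltas = []
    baseline = None
    baseline_day = None
    for ts, price in quotes:
        if not is_market_time(ts, holidays):
            continue
        if ts.date() != baseline_day:
            baseline_day = ts.date()  # new session: reset, emit nothing
        else:
            deltas.append((ts, (price - baseline) / baseline))
        baseline = price
    return deltas
```

With this reset, a stock that closes at 101 on Friday and opens at 200 on Monday produces no delta for the gap; the Monday open simply becomes the new baseline.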

Configuration

Parameters can be adjusted via command line or config_template.yaml:

--db-password: Database password (required)
--config: Path to YAML configuration file
--num-stocks: Number of stocks to analyze (default: 20)
--interval: Price sampling interval in minutes (default: 30)
--context-window-size: Context window for model input (default: 16)
--batch-size: Training batch size (default: 64)
--learning-rate: AdamW optimizer learning rate (default: 1e-5)
--epochs: Number of training epochs (default: 10)
--save-model: Save trained model to disk (default: True)

Custom Delta Ranges

You can customize the delta encoding ranges in your config file:

delta_ranges:
  - -0.05    # a: -5%
  - -0.02    # b: -2%
  - -0.01    # c: -1%
  -  0.0     # d: 0%
  -  0.01    # e: +1%
  -  0.02    # f: +2%
  -  0.05    # g: +5%

Model Architecture

The transformer model uses:

Vocabulary Size: Number of unique words in the dataset
Hidden Size: 128 dimensions (default)
Layers: 4 transformer layers (default)
Attention Heads: 4 (default)
Position Embeddings: Up to 256 tokens

Module Documentation

`pytink.database`

StockDatabase class provides:

connect(): Establish MySQL connection
get_all_stocks(): Retrieve all stocks
get_random_stocks(count): Get random stock sample
get_quotes_for_stock(stock_id): Fetch quotes for one stock
get_quotes_for_stocks(stock_ids): Fetch quotes for multiple stocks

`pytink.processor`

PriceProcessor class provides:

parse_price(price_str): Convert price strings to floats
calculate_delta(old_price, new_price): Calculate percentage change
delta_to_symbol(delta): Map delta to letter
symbol_to_delta(symbol): Map letter back to delta
align_quotes_by_time(quotes_dict, stock_ids): Align quotes from multiple stocks
extract_words(quotes_dict, stock_ids): Generate words from price data
count_unique_words(words): Count vocabulary size

`pytink.model`

StockTransformerModel class provides:

forward(input_ids, labels): Forward pass with optional loss computation
predict(input_ids): Generate predictions
train() / eval(): Set model mode

StockWordDataset class:

PyTorch Dataset for word sequences
Returns (input_ids, label) pairs for training
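
As a rough picture of what such (input_ids, label) pairs look like, the sketch below builds them with a sliding window in plain Python. It is an assumption about the general technique, not the StockWordDataset implementation, which presumably wraps the same idea in a PyTorch Dataset; build_vocab and make_pairs are hypothetical names.

```python
def build_vocab(words):
    """Assign an integer id to each distinct word, in order of first appearance."""
    vocab = {}
    for word in words:
        vocab.setdefault(word, len(vocab))
    return vocab


def make_pairs(words, vocab, context_window_size):
    """Slide a fixed-size window over the word sequence.

    Each pair is (ids of the previous `context_window_size` words,
    id of the word that follows them).
    """
    ids = [vocab[word] for word in words]
    pairs = []
    for end in range(context_window_size, len(ids)):
        pairs.append((ids[end - context_window_size:end], ids[end]))
    return pairs


# Toy sequence of two-stock words; real sequences come from extract_words.
words = ["dd", "de", "dd", "ed", "de"]
vocab = build_vocab(words)
pairs = make_pairs(words, vocab, context_window_size=2)
```

A sequence of N words yields N minus the context window size pairs, so short histories with large windows produce very little training data.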

`pytink.inference`

Evaluate trained models on recent data:

# Evaluate a trained model on the last 3 months of data
pytink-infer --db-password YOUR_PASSWORD --model-dir models/AAPL-GOOGL-MSFT/20260101_120000/

# Evaluate on the last 6 months
pytink-infer --db-password YOUR_PASSWORD --model-dir models/AAPL-GOOGL-MSFT/20260101_120000/ --months 6

The script:

Loads model configuration and weights from the specified directory
Fetches quotes for the model's tickers from the database
Evaluates on data from the last N months (default: 3)
Reports overall accuracy, loss, perplexity, and per-stock confusion matrices

Model Farming

Model farming automatically generates and evolves a pool of random stock prediction models, searching for high-accuracy combinations of stocks and hyperparameters.

# Run with defaults (100-model pool, 10 generations)
pytink-farm --db-password YOUR_PASSWORD

# Smaller, faster run for experimentation
pytink-farm --db-password YOUR_PASSWORD --num-models 20 --num-generations 3

# Custom parameters
pytink-farm --db-password YOUR_PASSWORD \
  --num-models 50 --num-generations 5 \
  --min-stocks 5 --max-stocks 10 \
  --epochs 10 --interval 15 --batch-size 32

The farming pipeline:

Cold Start: Trains --num-models randomly configured models (random stock sets within the configured min/max range)
Generational Cycles: Each generation keeps the top 25% of the pool and replaces the bottom 75% with newly trained random models
Leaderboard: Prints a ranked table of the top 10 models by accuracy at the end

Each model trained by the farm is appended to models.parquet at the project root, recording tickers, accuracy, loss, perplexity, and all training parameters. This file accumulates across runs.
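
The generational step described above (keep the top 25%, retrain the rest) can be sketched as follows. This is an illustration of the selection rule only: next_generation and the train_random_model callback are hypothetical names, and each model is represented as a plain dict with an accuracy key rather than whatever structure farming.py uses.

```python
def next_generation(pool, train_random_model, keep_fraction=0.25):
    """Keep the most accurate models and refill the pool with new ones.

    `pool` is a list of dicts that each carry an "accuracy" value.
    `train_random_model` is a zero-argument callable returning a new model
    record; here it stands in for a full random-configuration training run.
    """
    ranked = sorted(pool, key=lambda m: m["accuracy"], reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    survivors = ranked[:keep]
    newcomers = [train_random_model() for _ in range(len(pool) - keep)]
    return survivors + newcomers


# Toy pool of 8 models with accuracies 0.0 .. 0.7 and a stub trainer.
pool = [{"accuracy": i / 10} for i in range(8)]
new_pool = next_generation(pool, lambda: {"accuracy": 0.0})
```

The pool size stays constant across generations, so with the default 100 models each cycle trains 75 new ones.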

Farming CLI parameters

| Parameter | Default | Description |
| --- | --- | --- |
| --db-password | (required) | Database password |
| --num-models | 100 | Pool size |
| --num-generations | 10 | Generational replacement cycles |
| --min-stocks | 5 | Minimum stocks per model |
| --max-stocks | 15 | Maximum stocks per model |
| --epochs | 5 | Training epochs per model |
| --interval | 30 | Price sampling interval in minutes |
| --batch-size | 64 | Training batch size |
| --learning-rate | 0.0003 | Adam optimizer learning rate |

`pytink.farming`

ModelFarm class provides:

cold_start(): Populate the pool with randomly generated models
run(): Full pipeline — cold start, generational cycles, leaderboard display
display_top_models(n): Print ranked leaderboard of top N models

Performance Metrics

The analysis script calculates:

Loss: Cross-entropy loss on the dataset
Accuracy: Percentage of correct predictions
Perplexity: exp(loss), a common NLP metric; it can be read as the effective number of equally likely tokens the model is choosing between
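
To make the relationship between the three metrics concrete, here is a small pure-Python sketch. It assumes the model's output has already been turned into one probability distribution per prediction; the function names are hypothetical and the project itself presumably computes these with PyTorch.

```python
import math


def cross_entropy(probabilities, targets):
    """Mean negative log-probability assigned to the correct token."""
    total = sum(-math.log(p[t]) for p, t in zip(probabilities, targets))
    return total / len(targets)


def accuracy(probabilities, targets):
    """Fraction of predictions whose most likely token is the correct one."""
    correct = 0
    for p, t in zip(probabilities, targets):
        predicted = max(range(len(p)), key=p.__getitem__)
        if predicted == t:
            correct += 1
    return correct / len(targets)


def perplexity(loss):
    """Exponential of the cross-entropy loss."""
    return math.exp(loss)
```

A model that spreads probability uniformly over a vocabulary of V tokens has loss ln(V) and perplexity V, which gives a useful baseline when reading reported values.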

Notes

With 10 stocks and 7 delta levels, the maximum possible vocabulary is 7^10 ≈ 282 million words, but actual data typically contains far fewer unique words
The model learns patterns in how stock prices change together
Database connectivity is required; ensure MySQL is running before starting
Models are saved to models/<TICKERS>_model.pt by default
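
The vocabulary bound in the first note, and the gap between it and what the data actually contains, can be checked with a short sketch. The helper names are hypothetical; the standard library Counter is used here as one plausible way to count words and list the most common patterns.

```python
from collections import Counter


def max_vocabulary(num_stocks, num_levels=7):
    """Upper bound on distinct words: one of `num_levels` letters per stock."""
    return num_levels ** num_stocks


def observed_vocabulary(words):
    """Count how often each distinct word occurs in the data."""
    return Counter(words)


# Toy data: three-stock words, mostly flat.
words = ["ddd", "ddd", "dde", "ddd", "edd", "dde"]
counts = observed_vocabulary(words)
top_two = counts.most_common(2)
```

Here max_vocabulary(10) is 282,475,249, while the toy data above contains only three distinct words out of a possible 343 for three stocks.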

Testing

Run the test suite:

# All tests
pytest tests/ -v

# Specific module
pytest tests/test_processor.py -v

# With coverage
pytest tests/ --cov=src/pytink --cov-report=term-missing

Future Enhancements

Implement train/validation/test splits
Try different model architectures (increased layers, attention heads)
Add regularization techniques (dropout, layer normalization)
Generate longer sequences (multi-step ahead predictions)
Analyze prediction patterns for trading signals
Add support for other markets (extended hours, international exchanges)

Documentation

QUICKSTART.md: 5-minute getting started guide
EXAMPLES.md: Detailed usage examples
ALGORITHM_DETAILS.md: Technical deep-dive into the encoding scheme
PROJECT_SUMMARY.md: Architecture overview

License

This project is licensed under the Apache License, Version 2.0. See LICENSE for details.
