A transformer-based model that treats stock price movements as a sequence-to-sequence modeling problem, predicting future prices from previous sequences of prices.
Just as transformer models learn to predict the next word in a sentence, this system learns to predict the next "word" of price movements across a portfolio of stocks. Each "word" encodes simultaneous price changes across stocks over a given time interval.
Stock prices don't move in isolation. The idea here is to see whether an attention-based approach can learn relationships between stock movements over time. Each model focuses on a vector of stocks. The vector length is configurable and defaults to 20 randomly selected high-volume stocks. At each (configurable) time increment, the price change of every stock in the vector is recorded. The changes are quantized into (configurable) bins (e.g., [-.01, -.005, -.001, 0, .001, .005, .01]), which are mapped to letters. The letters are concatenated to form tokens, and the transformer model is trained to predict the next token in the sequence.
- Quantize price changes into discrete symbols (a-g representing -1% to +1%)
- Create tokens by concatenating symbols for all stocks at each time interval
- Build sequences of consecutive tokens
- Train a transformer to predict the next token given previous tokens
The model learns which price movement patterns tend to follow other patterns—essentially learning the "grammar" of market movements.
This project trains a transformer model to predict the next sequence of stock price changes given a history of previous sequences. Stock price changes are encoded as letters representing different percentage change ranges.
Delta-encoding maps price changes to letters. Encodings are configurable. The default is:
| Symbol | Price change |
|---|---|
| a | -1.0% (-.01) |
| b | -0.5% (-.005) |
| c | -0.1% (-.001) |
| d | 0.0% (0) |
| e | +0.1% (+.001) |
| f | +0.5% (+.005) |
| g | +1.0% (+.01) |
Actual price changes are mapped to the nearest neighbor from the list above.
Note that this makes all changes of magnitude more than 1% map to 'a' or 'g'.
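The nearest-neighbor mapping can be sketched in a few lines. This is a minimal standalone illustration using the default levels above, not the project's own code: the function names mirror the `PriceProcessor` methods only for readability, and the tie-breaking rule (the lower level wins when a change is exactly halfway between two levels) is an assumption.

```python
# Default delta levels from the encoding above, mapped to letters a-g.
DELTA_LEVELS = [-0.01, -0.005, -0.001, 0.0, 0.001, 0.005, 0.01]
SYMBOLS = "abcdefg"


def calculate_delta(old_price: float, new_price: float) -> float:
    """Fractional price change, e.g. 0.01 for a +1% move."""
    return (new_price - old_price) / old_price


def delta_to_symbol(delta: float) -> str:
    """Return the letter whose level is closest to the given change."""
    # min() keeps the first index on ties, so the lower level wins (assumption).
    nearest = min(
        range(len(DELTA_LEVELS)),
        key=lambda i: abs(DELTA_LEVELS[i] - delta),
    )
    return SYMBOLS[nearest]


print(delta_to_symbol(calculate_delta(100.0, 101.0)))  # +1% move
print(delta_to_symbol(0.03))   # larger than +1%, clamps to the top level
print(delta_to_symbol(-0.2))   # larger than -1%, clamps to the bottom level
```

Because the outermost levels are the nearest neighbors for any large move, a +3% change and a +1% change both produce `g`, which is the clamping behavior noted above.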
Tokens are formed by concatenating the change encodings for each of the stocks in a list of stocks included in a model.
The token acgaeb means:
- Stock 1: a (-1%)
- Stock 2: c (-0.1%)
- Stock 3: g (+1%)
- Stock 4: a (-1%)
- Stock 5: e (+0.1%)
- Stock 6: b (-0.5%)
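Decoding a token back into per-stock changes is a lookup per letter. The sketch below assumes the default encoding; `decode_token` and the placeholder ticker names are hypothetical and chosen only for illustration.

```python
# Default symbol-to-change mapping (fractional changes).
SYMBOL_TO_DELTA = {
    "a": -0.01, "b": -0.005, "c": -0.001, "d": 0.0,
    "e": 0.001, "f": 0.005, "g": 0.01,
}


def decode_token(token: str, tickers: list) -> dict:
    """Map each letter of a token to the change of the stock in that position."""
    if len(token) != len(tickers):
        raise ValueError("token length must match the number of stocks")
    return {ticker: SYMBOL_TO_DELTA[symbol] for ticker, symbol in zip(tickers, token)}


# Placeholder tickers; a real model would use its own stock list, in order.
tickers = ["STOCK1", "STOCK2", "STOCK3", "STOCK4", "STOCK5", "STOCK6"]
for ticker, delta in decode_token("acgaeb", tickers).items():
    print(f"{ticker}: {delta:+.1%}")
```

The position of a letter is what ties it to a stock, so the ticker order must stay fixed for the lifetime of a model.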
    pytink/
    ├── pyproject.toml            # Package metadata & entry points
    ├── requirements.txt          # Python dependencies
    ├── config_template.yaml      # Configuration template
    ├── test_installation.py      # Verify installation
    ├── src/
    │   └── pytink/               # Main package
    │       ├── database.py       # MySQL database interface
    │       ├── processor.py      # Price processing and delta encoding
    │       ├── model.py          # PyTorch model and dataset classes
    │       ├── analysis.py       # Visualization utilities
    │       ├── farming.py        # Automated model generation
    │       ├── train_model.py    # Training CLI
    │       └── inference.py      # Evaluate trained models
    ├── tests/                    # pytest test suite
    │   ├── test_database.py      # Database tests
    │   ├── test_processor.py     # Processor tests
    │   ├── test_model.py         # Model tests
    │   ├── test_integration.py   # Integration tests
    │   ├── test_inference.py     # Inference tests
    │   └── test_farming.py       # Farming tests
    ├── models/                   # Saved model files (git-ignored)
    └── logs/                     # Training logs (git-ignored)
- Python 3.8+
- MySQL 5.7+
- See requirements.txt for Python packages
The project expects a local MySQL database with:
- Database name: tinker
- Port: 3306
- User: tinker
- Password: Provided via the --db-password command-line argument
Tables:
- stocks: Contains id (INT), ticker (VARCHAR), name (VARCHAR)
- quotes: Contains price (VARCHAR), timestamp (DATETIME), stock (INT foreign key)
- Clone the repository:
    git clone <repository-url>
    cd pytink
- Create a virtual environment:
    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
- Install dependencies and package:
    pip install -r requirements.txt
    pip install -e .
- Verify installation:
    pytest tests/ -q

    # See all options
    pytink-train -h
    # Run with defaults (20 stocks, 10 epochs)
    pytink-train --db-password YOUR_PASSWORD
    # Custom configuration
    pytink-train --db-password YOUR_PASSWORD --num-stocks 10 --epochs 5 --interval 15

Run the model training script:

    pytink-train --db-password YOUR_PASSWORD --num-stocks 10 --interval 15 --context-window-size 8 --batch-size 64

The script performs the following workflow:
- Connect to Database: Establish MySQL connection and verify tables
- Select Stocks: Randomly select specified number of stocks with sufficient data
- Fetch Historical Data: Retrieve price quotes for selected stocks
- Process Price Data: Convert price time series into delta sequences (market hours only)
- Generate Words: Encode deltas as letter sequences (a-g)
- Analyze Patterns: Display top 10 most common price movement patterns
- Prepare Dataset: Create PyTorch dataset with input-output pairs
- Train Model: Run training loop with AdamW optimizer
- Evaluate: Calculate loss, accuracy, and perplexity metrics
- Save Model: Save trained model to models/<TICKERS>_model.pt
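The "Analyze Patterns" and vocabulary steps amount to counting words. The following is a rough sketch rather than the project's code: `summarize_words` is a hypothetical helper, and assigning ids in sorted word order is an assumption made so that the mapping is reproducible.

```python
from collections import Counter


def summarize_words(words, top_n=10):
    """Return the most common words and a word-to-id vocabulary."""
    counts = Counter(words)
    # Assumed convention: ids follow alphabetical word order.
    vocab = {word: idx for idx, word in enumerate(sorted(counts))}
    return counts.most_common(top_n), vocab


# Toy two-stock words; real words have one letter per stock in the model.
words = ["dd", "de", "dd", "ce", "dd", "de"]
top_patterns, vocab = summarize_words(words, top_n=2)
print(top_patterns)
print(f"vocabulary size: {len(vocab)}")
```

The vocabulary size produced this way is the number of distinct words actually observed, which is what the model's output layer needs to cover.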
The processor automatically:
- Skips weekends (Saturday/Sunday)
- Skips US market holidays
- Only processes data during market hours (9:30 AM - 4:00 PM ET)
- Resets price baselines at each market open to avoid cross-day artifacts
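The filtering rules above can be illustrated with the standard library alone. This sketch makes several assumptions that the source does not confirm: timestamps are naive datetimes already in Eastern time, the closing minute is included, and `HOLIDAYS` is a tiny hypothetical list rather than a full market calendar. Both function names are illustrative.

```python
from datetime import date, datetime, time

MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)
# Hypothetical sample; a real calendar would list every market holiday.
HOLIDAYS = {date(2025, 1, 1), date(2025, 7, 4), date(2025, 12, 25)}


def is_market_time(ts: datetime) -> bool:
    """True if ts (assumed naive Eastern time) falls in a regular session."""
    if ts.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return False
    if ts.date() in HOLIDAYS:
        return False
    return MARKET_OPEN <= ts.time() <= MARKET_CLOSE


def intraday_deltas(quotes):
    """Fractional changes between consecutive in-session quotes of one stock.

    quotes is a time-sorted list of (datetime, price). The baseline resets at
    each new trading day, so no change is computed across an overnight gap.
    """
    deltas = []
    prev_day, prev_price = None, None
    for ts, price in quotes:
        if not is_market_time(ts):
            continue
        if ts.date() == prev_day:
            deltas.append((price - prev_price) / prev_price)
        prev_day, prev_price = ts.date(), price
    return deltas
```

Resetting the baseline per day means the first quote of each session only sets the reference price, so an overnight gap never shows up as a single large delta.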
Parameters can be adjusted via command line or config_template.yaml:
- --db-password: Database password (required)
- --config: Path to YAML configuration file
- --num-stocks: Number of stocks to analyze (default: 20)
- --interval: Price sampling interval in minutes (default: 30)
- --context-window-size: Context window for model input (default: 16)
- --batch-size: Training batch size (default: 64)
- --learning-rate: AdamW optimizer learning rate (default: 1e-5)
- --epochs: Number of training epochs (default: 10)
- --save-model: Save trained model to disk (default: True)
You can customize the delta encoding ranges in your config file:
    delta_ranges:
      - -0.05  # a: -5%
      - -0.02  # b: -2%
      - -0.01  # c: -1%
      - 0.0    # d: 0%
      - 0.01   # e: +1%
      - 0.02   # f: +2%
      - 0.05   # g: +5%

The transformer model uses:
- Vocabulary Size: Number of unique words in the dataset
- Hidden Size: 128 dimensions (default)
- Layers: 4 transformer layers (default)
- Attention Heads: 4 (default)
- Position Embeddings: Up to 256 tokens
StockDatabase class provides:
- connect(): Establish MySQL connection
- get_all_stocks(): Retrieve all stocks
- get_random_stocks(count): Get random stock sample
- get_quotes_for_stock(stock_id): Fetch quotes for one stock
- get_quotes_for_stocks(stock_ids): Fetch quotes for multiple stocks
PriceProcessor class provides:
- parse_price(price_str): Convert price strings to floats
- calculate_delta(old_price, new_price): Calculate percentage change
- delta_to_symbol(delta): Map delta to letter
- symbol_to_delta(symbol): Map letter back to delta
- align_quotes_by_time(quotes_dict, stock_ids): Align quotes from multiple stocks
- extract_words(quotes_dict, stock_ids): Generate words from price data
- count_unique_words(words): Count vocabulary size
StockTransformerModel class provides:
- forward(input_ids, labels): Forward pass with optional loss computation
- predict(input_ids): Generate predictions
- train()/eval(): Set model mode
StockWordDataset class:
- PyTorch Dataset for word sequences
- Returns (input_ids, label) pairs for training
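The (input_ids, label) pairs can be pictured as a sliding window over the token-id sequence. The sketch below is plain Python rather than the actual `StockWordDataset`; `build_pairs` is a hypothetical helper, and the stride of one token between windows is an assumption.

```python
def build_pairs(token_ids, context_window_size):
    """Slide a fixed-size window over the sequence.

    Each pair holds context_window_size consecutive ids as the input and the
    id that immediately follows them as the label.
    """
    pairs = []
    for start in range(len(token_ids) - context_window_size):
        window = token_ids[start:start + context_window_size]
        label = token_ids[start + context_window_size]
        pairs.append((window, label))
    return pairs


# Five tokens with a window of three yield two training pairs.
for input_ids, label in build_pairs([1, 2, 3, 4, 5], context_window_size=3):
    print(input_ids, "->", label)
```

A sequence shorter than or equal to the context window yields no pairs, which is one reason stocks need sufficient data before training.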
Evaluate trained models on recent data:
    # Evaluate a trained model on the last 3 months of data
    pytink-infer --db-password YOUR_PASSWORD --model-dir models/AAPL-GOOGL-MSFT/20260101_120000/
    # Evaluate on the last 6 months
    pytink-infer --db-password YOUR_PASSWORD --model-dir models/AAPL-GOOGL-MSFT/20260101_120000/ --months 6

The script:
- Loads model configuration and weights from the specified directory
- Fetches quotes for the model's tickers from the database
- Evaluates on data from the last N months (default: 3)
- Reports overall accuracy, loss, perplexity, and per-stock confusion matrices
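A per-stock confusion matrix can be built by comparing tokens letter by letter, since each position in a token belongs to one stock. The sketch below shows one way to do this and is not the inference script's implementation; `per_stock_confusion` is a hypothetical name, and storing each matrix as a `Counter` keyed by (actual, predicted) is a design choice made for brevity.

```python
from collections import Counter


def per_stock_confusion(actual_tokens, predicted_tokens, num_stocks):
    """Count (actual, predicted) symbol pairs separately for each stock."""
    matrices = [Counter() for _ in range(num_stocks)]
    for actual, predicted in zip(actual_tokens, predicted_tokens):
        for i in range(num_stocks):
            matrices[i][(actual[i], predicted[i])] += 1
    return matrices


# Toy two-stock example with two predictions.
actual = ["ad", "bd"]
predicted = ["ad", "ae"]
for i, matrix in enumerate(per_stock_confusion(actual, predicted, 2), start=1):
    print(f"Stock {i}: {dict(matrix)}")
```

Breaking results down per stock matters because a token-level prediction counts as wrong if any single letter is off, which can hide stocks the model predicts well.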
Model farming automatically generates and evolves a pool of random stock prediction models, searching for high-accuracy combinations of stocks and hyperparameters.
    # Run with defaults (100-model pool, 10 generations)
    pytink-farm --db-password YOUR_PASSWORD
    # Smaller, faster run for experimentation
    pytink-farm --db-password YOUR_PASSWORD --num-models 20 --num-generations 3
    # Custom parameters
    pytink-farm --db-password YOUR_PASSWORD \
        --num-models 50 --num-generations 5 \
        --min-stocks 5 --max-stocks 10 \
        --epochs 10 --interval 15 --batch-size 32

The farming pipeline:
- Cold Start: Trains --num-models randomly configured models (random stock sets within the configured min/max range)
- Generational Cycles: Each generation keeps the top 25% of the pool and replaces the bottom 75% with newly trained random models
- Leaderboard: Prints a ranked table of the top 10 models by accuracy at the end
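One generational cycle reduces to a sort, a slice, and a refill. The sketch below is illustrative and is not how `ModelFarm` is written: `run_generation`, the dict-based model records, and the `train_random_model` callback are all hypothetical, and how the 25% cutoff is rounded is an assumption.

```python
def run_generation(pool, train_random_model, keep_fraction=0.25):
    """Keep the most accurate models and replace the rest with new ones.

    pool is a list of dicts with an "accuracy" key; train_random_model is a
    zero-argument callable returning one such dict for a freshly trained model.
    """
    ranked = sorted(pool, key=lambda m: m["accuracy"], reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))  # rounding down is assumed
    survivors = ranked[:keep]
    newcomers = [train_random_model() for _ in range(len(pool) - keep)]
    return survivors + newcomers


# Stand-in trainer so the sketch runs without a database or GPU.
def fake_trainer():
    return {"accuracy": 0.0}


pool = [{"accuracy": a} for a in (0.1, 0.4, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8)]
next_pool = run_generation(pool, fake_trainer)
print([m["accuracy"] for m in next_pool])
```

The pool size stays constant across generations, so the total training cost is roughly the cold start plus 75% of the pool for each generation.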
Each model trained by the farm is appended to models.parquet at the project root, recording tickers, accuracy, loss, perplexity, and all training parameters. This file accumulates across runs.
| Parameter | Default | Description |
|---|---|---|
| --db-password | (required) | Database password |
| --num-models | 100 | Pool size |
| --num-generations | 10 | Generational replacement cycles |
| --min-stocks | 5 | Minimum stocks per model |
| --max-stocks | 15 | Maximum stocks per model |
| --epochs | 5 | Training epochs per model |
| --interval | 30 | Price sampling interval in minutes |
| --batch-size | 64 | Training batch size |
| --learning-rate | 0.0003 | Adam optimizer learning rate |
ModelFarm class provides:
- cold_start(): Populate the pool with randomly generated models
- run(): Full pipeline: cold start, generational cycles, leaderboard display
- display_top_models(n): Print ranked leaderboard of top N models
The analysis script calculates:
- Loss: Cross-entropy loss on the dataset
- Accuracy: Percentage of correct predictions
- Perplexity: Exp(loss), a common NLP metric
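The three metrics are closely related, as a small pure-Python sketch shows. It assumes the model's output for each example is available as a dict of token probabilities that assigns a nonzero probability to the true token; the `evaluate` helper and this input format are hypothetical and differ from the tensors the real model produces.

```python
import math


def evaluate(probabilities, labels):
    """Compute mean cross-entropy loss, accuracy (%), and perplexity.

    probabilities: one dict per example mapping token -> predicted probability.
    labels: the true next token for each example.
    """
    total_loss = 0.0
    correct = 0
    for probs, label in zip(probabilities, labels):
        total_loss += -math.log(probs[label])  # cross-entropy for one example
        if max(probs, key=probs.get) == label:  # most probable token wins
            correct += 1
    loss = total_loss / len(labels)
    return {
        "loss": loss,
        "accuracy": 100.0 * correct / len(labels),
        "perplexity": math.exp(loss),
    }


# A model that guesses uniformly over 4 tokens has a perplexity of about 4.
uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
print(evaluate([uniform, uniform], ["a", "b"]))
```

Perplexity can be read as the effective number of tokens the model is choosing between, so a value well below the vocabulary size indicates the model has learned some structure.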
- With 10 stocks and 7 delta levels, the maximum possible vocabulary is 7^10 ≈ 282 million words, but actual data typically contains far fewer unique words
- The model learns patterns in how stock prices change together
- Database connectivity is required; ensure MySQL is running before starting
- Models are saved to models/<TICKERS>_model.pt by default
Run the test suite:
    # All tests
    pytest tests/ -v
    # Specific module
    pytest tests/test_processor.py -v
    # With coverage
    pytest tests/ --cov=src/pytink --cov-report=term-missing

- Implement train/validation/test splits
- Try different model architectures (increased layers, attention heads)
- Add regularization techniques (dropout, layer normalization)
- Generate longer sequences (multi-step ahead predictions)
- Analyze prediction patterns for trading signals
- Add support for other markets (extended hours, international exchanges)
- QUICKSTART.md: 5-minute getting started guide
- EXAMPLES.md: Detailed usage examples
- ALGORITHM_DETAILS.md: Technical deep-dive into the encoding scheme
- PROJECT_SUMMARY.md: Architecture overview
This project is licensed under the Apache License, Version 2.0. See LICENSE for details.