## Project structure
```
t2s/
├── configs/
│   └── config.py           # ModelConfig, TrainConfig
├── data/
│   ├── tokenizer.py        # encode / decode / special tokens
│   ├── dataset.py          # AlignedSQLDataset, BinDataset, CurriculumSampler
│   ├── preprocess.py       # clean_txt_file, split_and_tokenize
│   └── loader.py           # load_datasets (builds/caches .pt files)
├── model/
│   └── gpt.py              # LayerNorm, Attention, MLP, Block, GPT, masked_sql_loss
├── training/
│   ├── trainer.py          # train_stage, get_lr, evaluate
│   └── checkpoint.py       # save_checkpoint, load_checkpoint
├── inference/
│   └── generate.py         # generate_sql, extract_sql, load_best_model
├── utils/
│   └── plot.py             # training curve plots
├── scripts/
│   ├── train.py            # full curriculum training entrypoint
│   ├── finetune.py         # fine-tune from checkpoint
│   └── evaluate_spider.py  # Spider benchmark evaluation
└── README.md
```
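The tree above lists a `get_lr` helper in `training/trainer.py`. A common shape for such schedulers in GPT-style training is linear warmup followed by cosine decay; the minimal sketch below illustrates that shape only. The hyperparameter names and values here are illustrative assumptions, not the project's actual configuration (those live in `configs/config.py`):

```python
import math

# Illustrative hyperparameters -- assumed for this sketch, not taken
# from the project's TrainConfig.
MAX_LR = 3e-4
MIN_LR = 3e-5
WARMUP_STEPS = 100
DECAY_STEPS = 1000


def get_lr(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        # Linear warmup: ramp from near 0 up to MAX_LR.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= DECAY_STEPS:
        # Past the decay horizon, hold the floor learning rate.
        return MIN_LR
    # Cosine decay between WARMUP_STEPS and DECAY_STEPS.
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)
```

In practice the returned value is assigned to each optimizer parameter group once per step; the cosine floor (`MIN_LR`) keeps late-stage fine-tuning from stalling at a zero learning rate.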
## Usage
### 1. Clean raw data
```python
from data.preprocess import clean_all
clean_all()