
Complete Package Restructuring to Modern Python Standards #1

Open
tsenoner wants to merge 13 commits into main from feature/restructure-package


@tsenoner

Collaborator

Complete Package Restructuring to Modern Python Standards

🎯 Overview

Transform taxembed from a research codebase with scattered scripts into a production-ready Python package following modern best practices (PEP 517, 518, 621).

🚀 Key Changes

Package Structure

  • Adopted src-layout: All code now in src/taxembed/ with logical module organization
  • Proper packaging: Subpackages for models, training, data, visualization, analysis, validation, builders, and CLI
  • Vendored original: Moved Facebook's code to _vendor/ for reference

Configuration & Tooling

  • Unified config: Single pyproject.toml with all settings (Ruff, MyPy, Pytest)
  • Type safety: Added MyPy with strict checking and gradual adoption strategy
  • Code quality: Ruff for linting/formatting (all checks passing)
  • Removed: Makefile, ruff.toml, requirements.txt

Bug Fixes

  • CLI commands: Now properly use package modules instead of non-existent scripts
  • Path resolution: Fixed visualization to find data directories correctly
  • Auto-extraction: Taxonomy dump files extracted automatically when needed
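The path and auto-extraction fixes might look roughly like the sketch below. It is not the PR's implementation: the commit notes only name `ensure_taxdump()` and a root lookup "3 levels up", so the signatures, the `taxdump.tar.gz` archive name, and the assumed file location `src/taxembed/<subpackage>/<module>.py` are all hypothetical.

```python
import shutil
import tarfile
from pathlib import Path


def project_root(module_file: Path, levels_up: int = 3) -> Path:
    """Return the directory `levels_up` levels above the module's own directory.

    Assumes a layout like <root>/src/taxembed/<subpackage>/<module>.py.
    """
    return module_file.resolve().parents[levels_up]


def ensure_taxdump(data_dir: Path, archive_name: str = "taxdump.tar.gz") -> list[Path]:
    """Return the .dmp files in data_dir, extracting them from the archive if absent."""
    existing = sorted(data_dir.glob("*.dmp"))
    if existing:
        return existing  # already extracted, nothing to do

    archive = data_dir / archive_name
    if not archive.is_file():
        raise FileNotFoundError(f"No .dmp files and no archive at {archive}")

    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            if not (member.isfile() and member.name.endswith(".dmp")):
                continue
            source = tar.extractfile(member)
            if source is None:
                continue
            # Write under the bare file name so archive paths cannot escape data_dir.
            target = data_dir / Path(member.name).name
            with source, target.open("wb") as out:
                shutil.copyfileobj(source, out)

    return sorted(data_dir.glob("*.dmp"))
```

A caller would typically run `ensure_taxdump(project_root(Path(__file__)) / "data")` before reading taxonomy files, so that a missing extraction step no longer surfaces as a path error.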

Documentation

  • Development guides: Added comprehensive Ruff, MyPy, and Pytest usage
  • Consolidated: Removed duplicate docs, organized remaining into clean structure
  • Enhanced: README, CONTRIBUTING, and user guide now include full dev workflows

Cleanup

  • Removed 40+ obsolete scripts from project root
  • Cleaned up .gitignore with proper exclusions
  • Organized examples into dedicated directory

✅ Quality Assurance

All quality checks passing:

  • ✅ Ruff linting and formatting
  • ✅ MyPy type checking (32 source files)
  • ✅ Pytest test suite
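For anyone wanting to run the three checks together, a small helper along these lines could work. The exact command lines are assumptions (the PR only states that the tools are invoked via uv), and `run_checks` is a hypothetical name.

```python
import subprocess
from collections.abc import Callable, Sequence

# Assumed invocations; adjust paths and flags to the project's actual setup.
CHECKS: list[list[str]] = [
    ["uv", "run", "ruff", "check", "."],
    ["uv", "run", "ruff", "format", "--check", "."],
    ["uv", "run", "mypy", "src"],
    ["uv", "run", "pytest"],
]


def _run(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_checks(run: Callable[[Sequence[str]], int] = _run) -> bool:
    """Run every check (no early exit) and report whether all succeeded."""
    codes = [run(cmd) for cmd in CHECKS]
    return all(code == 0 for code in codes)
```

Calling `run_checks()` from a script and exiting non-zero when it returns `False` gives CI a single entry point. The `run` parameter exists only so the helper can be exercised without invoking the real tools.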

🔄 Migration Impact

BREAKING CHANGE: Complete restructure with no backwards compatibility for imports or root-level script paths.

Before: 40+ scattered scripts in root
After: Clean src/taxembed/ package structure

The CLI itself is unchanged, so all commands work the same way:

uv run taxembed train Cnidaria -as cnidaria
uv run taxembed visualize cnidaria
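The commit notes in this PR say the train command runs `python -m taxembed.cli.train` and the visualize command runs `python -m taxembed.visualization.umap_viz`. A minimal sketch of that dispatch is shown below; the mapping and the `build_command` helper are hypothetical and only illustrate the idea of routing subcommands to package modules instead of script paths.

```python
import sys

# Module targets as named in the commit notes; the dict itself is illustrative.
SUBCOMMAND_MODULES = {
    "train": "taxembed.cli.train",
    "visualize": "taxembed.visualization.umap_viz",
}


def build_command(subcommand: str, args: list[str]) -> list[str]:
    """Build the argv that runs a subcommand as a package module."""
    try:
        module = SUBCOMMAND_MODULES[subcommand]
    except KeyError:
        raise ValueError(f"Unknown subcommand: {subcommand!r}") from None
    # Reusing sys.executable keeps the child process in the same environment.
    return [sys.executable, "-m", module, *args]
```

Because the target is a module name rather than a file path, the command keeps working regardless of the current working directory, which is the failure mode the old script-path references had.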

Note: Project was not in production use.

📦 What This Enables

  • PyPI Publishing - Proper metadata and structure
  • CI/CD Integration - Clean test/lint commands
  • Type Safety - Gradual type hint adoption
  • Maintainability - Clear module boundaries
  • Contribution - Professional structure with tooling

Result: A production-ready, professionally structured Python package that is set up for future development.

- Update project metadata (keywords, classifiers for PyPI)
- Reorganize sections in canonical PEP 621 order
- Update version to 1.0.0
- Add comprehensive package metadata
- Move all Ruff settings from ruff.toml to pyproject.toml
- Delete ruff.toml file
- Add C901 complexity rule to ignore list
- Add strict MyPy configuration to pyproject.toml
- Configure module-level overrides for gradual migration
- Enable disallow_untyped_defs with override exceptions
- Add mypy to dev dependencies
- Move all code to src/taxembed/ package
- Create logical modules: models, training, data, visualization, analysis, validation, builders, cli
- Add proper __init__.py files for clean imports
- Organize code into maintainable structure
- Delete train_small.py, train_hierarchical.py
- Remove analyze_hierarchy*.py files
- Clean up build_transitive_closure.py and other root scripts
- All functionality now in src/taxembed/ package
- Remove docs/archive/ directory (30+ redundant files)
- Remove obsolete documentation files
- Keep README.md, CONTRIBUTING.md, docs/user-guide.md, docs/theory.md
- Consolidate scattered docs into clean structure
- Preserve original poincare-embeddings code for reference
- Add _vendor/README.md explaining provenance
- Train command uses python -m taxembed.cli.train
- Visualize command uses python -m taxembed.visualization.umap_viz
- Remove non-existent script path references
- All CLI commands now properly use package structure

Note: Changes included in previous structure commit
- Fix path resolution to find project root (3 levels up)
- Add auto-extraction of .dmp files via ensure_taxdump()
- Remove wrong data directory path calculation
- Ensure visualization finds taxonomy files

Note: Changes included in previous structure commit
- Add Development section to README.md with tool quickstart
- Expand CONTRIBUTING.md with detailed Ruff, MyPy, Pytest guides
- Add Development section to docs/user-guide.md
- Add docs/theory.md with mathematical background
- Include code quality workflow examples
- Add .mypy_cache/ for MyPy type checker
- Use /data/ so that only the root-level data directory is ignored, not src/taxembed/data/
- Remove obsolete entries (wordnet, tox, old venv names)
- Simplify and organize ignore patterns
- Delete Makefile in favor of pyproject.toml scripts
- Remove requirements.txt (using pyproject.toml)
- Remove QUICKSTART.md (info now in README.md)
- Clean up remaining hype/ Cython files
- Remove scripts/cleanup and regenerate scripts
- All commands now via uv and pyproject.toml
- Add examples/ directory with demonstration scripts
- Add comprehensive test suite (conftest, test_models, test_training, test_data, test_validation)
- Add CODE_OF_CONDUCT.md
- Update uv.lock with latest dependencies
- Complete package restructuring