MAPS (Medical Annotation Processing System) is a Python application that parses, analyzes, and exports medical imaging annotation data from a range of imaging systems and file formats. It is built for complex imaging session data: multiple observer readings, nodule annotations, coordinate mappings, and research literature.
MAPS provides a FastAPI REST backend that accepts XML, JSON, PDF, and ZIP files, along with analytics, keyword extraction, and Supabase integration for data management.
This system was developed to address the challenges of processing heterogeneous medical annotation data formats, providing researchers and medical professionals with tools to:
- XML: Parse LIDC-IDRI and other medical imaging annotations
- JSON: Process structured annotation data
- PDF: Extract keywords from research papers and documentation
- ZIP: Batch process entire datasets with automatic extraction
- Folders: Recursive directory processing with multi-file support
- Extract observer readings, confidence scores, and nodule characteristics
- Handle multi-session observer reviews and unblinded readings
- Export data to standardized Excel templates and SQLite databases
- Import PYLIDC data directly to Supabase PostgreSQL
- Schema-agnostic parsing with automatic parse case detection
- Automatic keyword extraction from medical documents and PDFs
- Perform advanced analytics on radiologist agreement and data quality
- Process up to 1000 files per batch with real-time progress tracking
Import radiology data from PYLIDC to Supabase PostgreSQL with automatic parse case detection and keyword extraction.
- Set up Supabase: Create a project at supabase.com
- Configure: Copy .env.example to .env and add your Supabase credentials
- Migrate: Apply the database schema, one file at a time (psql -f takes a single file): for f in migrations/*.sql; do psql "$SUPABASE_DB_URL" -f "$f"; done
- Import: Run python scripts/pylidc_to_supabase.py --limit 10
**Full guide**: docs/QUICKSTART_SUPABASE.md
- Schema-Agnostic Design: Automatically detects XML structure patterns
- PYLIDC Integration: Direct import from LIDC-IDRI dataset
- Parse Case Tracking: Know which XML schema was used for each document
- Keyword Extraction: Automatic medical term extraction with categories
- JSONB Storage: Flexible PostgreSQL storage with GIN indexes
- Full-Text Search: Fast document search by keywords and content
- Analytics Ready: Materialized views and helper functions included
from maps.database.enhanced_document_repository import EnhancedDocumentRepository
from maps.adapters.pylidc_adapter import PyLIDCAdapter
import pylidc as pl
# Initialize repository with parse case and keyword tracking
repo = EnhancedDocumentRepository(
enable_parse_case_tracking=True,
enable_keyword_extraction=True
)
# Import PYLIDC scan
adapter = PyLIDCAdapter()
scan = pl.query(pl.Scan).first()
canonical_doc = adapter.scan_to_canonical(scan)
# Insert with automatic detection
doc, content, parse_case, keywords = repo.insert_canonical_document_enhanced(
canonical_doc,
source_file=f"pylidc://{scan.patient_id}",
detect_parse_case=True,
extract_keywords=True
)
print(f"Imported: {scan.patient_id}")
print(f"Parse case: {parse_case}")
print(f"Keywords extracted: {keywords}")
**Documentation**:
- Quick Start Guide - Get started in 5 minutes
- Schema-Agnostic Guide - Complete architecture documentation
- Examples - Usage examples
Fully automatic keyword extraction, case detection, and analytics on EVERY import using database triggers!
The system automatically processes ALL imports (XML, PDF, LIDC, JSON) through a complete pipeline:
ANY IMPORT → Automatic Triggers → Keywords Extracted → Case Detected → Views Updated → Ready for Analysis
- Triggers on INSERT: Automatic keyword extraction from all segment types
- Hybrid Case Detection: Filename regex (1.0 confidence) + keyword signature (0.0-1.0)
- Confidence Thresholding: Auto-assign ≥0.8, manual review <0.8
- Cross-Type Validation: Keywords appearing in both qualitative and quantitative segments
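The hybrid detection and thresholding rules above can be sketched roughly as follows. This is a minimal illustration, not the logic in the SQL migrations: the filename pattern, the `CASE_SIGNATURES` table, and the function name `detect_case` are all hypothetical, and the keyword score is a simple overlap ratio chosen for clarity.

```python
import re
from typing import Optional, Set, Tuple

# Hypothetical filename pattern and keyword signatures, for illustration only.
FILENAME_PATTERN = re.compile(r"(LIDC-IDRI-\d{4})")
CASE_SIGNATURES = {
    "LIDC-IDRI-0001": {"nodule", "malignancy", "subtlety"},
    "LIDC-IDRI-0002": {"calcification", "spiculation"},
}
AUTO_ASSIGN_THRESHOLD = 0.8  # threshold taken from the list above


def detect_case(filename: str, keywords: Set[str]) -> Tuple[Optional[str], float, str]:
    """Return (case_id, confidence, status) for one imported file."""
    # Step 1: a filename match is treated as certain (confidence 1.0).
    match = FILENAME_PATTERN.search(filename)
    if match:
        return match.group(1), 1.0, "auto_assigned"

    # Step 2: otherwise score each known case by keyword overlap (0.0 to 1.0).
    best_case, best_score = None, 0.0
    for case_id, signature in CASE_SIGNATURES.items():
        score = len(signature & keywords) / len(signature)
        if score > best_score:
            best_case, best_score = case_id, score

    # Step 3: auto-assign at or above the threshold, otherwise queue for review.
    status = "auto_assigned" if best_score >= AUTO_ASSIGN_THRESHOLD else "manual_review"
    return best_case, best_score, status
```

A file whose name carries no case identifier and whose keywords match only half of a signature would come back with status `manual_review`.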
- file_summary - Per-file aggregated statistics
- segment_statistics - Per-segment metrics (word count, numeric density, keywords)
- numeric_data_flat - Auto-extracted numeric fields from JSONB
- cases_with_evidence - Established cases with linked data
- unresolved_segments - Orphaned data needing assignment
- case_identifier_validation - Completeness metrics with actionable recommendations

- lidc_patient_summary - Patient-level consensus (9 characteristics: subtlety, malignancy, etc.)
- lidc_nodule_analysis - Per-nodule with per-radiologist columns
- lidc_patient_cases - Case-level rollup with TCIA links
- lidc_3d_contours - Spatial coordinates for 3D visualization
- lidc_contour_slices - Per-slice polygon data
- lidc_nodule_spatial_stats - Derived spatial statistics

- export_universal_wide - All data types, flattened
- export_lidc_analysis_ready - SPSS/R/Stata format (one row per radiologist rating)
- export_lidc_with_links - Patient summary with TCIA download links
- export_radiologist_data - Inter-rater analysis format
- export_top_keywords - Top 1000 keywords by relevance
- All export views accessible to anonymous users
- LIDC medical views (de-identified data)
- Universal analysis views
- Internal processing tables restricted to authenticated users
- Curated Medical Concepts: Lung-RADS®, RadLex, LIDC-IDRI, TCIA, Radiomics, cTAKES, NER
- Categories: Standardization Systems, Diagnostic Concepts, Imaging Biomarkers, Performance Metrics
- AMA Citations: Full references to source papers and documentation
- Topic Tags: Filtering by "LIDC", "Radiomics", "NLP", "Reporting", "Biomarkers", etc.
- Bidirectional Navigation: Keyword → Files/Segments/Cases AND File/Case → Keywords
- keyword_directory - Complete catalog with usage stats and citations
- keyword_occurrence_map - Where-used at segment level
- file_keyword_summary - Keywords per file
- case_keyword_summary - Keywords per case
- keyword_subject_category_summary - Rollup by category
- keyword_topic_tag_summary - Rollup by tag
The system includes complete 3D contour processing utilities:
from maps.lidc_3d_utils import (
extract_nodule_mesh,
calculate_consensus_contour,
compute_inter_rater_reliability,
generate_3d_visualization,
get_tcia_download_script
)
# Extract 3D mesh for 3D printing
mesh_path = extract_nodule_mesh("LIDC-IDRI-0001", "1", contour_data, "stl")
# Calculate consensus from multiple radiologists
consensus = calculate_consensus_contour([rad1, rad2, rad3, rad4], method='average')
# Compute inter-rater reliability
ratings = {
"malignancy": [4, 5, 4, 4],
"subtlety": [3, 3, 4, 3]
}
metrics = compute_inter_rater_reliability(ratings)
print(f"ICC: {metrics['malignancy_icc']:.3f}")
# Generate interactive 3D visualization
html_path = generate_3d_visualization("LIDC-IDRI-0001", "1", contour_data)
The complete system is deployed via 14 SQL migrations:
- 001_initial_schema - Core tables (already existed)
- 002_unified_case_identifier - Schema-agnostic foundation (already existed)
- 003-005 - Various enhancements (already existed)
- 006_automatic_triggers - Keyword extraction triggers NEW
- 007_case_detection_system - Hybrid case detection NEW
- 008_universal_views - Cross-format views NEW
- 009_lidc_specific_views - Medical analysis views NEW
- 010_lidc_3d_contour_views - Spatial visualization NEW
- 011_export_views - CSV-ready materialized views NEW
- 012_public_access_policies - RLS for anonymous read NEW
- 013_keyword_semantics - Canonical keywords + citations NEW
- 014_keyword_navigation_views - Keyword discovery NEW
# Apply all migrations (run in order)
# Zero-padded file names sort in migration order, so a plain glob is enough
for f in migrations/*.sql; do
    psql "$SUPABASE_DB_URL" -f "$f"
done
# Refresh all export views
psql "$SUPABASE_DB_URL" -c "SELECT * FROM refresh_all_export_views();"
# Backfill canonical keyword links
psql "$SUPABASE_DB_URL" -c "SELECT * FROM backfill_canonical_keyword_ids();"
# Check database statistics
psql "$SUPABASE_DB_URL" -c "SELECT * FROM public_database_statistics;"
# Query keyword directory
import os
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine(os.getenv("SUPABASE_DB_URL"))
# Get all keywords in a category
query = """
SELECT * FROM keyword_directory
WHERE subject_category = 'Radiologist Perceptive and Diagnostic Concepts'
ORDER BY total_occurrences DESC
"""
keywords = pd.read_sql(query, engine)
# Get canonical keywords for a specific file
query = "SELECT * FROM get_file_canonical_keywords(%s)"
file_keywords = pd.read_sql(query, engine, params=[file_id])
# Search by topic tag
query = "SELECT * FROM search_keywords_by_tag('LIDC')"
lidc_keywords = pd.read_sql(query, engine)
# Get where a keyword is used
query = "SELECT * FROM get_canonical_keyword_occurrences('malignancy')"
occurrences = pd.read_sql(query, engine)
- Keywords Tab: Browse canonical keywords, filter by category/tag
- Keyword Detail Modal: Click any keyword → see all files/segments/cases
- Clickable Keyword Chips: Throughout the dashboard for easy navigation
- TCIA Integration: Direct links to study pages and DICOM downloads
- 3D Visualization: In-browser nodule rendering with Plotly
- Case Assignment Interface: Manual review queue for confidence <0.8
**Complete Documentation**: Analysis and Export System Guide
src/maps/
├── parser.py # Core XML parsing engine
├── api/ # FastAPI REST backend
│ ├── app.py # Application factory
│ ├── routers/ # API endpoints
│ └── models.py # Pydantic models
├── schemas/ # Canonical data schemas
├── adapters/ # External integrations (pylidc)
├── keyword_*.py # Keyword extraction system
├── auto_analysis.py # Automatic entity extraction
└── profile_manager.py # Profile system
Files (XML/JSON/PDF/ZIP) → Parser Engine → Canonical Schema → Export/Database
↓ ↓ ↓ ↓
Format Detection Validation Normalization Excel/SQLite/Supabase
- Language: Python 3.8+
- API Framework: FastAPI
- Data Processing: Pandas, Pydantic v2
- Excel Operations: OpenPyXL
- Database: SQLite3, Supabase (PostgreSQL)
- XML Processing: lxml, ElementTree
- PDF Processing: pdfplumber
- parse_radiology_sample(): Main XML parsing function
- detect_parse_case(): Intelligent XML structure detection
- parse_multiple(): Batch processing with memory optimization
- Multi-format support (LIDC, custom formats)
- FastAPI application with modular routers
- Endpoints for parsing, export, keywords, profiles, analysis
- Pydantic request/response models
- CanonicalDocument: Schema-agnostic data normalization
- Profile: Configurable parsing profiles
- Field validation and transformation rules
- XMLKeywordExtractor: Medical term extraction from XML
- PDFKeywordExtractor: Keyword extraction from PDFs
- KeywordNormalizer: Synonym handling and normalization
- KeywordSearchEngine: Full-text search capabilities
- PyLIDCAdapter: LIDC-IDRI dataset integration
- NYT Format: Standard radiology XML with ResponseHeader structure
- LIDC Format: Lung Image Database Consortium XML structure
- Custom Formats: Extensible parsing for new XML schemas
- Automatic Detection: Intelligent format recognition
Complete_Attributes - Full radiologist data (confidence, subtlety, obscuration, reason)
With_Reason_Partial - Includes reason field with partial attributes
Core_Attributes_Only - Essential attributes without reason
Minimal_Attributes - Limited attribute set
No_Characteristics - Structure without characteristic data
LIDC_Single_Session - Single LIDC reading session
LIDC_Multi_Session_X - Multiple LIDC sessions (2-4 radiologists)
No_Sessions_Found - XML without readable sessions
XML_Parse_Error - Malformed or unparseable XML
Detection_Error - Structure analysis failure
- Radiologist Information: ID, session type, reading timestamps
- Nodule Characteristics: Confidence, subtlety, obscuration, diagnostic reason
- Coordinate Data: X, Y, Z coordinates with edge mapping
- Medical Metadata: StudyInstanceUID, SeriesInstanceUID, SOP_UID, modality
- Session Classification: Standard vs. Detailed coordinate sessions
- Individual XML file parsing
- Immediate feedback on parse results
- Error handling and reporting
- Data preview capabilities
- Recursive XML file discovery
- Batch processing with progress tracking
- Per-folder statistics and reporting
- Error isolation (continue on failure)
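Recursive discovery with per-folder statistics could look roughly like this. It is a standalone sketch using only the standard library; the function names are hypothetical and are not part of the MAPS API.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, List


def discover_xml_files(root: str) -> List[Path]:
    """Recursively collect .xml files under root, sorted for stable ordering."""
    return sorted(p for p in Path(root).rglob("*.xml") if p.is_file())


def count_per_folder(files: List[Path]) -> Dict[str, int]:
    """Count discovered files by their parent folder."""
    return dict(Counter(str(p.parent) for p in files))
```

Sorting the result keeps batch runs reproducible, which makes per-folder reports easier to compare between runs.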
- Combined Output: Single Excel file with multiple sheets
- Folder Organization: Separate sheet per source folder
- Template Format: Radiologist 1-4 repeating column structure
- Single Database: Combined SQLite database for all folders
- Progress Tracking: Real-time processing updates with live logging
- Parse case sheets: Separate sheets by XML structure type
- Session separation: Detailed vs. Standard coordinate sessions
- Color coding: Parse case-based row highlighting
- Missing value highlighting: Orange highlighting for MISSING values
- Auto-formatting: Column width adjustment and alignment
- Radiologist Columns: Repeating "Radiologist 1", "Radiologist 2", "Radiologist 3", "Radiologist 4"
- Compact Ratings: Format like "Conf:5 | Sub:3 | Obs:2 | Reason:1"
- Color Coordination: Each radiologist column gets unique color scheme
- Comprehensive Headers: FileID, NoduleID, ParseCase, SessionType, coordinates, metadata
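As an illustration of the compact rating format, here is a small formatter that produces strings like the example above. The input key names and the use of `MISSING` for absent values are assumptions made for this sketch, not confirmed details of the exporter.

```python
from typing import Mapping, Optional

# Label order follows the example string above; the input keys are assumed names.
RATING_LABELS = (
    ("confidence", "Conf"),
    ("subtlety", "Sub"),
    ("obscuration", "Obs"),
    ("reason", "Reason"),
)


def format_compact_rating(rating: Mapping[str, Optional[int]]) -> str:
    """Render one radiologist's ratings as a single compact cell value."""
    parts = []
    for key, label in RATING_LABELS:
        value = rating.get(key)
        # Absent values are shown as MISSING (an assumption for this sketch).
        parts.append(f"{label}:{'MISSING' if value is None else value}")
    return " | ".join(parts)
```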
- Combined Sheet: "All Combined" with data from all folders
- Individual Sheets: One sheet per source folder
- Consistent Formatting: Template format across all sheets
- Navigation: Easy switching between folder views
-- Core tables for relational data organization
sessions - Individual radiologist reading sessions
nodules - Unique nodule instances with metadata
radiologists - Radiologist information and statistics
files - Source file tracking and metadata
batches - Processing batch management
quality_issues - Data quality problem tracking
- Radiologist Agreement: Inter-rater reliability calculations
- Data Quality Metrics: Completeness, consistency analysis
- Performance Statistics: Processing time and success rates
- Batch Tracking: Historical processing information
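One of the simplest agreement measures is pairwise percent agreement: the share of rater pairs that gave the same rating to the same nodule. The sketch below is only an illustration of that measure; it is not the method MAPS uses, and it is not a chance-corrected statistic such as kappa or ICC. The function name and input layout are assumptions.

```python
from itertools import combinations
from typing import Dict, List


def pairwise_percent_agreement(ratings: Dict[str, List[int]]) -> float:
    """ratings maps a radiologist id to a list of ratings, one per nodule, in the same order."""
    raters = list(ratings)
    if len(raters) < 2:
        raise ValueError("need at least two raters")
    n_items = len(ratings[raters[0]])
    if n_items == 0 or any(len(ratings[r]) != n_items for r in raters):
        raise ValueError("all raters must rate the same non-empty set of nodules")

    agree = total = 0
    for a, b in combinations(raters, 2):
        for x, y in zip(ratings[a], ratings[b]):
            agree += x == y  # True counts as 1
            total += 1
    return agree / total
```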
- SQL query interface for custom analysis
- Predefined analytical views
- Export capabilities to Excel with formatting
- Integration with external analysis tools
- Missing Value Detection: Identification of MISSING vs #N/A vs empty values
- Data Completeness Analysis: Per-column and overall completeness statistics
- Type Validation: Ensuring numeric fields contain valid numbers
- Structure Validation: XML schema compliance checking
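The distinction between MISSING, #N/A, and empty values, plus a basic numeric check, might be sketched like this. The category names returned by `classify_value` are invented for the example, and the interpretation of each marker is an assumption.

```python
import math
from typing import Any


def classify_value(value: Any) -> str:
    """Classify a cell as 'empty', 'missing', 'not_applicable', or 'present'."""
    if value is None:
        return "empty"
    text = str(value).strip()
    if text == "":
        return "empty"
    if text == "MISSING":
        return "missing"
    if text == "#N/A":
        return "not_applicable"
    return "present"


def is_valid_number(value: Any) -> bool:
    """True if the value parses as a finite number."""
    try:
        number = float(str(value).strip())
    except ValueError:
        return False
    return math.isfinite(number)
```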
- Quality Warnings: User prompts for data quality issues
- Continue/Cancel Options: User choice on problematic data
- Detailed Reporting: Comprehensive quality statistics
- Column Hiding: Auto-hide columns with >85% missing values
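The column-hiding rule can be sketched as a missing-fraction check per column. This version works on plain dictionaries so it runs without pandas; the set of markers counted as missing and the function name are assumptions, while the 85% threshold comes from the list above.

```python
from typing import Dict, List, Sequence

MISSING_MARKERS = {"", "MISSING", "#N/A"}  # assumed set of markers
HIDE_THRESHOLD = 0.85  # hide columns with more than 85% missing values


def columns_to_hide(
    rows: Sequence[Dict[str, object]], threshold: float = HIDE_THRESHOLD
) -> List[str]:
    """Return the names of columns whose missing fraction exceeds the threshold."""
    if not rows:
        return []
    columns = sorted({key for row in rows for key in row})
    hidden = []
    for column in columns:
        missing = sum(
            1
            for row in rows
            if row.get(column) is None or str(row.get(column)).strip() in MISSING_MARKERS
        )
        if missing / len(rows) > threshold:
            hidden.append(column)
    return hidden
```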
- Graceful Degradation: Continue processing on individual file failures
- Error Logging: Detailed error tracking with timestamps
- User Feedback: Clear error messages and resolution suggestions
- Recovery Options: Partial processing results preservation
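Continuing past individual file failures while keeping partial results could be structured as below. This is a generic sketch: `process_with_isolation` and the logger name are hypothetical, and `parse` stands in for whichever parsing function is being called.

```python
import logging
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger("maps.batch")  # hypothetical logger name


def process_with_isolation(
    paths: Iterable[str], parse: Callable[[str], object]
) -> Tuple[Dict[str, object], List[dict]]:
    """Parse each file; record failures with a timestamp instead of aborting."""
    results: Dict[str, object] = {}
    errors: List[dict] = []
    for path in paths:
        try:
            results[path] = parse(path)
        except Exception as exc:  # isolate the failure and keep going
            errors.append(
                {
                    "file": path,
                    "error": f"{type(exc).__name__}: {exc}",
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                }
            )
            logger.warning("Failed to parse %s: %s", path, exc)
    return results, errors
```

Returning the successful results alongside the error list preserves partial output, so one malformed file does not discard the rest of a batch.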
- Batch Processing: Process files in configurable batches
- Garbage Collection: Explicit memory cleanup
- Data Streaming: Minimize memory footprint for large datasets
- Efficient Data Structures: Optimized data organization
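Configurable batching with explicit garbage collection can be sketched in a few lines. The helper names here are hypothetical; the point is only to show the pattern of handling one slice at a time and releasing memory between slices.

```python
import gc
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(
    items: Sequence[T], batch_size: int, handle_batch: Callable[[Sequence[T]], None]
) -> int:
    """Run handle_batch on each slice and collect garbage after each one."""
    processed = 0
    for batch in iter_batches(items, batch_size):
        handle_batch(batch)
        processed += len(batch)
        gc.collect()  # explicit cleanup between batches
    return processed
```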
- Smart Sampling: Intelligent sampling for column width calculation
- Vectorized Operations: Pandas optimization for data manipulation
- Batch Database Operations: Efficient SQLite bulk insertions
- Parallel Processing Ready: Architecture supports future parallelization
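For bulk SQLite insertion, the standard library's `executemany` inside a single transaction is usually much faster than committing row by row. The sketch below assumes a simplified `nodules` table; the column names are invented for the example and do not reflect the actual MAPS schema.

```python
import sqlite3
from typing import Iterable, Tuple


def bulk_insert_nodules(
    conn: sqlite3.Connection, rows: Iterable[Tuple[str, str, int]]
) -> int:
    """Insert many (file_id, nodule_id, confidence) rows in one transaction."""
    # Hypothetical, simplified table layout for illustration.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nodules "
        "(file_id TEXT, nodule_id TEXT, confidence INTEGER)"
    )
    with conn:  # commits on success, rolls back on error
        cursor = conn.executemany(
            "INSERT INTO nodules (file_id, nodule_id, confidence) VALUES (?, ?, ?)",
            rows,
        )
    return cursor.rowcount
```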
- Core XML Parsing Engine - Multi-format XML processing
- FastAPI REST Backend - Complete API with 8 router modules
- Excel Export System - Multiple export formats with rich formatting
- SQLite Database Integration - Relational database with analytics
- Supabase Integration - Cloud PostgreSQL with full-text search
- Schema-Agnostic Design - Canonical document normalization
- Keyword Extraction - Automatic medical term extraction
- Profile System - Configurable parsing profiles
- PYLIDC Adapter - LIDC-IDRI dataset integration
- Statistical Analysis: Advanced inter-rater reliability metrics
- Machine Learning Integration: Anomaly detection in radiologist readings
- Predictive Modeling: Quality prediction based on XML structure
- Batch Comparison: Compare processing results across batches
- Parallel Processing: Multi-core processing for large batches
- Cloud Integration: AWS/Azure processing capabilities
- API Development: REST API for automated processing
- Docker Containerization: Deployment and scaling support
- DICOM Integration: Support for DICOM file processing
- Web Interface: Browser-based processing interface
- Real-time Monitoring: Live processing dashboards
- Integration APIs: Connect with hospital information systems
- Natural Language Processing: Extract insights from reason text
- Computer Vision: Image coordinate validation
- Automated Quality Assessment: AI-powered data quality scoring
- Predictive Analytics: Forecast processing outcomes
- Python: 3.8 or higher
- RAM: Minimum 4GB, Recommended 8GB+
- Storage: 1GB+ free space for databases and exports
- OS: Windows 10+, macOS 10.14+, Linux (Ubuntu 18.04+)
- Small Dataset (<1,000 files): ~2-5 minutes
- Medium Dataset (1,000-10,000 files): ~10-30 minutes
- Large Dataset (10,000+ files): ~30+ minutes
- Memory Usage: ~100-500MB typical, scales with dataset size
# Core Dependencies
pandas>=1.3.0
openpyxl>=3.0.7
lxml>=4.6.0
pydantic>=2.0.0
# API Framework
fastapi>=0.100.0
uvicorn[standard]>=0.23.0
# PDF Processing
pdfplumber>=0.9.0
# Optional
pylidc>=0.2.0 # For LIDC-IDRI integration
# Clone the repository
git clone <repository-url>
cd MAPS-core
# Install dependencies
pip install -r requirements.txt
# Run the API server
python scripts/run_server.py
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install development dependencies
pip install -r requirements.txt
# Run tests
pytest -v
# Start API server
uvicorn src.maps.api.app:create_app --factory --reload
from maps import parse_radiology_sample, export_excel
# Parse single file
main_df, unblinded_df = parse_radiology_sample("sample.xml")
# Export to Excel
export_excel(main_df, "output.xlsx")
# Start server
uvicorn src.maps.api.app:create_app --factory
# Parse file via API
curl -X POST "http://localhost:8000/parse/file" \
-F "file=@sample.xml"
# Health check
curl http://localhost:8000/health
from maps import parse_multiple, export_excel
import pandas as pd
# Parse multiple files
files = ["file1.xml", "file2.xml", "file3.xml"]
main_dfs, unblinded_dfs = parse_multiple(files)
# Combine and export
combined = pd.concat(main_dfs.values(), ignore_index=True)
export_excel(combined, "batch_results.xlsx")
- Single-threaded Processing: No parallel processing yet
- Memory Usage: Large datasets can consume significant RAM
- XML Format Support: Limited to known formats (extensible)
- Error Recovery: Some XML errors cannot be automatically resolved
- Very Large Files: Files >100MB may process slowly
- Special Characters: Some Unicode characters in XML may cause issues
- Network Drives: Processing from network locations may be slower
- macOS Permissions: May require permissions for file access
- Large Datasets: Process in smaller batches
- Memory Issues: Close other applications during processing
- File Errors: Check XML validity before processing
- Performance: Use local storage for better performance
- Follow PEP 8 Python style guidelines
- Add docstrings to all functions and classes
- Include type hints where appropriate
- Write tests for new features
- Update documentation for changes
- src/maps/parser.py: Core parsing functionality
- src/maps/api/: FastAPI routers and models
- src/maps/schemas/: Pydantic data models
- tests/: Test suite
- Test with various XML formats
- Verify export formats work correctly
- Check error handling with malformed data
- Performance test with large datasets
**Testing Documentation:**
- Testing Guide - Comprehensive testing documentation
- Quick Reference - Quick commands and tips
Run Tests:
# Web tests
cd web/ && npm test
# Python tests
pytest -v
# Coverage reports
npm run test:coverage # web
pytest --cov=src --cov-report=html # python
CI/CD: Tests run automatically on push/PR via GitHub Actions (.github/workflows/test.yml)
MAPS is proprietary software with dual licensing:
- Free for academic research and education
- Must cite in publications
- No commercial use permitted
- See LICENSE for full terms
- Required for any for-profit use
- Includes support and updates
- Custom pricing based on use case
- Contact for commercial licensing
Copyright (c) 2025 Isa Lucia Schlichting. All Rights Reserved.
If you use MAPS in academic research, please cite:
@software{schlichting2025maps,
author = {Schlichting, Isa Lucia},
title = {MAPS: Medical Annotation Processing System},
year = {2025},
publisher = {GitHub},
url = {https://github.com/luvisaisa/MAPS}
}
For commercial licensing, enterprise support, or questions:
- 📧 Email: isa.lucia.sch@outlook.com
- 📄 Details: COMMERCIAL_LICENSE.md
- 💻 Repository: https://github.com/luvisaisa/MAPS
- Repository: NYTXMLPARSE (GitHub)
- Author: luvisaisa
- Created: 2025
- Language: Python
For issues, questions, or contributions:
- Create an issue in the GitHub repository
- Review existing documentation
- Check known issues section
- Contact development team
Last Updated: August 12, 2025
Version: 2.0
Status: Active Development