Intelligent biomedical data preparation — search, download, clean, standardize from any source.
DataMiner is an AI-powered data mining agent. Users describe what data they need in natural language, and the Agent finds it, downloads it, cleans it, standardizes it, and delivers ready-to-analyze files.
- Natural Language Interface — Describe what you need, get processed data
- Multi-Source Coverage — GEO (27K+), recount3/TCGA/GTEx (18.5K), cBioPortal (450+), DataHub (22K+)
- Unified Dataset Index — 43,500+ datasets searchable by keyword, organism, disease, tissue
- Online Search Fallback — Auto-supplements local results with live GEO API + web search
- Multi-Format Export — CSV, TSV, RDS, h5ad
- Code Sandbox — Execute Python/R/Bash in isolated environments (Docker or subprocess)
- 531 GPL Platform Annotations — Self-evolving microarray probe-to-gene mappings
User Chat → FastAPI SSE → MinerAgent (ReAct dual-loop)
                              ↓ tool dispatch (registry pattern)
        ┌─────────────────────┼─────────────────────┐
 search_datasets        execute_code            ask_user
 (host: PG query        (sandbox:               (pause for
 + online fallback)     Python/R/Bash)          user input)
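The "registry pattern" used for tool dispatch can be sketched roughly as follows. This is a minimal illustration, not DataMiner's actual code: `TOOL_REGISTRY`, `register_tool`, and `dispatch` are hypothetical names, and the tool bodies are placeholders that only echo their input.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

ToolFn = Callable[..., Awaitable[Any]]

# Hypothetical registry: maps a tool name to its async handler.
TOOL_REGISTRY: Dict[str, ToolFn] = {}


def register_tool(name: str) -> Callable[[ToolFn], ToolFn]:
    """Decorator that stores a handler in the registry under `name`."""
    def decorator(fn: ToolFn) -> ToolFn:
        TOOL_REGISTRY[name] = fn
        return fn
    return decorator


@register_tool("search_datasets")
async def search_datasets(query: str) -> dict:
    # Placeholder: a real tool would query PostgreSQL, then fall back to online search.
    return {"tool": "search_datasets", "query": query, "results": []}


@register_tool("ask_user")
async def ask_user(question: str) -> dict:
    # Placeholder: a real tool would pause the agent loop until the user replies.
    return {"tool": "ask_user", "question": question, "status": "waiting"}


async def dispatch(name: str, **kwargs: Any) -> Any:
    """Look up a tool by name and await it; unknown names return an error dict."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    return await handler(**kwargs)


if __name__ == "__main__":
    print(asyncio.run(dispatch("search_datasets", query="breast cancer")))
```

With this shape, adding a tool means writing one decorated function; the agent loop itself does not need to change.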
- Backend: FastAPI + PostgreSQL 16 (pgvector) + asyncpg
- Frontend: React 18 + TypeScript + Zustand + Tailwind CSS + SSE streaming
- LLM: DeepSeek V3 (OpenAI-compatible API)
- Web Search: ZhipuAI search_pro
- Knowledge: Hot-reloadable `.md` files in `agent/knowledge/`
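One common way to make Markdown knowledge files hot-reloadable is to cache each file and re-read it only when its modification time changes. The sketch below assumes that approach; `KnowledgeStore` is a hypothetical class, and the project's real loader may work differently.

```python
from pathlib import Path
from typing import Dict, Tuple


class KnowledgeStore:
    """Caches Markdown files and re-reads any file whose mtime has changed."""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)
        # filename -> (mtime_ns at last read, file text)
        self._cache: Dict[str, Tuple[int, str]] = {}

    def load(self) -> Dict[str, str]:
        """Return {filename: text}, refreshing new or modified files."""
        seen = set()
        for path in sorted(self.directory.glob("*.md")):
            seen.add(path.name)
            mtime = path.stat().st_mtime_ns
            cached = self._cache.get(path.name)
            if cached is None or cached[0] != mtime:
                self._cache[path.name] = (mtime, path.read_text(encoding="utf-8"))
        # Drop cache entries for files that no longer exist.
        for name in list(self._cache):
            if name not in seen:
                del self._cache[name]
        return {name: text for name, (_, text) in self._cache.items()}
```

Calling `load()` before each prompt is built would pick up edits without restarting the backend, at the cost of one `stat` call per file.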
# 1. Start PostgreSQL
docker compose up db -d
# 2. Backend
cd backend
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp ../.env.example .env # then edit .env with your API keys
uvicorn app.main:app --reload --port 8893
# 3. Frontend (separate terminal)
cd frontend
npm install
npx vite --port 5176
Open http://localhost:5176 and register/login.
# Build sandbox image first (takes 20-30 min for R packages)
docker build -t dataminer-sandbox -f Dockerfile.sandbox .
# Start everything
docker compose up -d
Copy `.env.example` to `backend/.env` and fill in:
| Variable | Required | Description |
|---|---|---|
| `LLM_API_KEY` | Yes | DeepSeek API key |
| `LLM_BASE_URL` | Yes | `https://api.deepseek.com/v1` |
| `LLM_MODEL` | Yes | `deepseek-chat` |
| `GLM_API_KEY` | Yes | ZhipuAI API key (for web search) |
| `JWT_SECRET` | Yes | `python -c "import secrets; print(secrets.token_hex(32))"` |
| `POSTGRES_PASSWORD` | Yes | Database password |
| `DATABASE_URL` | Yes | PostgreSQL connection string |
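Since every variable above is required, a startup check that fails fast on missing values can save debugging time. The following is a minimal sketch under that assumption; `missing_vars` and `load_settings` are hypothetical helpers, and the project's own `config.py` may handle this differently.

```python
import os
from typing import Dict, List, Mapping

# Variable names taken from the table above.
REQUIRED_VARS: List[str] = [
    "LLM_API_KEY",
    "LLM_BASE_URL",
    "LLM_MODEL",
    "GLM_API_KEY",
    "JWT_SECRET",
    "POSTGRES_PASSWORD",
    "DATABASE_URL",
]


def missing_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required names that are unset or blank in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise if any are missing."""
    missing = missing_vars(env)
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```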
backend/app/
├── agent/ # MinerAgent (dual-loop ReAct) + prompts + tools + knowledge
├── executor/ # Subprocess (dev) + Docker (prod) code executors
│ └── libs/ # Sandbox libraries (acquire, geo, xena, cbio, gdc, process, export)
├── memory/ # 3-layer conversation memory (short/mid/long-term)
├── services/ # Auth, sessions, file management, dataset index, online search
├── main.py # FastAPI entry point
└── config.py # Environment configuration
frontend/src/
├── components/ # React components (chat, sidebar, data panel, dataset cards)
├── services/ # API client + SSE streaming
├── store.ts # Zustand state management
├── hooks/ # Custom React hooks
├── i18n/ # Chinese + English translations
└── types/ # TypeScript type definitions
scripts/
├── migrate_v2.py # Database migration
├── import_recount3.py # recount3 dataset import (18,579 records)
├── import_cbioportal.py # cBioPortal import (451 records)
├── import_geo_basic.py # GEO metadata import (27,525 records)
├── import_datahub.py # DataHub import (21,980+ records)
└── retry_geo_errors.py # GEO network error retry
GEO_Annotation/ # 531 GPL platforms + genome annotations
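To illustrate the idea behind a subprocess-based executor such as the one in `executor/`, here is a minimal sketch that runs Python code in a child process with a timeout and a throwaway working directory. `ExecResult` and `run_python` are hypothetical names, not the project's API. Note that a plain subprocess is not a security boundary, which is presumably why the layout reserves Docker for production.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass


@dataclass
class ExecResult:
    exit_code: int
    stdout: str
    stderr: str
    timed_out: bool = False


def run_python(code: str, timeout: float = 30.0) -> ExecResult:
    """Run `code` in a child Python process inside a temporary directory."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                cwd=workdir,          # files the code writes are discarded afterwards
                capture_output=True,  # collect stdout and stderr
                text=True,            # decode output as str
                timeout=timeout,      # the child is killed once this expires
            )
        except subprocess.TimeoutExpired:
            return ExecResult(-1, "", "execution timed out", timed_out=True)
    return ExecResult(proc.returncode, proc.stdout, proc.stderr)


if __name__ == "__main__":
    print(run_python("print(1 + 1)"))
```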
| Source | Records | Data Type |
|---|---|---|
| GEO RNA-seq | 27,525 | Microarray & RNA-seq counts |
| recount3 | 18,579 | Uniformly processed RNA-seq (TCGA/GTEx/SRA) |
| DataHub | 21,980+ | Multi-omics metadata |
| cBioPortal | 451 | Cancer genomics studies |
MIT