DataMiner

Intelligent biomedical data preparation: search, download, clean, and standardize data from multiple public sources.

DataMiner is an AI-powered data mining agent. Users describe what data they need in natural language, and the Agent finds it, downloads it, cleans it, standardizes it, and delivers ready-to-analyze files.

Features

Natural Language Interface — Describe what you need, get processed data
Multi-Source Coverage — GEO (27K+), recount3/TCGA/GTEx (18.5K), cBioPortal (450+), DataHub (22K+)
Unified Dataset Index — 43,500+ datasets searchable by keyword, organism, disease, tissue
Online Search Fallback — Auto-supplements local results with live GEO API + web search
Multi-Format Export — CSV, TSV, RDS, h5ad
Code Sandbox — Execute Python/R/Bash in isolated environments (Docker or subprocess)
531 GPL Platform Annotations — Self-evolving microarray probe-to-gene mappings
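
The Online Search Fallback feature can be pictured as a two-stage lookup: query the local index first, then top up with online hits when the local results are thin. The sketch below is only an illustration of that idea, not DataMiner's actual code. `search_with_fallback`, the `min_results` threshold, the two search callables, and the placeholder accessions are all hypothetical.

```python
from typing import Callable, Dict, List

Dataset = Dict[str, str]
SearchFn = Callable[[str], List[Dataset]]


def search_with_fallback(
    query: str,
    local_search: SearchFn,
    online_search: SearchFn,
    min_results: int = 5,  # hypothetical threshold, not taken from DataMiner
) -> List[Dataset]:
    """Return local hits, topped up with online hits when local ones are scarce."""
    results = list(local_search(query))
    if len(results) >= min_results:
        return results

    # Supplement with online hits, skipping accessions already found locally.
    seen = {r["accession"] for r in results}
    for hit in online_search(query):
        if hit["accession"] not in seen:
            seen.add(hit["accession"])
            results.append(hit)
    return results


# Usage with stand-in search functions (the accessions are placeholders).
local = lambda q: [{"accession": "GSE0001", "source": "local"}]
online = lambda q: [
    {"accession": "GSE0001", "source": "online"},
    {"accession": "GSE0002", "source": "online"},
]
hits = search_with_fallback("breast cancer", local, online)
print([h["accession"] for h in hits])
```

Deduplicating on the accession keeps the local record when both stages return the same dataset, which seems the natural choice if the local index carries richer metadata.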

Architecture

User Chat → FastAPI SSE → MinerAgent (ReAct dual-loop)
                              ↓ tool dispatch (registry pattern)
                 ┌─────────────┼──────────────┐
           search_datasets   execute_code    ask_user
           (host: PG query    (sandbox:       (pause for
            + online fallback) Python/R/Bash)  user input)
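
The "tool dispatch (registry pattern)" step in the diagram above could look roughly like the following. This is a minimal sketch under stated assumptions: `TOOL_REGISTRY`, `register_tool`, and `dispatch` are hypothetical names, the tool bodies are placeholders, and the real agent is presumably asynchronous.

```python
from typing import Any, Callable, Dict

ToolFn = Callable[..., Any]
TOOL_REGISTRY: Dict[str, ToolFn] = {}  # hypothetical name -> callable table


def register_tool(name: str) -> Callable[[ToolFn], ToolFn]:
    """Decorator that records a function under a tool name."""
    def decorator(fn: ToolFn) -> ToolFn:
        TOOL_REGISTRY[name] = fn
        return fn
    return decorator


@register_tool("search_datasets")
def search_datasets(keyword: str) -> dict:
    # Placeholder body: the real tool queries PostgreSQL with an online fallback.
    return {"tool": "search_datasets", "keyword": keyword}


@register_tool("ask_user")
def ask_user(question: str) -> dict:
    # Placeholder body: the real tool pauses the loop until the user replies.
    return {"tool": "ask_user", "question": question}


def dispatch(name: str, **kwargs: Any) -> Any:
    """Look up a tool by name and call it; unknown names raise KeyError."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return TOOL_REGISTRY[name](**kwargs)


result = dispatch("search_datasets", keyword="TP53")
print(result)
```

With a registry, adding a tool means registering one more function. The dispatch code itself does not change.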

Backend: FastAPI + PostgreSQL 16 (pgvector) + asyncpg
Frontend: React 18 + TypeScript + Zustand + Tailwind CSS + SSE streaming
LLM: DeepSeek V3 (OpenAI-compatible API)
Web Search: ZhipuAI search_pro
Knowledge: Hot-reloadable .md files in agent/knowledge/
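
One common way to make .md files hot-reloadable is to cache each file's text together with its modification time and re-read it when that time changes. The README does not say how DataMiner does it, so the mtime approach below is an assumption, and `KnowledgeCache` and the `geo.md` file name are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict, Tuple


class KnowledgeCache:
    """Caches .md file contents and re-reads a file when its mtime changes."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self._cache: Dict[str, Tuple[int, str]] = {}  # name -> (mtime_ns, text)

    def get(self, name: str) -> str:
        path = self.directory / name
        mtime = path.stat().st_mtime_ns
        cached = self._cache.get(name)
        if cached is None or cached[0] != mtime:
            # File is new or was edited since the last read: reload it.
            self._cache[name] = (mtime, path.read_text(encoding="utf-8"))
        return self._cache[name][1]


with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "geo.md"
    note.write_text("v1", encoding="utf-8")
    cache = KnowledgeCache(Path(tmp))
    first = cache.get("geo.md")

    note.write_text("v2", encoding="utf-8")
    # Bump the mtime explicitly so the demo does not depend on clock resolution.
    stamp = note.stat().st_mtime_ns + 1_000_000_000
    os.utime(note, ns=(stamp, stamp))
    second = cache.get("geo.md")

print(first, second)
```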

Quick Start

Development (subprocess mode)

# 1. Start PostgreSQL
docker compose up db -d

# 2. Backend
cd backend
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp ../.env.example .env  # then edit .env with your API keys
uvicorn app.main:app --reload --port 8893

# 3. Frontend (separate terminal)
cd frontend
npm install
npx vite --port 5176

Open http://localhost:5176, then register or log in.

Production (Docker)

# Build sandbox image first (takes 20-30 min for R packages)
docker build -t dataminer-sandbox -f Dockerfile.sandbox .

# Start everything
docker compose up -d

Configuration

Copy .env.example to backend/.env and fill in:

Variable	Required	Description
`LLM_API_KEY`	Yes	DeepSeek API key
`LLM_BASE_URL`	Yes	LLM API base URL, e.g. `https://api.deepseek.com/v1`
`LLM_MODEL`	Yes	Model name, e.g. `deepseek-chat`
`GLM_API_KEY`	Yes	ZhipuAI API key (for web search)
`JWT_SECRET`	Yes	Token-signing secret; generate with `python -c "import secrets; print(secrets.token_hex(32))"`
`POSTGRES_PASSWORD`	Yes	Database password
`DATABASE_URL`	Yes	PostgreSQL connection string
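
Since every variable in the table is required, a quick pre-flight check can catch a blank entry before the backend starts. The snippet below is a standalone sketch, not part of the repository. `missing_settings` is a hypothetical helper, and it reads the process environment directly rather than parsing backend/.env.

```python
import os
from typing import List, Mapping

# Names copied from the configuration table above.
REQUIRED_VARS = [
    "LLM_API_KEY",
    "LLM_BASE_URL",
    "LLM_MODEL",
    "GLM_API_KEY",
    "JWT_SECRET",
    "POSTGRES_PASSWORD",
    "DATABASE_URL",
]


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Check the current process environment.
missing = missing_settings(os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```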

Project Structure

backend/app/
├── agent/          # MinerAgent (dual-loop ReAct) + prompts + tools + knowledge
├── executor/       # Subprocess (dev) + Docker (prod) code executors
│   └── libs/       # Sandbox libraries (acquire, geo, xena, cbio, gdc, process, export)
├── memory/         # 3-layer conversation memory (short/mid/long-term)
├── services/       # Auth, sessions, file management, dataset index, online search
├── main.py         # FastAPI entry point
└── config.py       # Environment configuration

frontend/src/
├── components/     # React components (chat, sidebar, data panel, dataset cards)
├── services/       # API client + SSE streaming
├── store.ts        # Zustand state management
├── hooks/          # Custom React hooks
├── i18n/           # Chinese + English translations
└── types/          # TypeScript type definitions

scripts/
├── migrate_v2.py           # Database migration
├── import_recount3.py      # recount3 dataset import (18,579 records)
├── import_cbioportal.py    # cBioPortal import (451 records)
├── import_geo_basic.py     # GEO metadata import (27,525 records)
├── import_datahub.py       # DataHub import (21,980+ records)
└── retry_geo_errors.py     # GEO network error retry

GEO_Annotation/             # 531 GPL platforms + genome annotations
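
The executor/ directory in the tree above holds a subprocess executor for development. As a rough idea of what a subprocess-based executor involves, the sketch below runs a Python snippet in a child interpreter with a timeout. `ExecResult`, `run_python`, and the 10-second default are hypothetical and not taken from the repository.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ExecResult:
    exit_code: int
    stdout: str
    stderr: str


def run_python(code: str, timeout_s: float = 10.0) -> ExecResult:
    """Run a Python snippet in a child interpreter and capture its output.

    A child process gives a separate interpreter, not real isolation, which
    is presumably why the Docker executor is the one used in production.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return ExecResult(exit_code=-1, stdout="", stderr="timed out")
    return ExecResult(proc.returncode, proc.stdout, proc.stderr)


result = run_python("print(6 * 7)")
print(result.exit_code, result.stdout.strip())
```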

Data Sources

Source	Records	Data Type
GEO	27,525	Microarray & RNA-seq counts
recount3	18,579	Uniformly processed RNA-seq (TCGA/GTEx/SRA)
DataHub	21,980+	Multi-omics metadata
cBioPortal	451	Cancer genomics studies

License

MIT
