SolvingLab/DataMiner

DataMiner

Intelligent biomedical data preparation: search, download, clean, and standardize data from multiple public sources.

DataMiner is an AI-powered data mining agent. Users describe the data they need in natural language; the agent finds, downloads, cleans, and standardizes it, then delivers ready-to-analyze files.

Features

  • Natural Language Interface — Describe what you need, get processed data
  • Multi-Source Coverage — GEO (27K+), recount3/TCGA/GTEx (18.5K), cBioPortal (450+), DataHub (22K+)
  • Unified Dataset Index — 43,500+ datasets searchable by keyword, organism, disease, tissue
  • Online Search Fallback — Auto-supplements local results with live GEO API + web search
  • Multi-Format Export — CSV, TSV, RDS, h5ad
  • Code Sandbox — Execute Python/R/Bash in isolated environments (Docker or subprocess)
  • 531 GPL Platform Annotations — Self-evolving microarray probe-to-gene mappings
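The Online Search Fallback feature can be pictured with a minimal sketch like the one below. It is not DataMiner's implementation: the `Dataset` type, the `search_with_fallback` function, and the `min_results` threshold are hypothetical names chosen for illustration, and the two search callables stand in for the local index query and the live lookup.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Dataset:
    """Hypothetical record shape; the real index schema is not shown in this README."""
    accession: str
    title: str
    source: str


SearchFn = Callable[[str], Iterable[Dataset]]


def search_with_fallback(
    query: str,
    local_search: SearchFn,
    online_search: SearchFn,
    min_results: int = 5,  # assumed threshold, not taken from the project
) -> List[Dataset]:
    """Query the local index first; go online only when local hits are scarce."""
    results = list(local_search(query))
    if len(results) >= min_results:
        return results

    # Supplement with online hits, skipping accessions already found locally.
    seen = {d.accession for d in results}
    for d in online_search(query):
        if d.accession not in seen:
            seen.add(d.accession)
            results.append(d)
    return results
```

Local hits keep their position at the front of the list, so the supplement never displaces an indexed dataset.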

Architecture

User Chat → FastAPI SSE → MinerAgent (ReAct dual-loop)
                              ↓ tool dispatch (registry pattern)
                 ┌─────────────┼──────────────┐
           search_datasets   execute_code    ask_user
           (host: PG query    (sandbox:       (pause for
            + online fallback) Python/R/Bash)  user input)

  • Backend: FastAPI + PostgreSQL 16 (pgvector) + asyncpg
  • Frontend: React 18 + TypeScript + Zustand + Tailwind CSS + SSE streaming
  • LLM: DeepSeek V3 (OpenAI-compatible API)
  • Web Search: ZhipuAI search_pro
  • Knowledge: Hot-reloadable .md files in agent/knowledge/
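The "tool dispatch (registry pattern)" step in the diagram could look roughly like the sketch below. Only the tool names `search_datasets` and `ask_user` come from the diagram; the `TOOLS` dict, the `register_tool` decorator, the `dispatch` function, and the stub return values are assumptions for illustration, not the project's actual code.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

ToolFn = Callable[..., Awaitable[Any]]

# Hypothetical registry: maps a tool name to its async handler.
TOOLS: Dict[str, ToolFn] = {}


def register_tool(name: str) -> Callable[[ToolFn], ToolFn]:
    """Decorator that adds a handler to the registry under `name`."""
    def decorator(fn: ToolFn) -> ToolFn:
        if name in TOOLS:
            raise ValueError(f"tool already registered: {name}")
        TOOLS[name] = fn
        return fn
    return decorator


@register_tool("search_datasets")
async def search_datasets(query: str) -> dict:
    # Stub: a real handler would query PostgreSQL and fall back to online search.
    return {"tool": "search_datasets", "query": query, "results": []}


@register_tool("ask_user")
async def ask_user(question: str) -> dict:
    # Stub: a real handler would pause the loop until the user replies.
    return {"tool": "ask_user", "question": question}


async def dispatch(name: str, arguments: dict) -> Any:
    """Look up the tool the model asked for and call it with its arguments."""
    fn = TOOLS.get(name)
    if fn is None:
        # Return the error as data so the agent loop can show it to the model.
        return {"error": f"unknown tool: {name}"}
    return await fn(**arguments)
```

With a registry, adding a tool means writing one decorated function; the dispatch code itself does not change.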

Quick Start

Development (subprocess mode)

# 1. Start PostgreSQL
docker compose up db -d

# 2. Backend
cd backend
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp ../.env.example .env  # then edit .env with your API keys
uvicorn app.main:app --reload --port 8893

# 3. Frontend (separate terminal)
cd frontend
npm install
npx vite --port 5176

Open http://localhost:5176, then register or log in.

Production (Docker)

# Build sandbox image first (takes 20-30 min for R packages)
docker build -t dataminer-sandbox -f Dockerfile.sandbox .

# Start everything
docker compose up -d

Configuration

Copy .env.example to backend/.env and fill in:

| Variable | Required | Description |
| --- | --- | --- |
| LLM_API_KEY | Yes | DeepSeek API key |
| LLM_BASE_URL | Yes | https://api.deepseek.com/v1 |
| LLM_MODEL | Yes | deepseek-chat |
| GLM_API_KEY | Yes | ZhipuAI API key (for web search) |
| JWT_SECRET | Yes | python -c "import secrets; print(secrets.token_hex(32))" |
| POSTGRES_PASSWORD | Yes | Database password |
| DATABASE_URL | Yes | PostgreSQL connection string |
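Since every variable above is required, a quick pre-flight check can catch an incomplete .env before the backend starts. The sketch below is an illustration only: the variable names come from the table, but `REQUIRED_VARS` and `missing_settings` are hypothetical and do not describe how `config.py` actually loads its settings.

```python
import os
from typing import List, Mapping

# Names copied from the configuration table above.
REQUIRED_VARS = (
    "LLM_API_KEY",
    "LLM_BASE_URL",
    "LLM_MODEL",
    "GLM_API_KEY",
    "JWT_SECRET",
    "POSTGRES_PASSWORD",
    "DATABASE_URL",
)


def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or blank, in table order."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        raise SystemExit("Missing required settings: " + ", ".join(missing))
```

Note that this reads the process environment, so the values in backend/.env must already be exported or loaded by whatever starts the process.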

Project Structure

backend/app/
├── agent/          # MinerAgent (dual-loop ReAct) + prompts + tools + knowledge
├── executor/       # Subprocess (dev) + Docker (prod) code executors
│   └── libs/       # Sandbox libraries (acquire, geo, xena, cbio, gdc, process, export)
├── memory/         # 3-layer conversation memory (short/mid/long-term)
├── services/       # Auth, sessions, file management, dataset index, online search
├── main.py         # FastAPI entry point
└── config.py       # Environment configuration

frontend/src/
├── components/     # React components (chat, sidebar, data panel, dataset cards)
├── services/       # API client + SSE streaming
├── store.ts        # Zustand state management
├── hooks/          # Custom React hooks
├── i18n/           # Chinese + English translations
└── types/          # TypeScript type definitions

scripts/
├── migrate_v2.py           # Database migration
├── import_recount3.py      # recount3 dataset import (18,579 records)
├── import_cbioportal.py    # cBioPortal import (451 records)
├── import_geo_basic.py     # GEO metadata import (27,525 records)
├── import_datahub.py       # DataHub import (21,980+ records)
└── retry_geo_errors.py     # GEO network error retry

GEO_Annotation/             # 531 GPL platforms + genome annotations
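To give a feel for what the subprocess (dev) executor under executor/ might do, here is a minimal sketch that runs a Python snippet in a child process with a timeout and a throwaway working directory. `RunResult` and `run_python` are hypothetical names, and the sketch makes no claim about the real executor's interface. A plain subprocess is also not a security boundary, which is presumably why the Docker executor is the production option.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass


@dataclass
class RunResult:
    exit_code: int
    stdout: str
    stderr: str
    timed_out: bool = False


def run_python(code: str, timeout: float = 30.0) -> RunResult:
    """Run `code` in a child interpreter inside a temporary working directory."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                # -I: isolated mode (ignores PYTHON* env vars and the user site dir)
                [sys.executable, "-I", "-c", code],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising, so nothing is left running.
            return RunResult(exit_code=-1, stdout="", stderr="", timed_out=True)
        return RunResult(proc.returncode, proc.stdout, proc.stderr)
```

Files the snippet writes land in the temporary directory and are deleted on return; a real executor would need to copy out any results it wants to keep.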

Data Sources

| Source | Records | Data Type |
| --- | --- | --- |
| GEO RNA-seq | 27,525 | Microarray & RNA-seq counts |
| recount3 | 18,579 | Uniformly processed RNA-seq (TCGA/GTEx/SRA) |
| DataHub | 21,980+ | Multi-omics metadata |
| cBioPortal | 451 | Cancer genomics studies |

License

MIT
