From ea814ad5674134342b3a1b57acd99348e6aada3a Mon Sep 17 00:00:00 2001
From: Jiraya <177346249+intjiraya@users.noreply.github.com>
Date: Mon, 25 May 2026 22:22:12 +0200
Subject: [PATCH 1/2] feat(search): full-text search, DSL, DNS-rebinding hardening, split rebuild
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Server
- New `/api/search?q=...&limit=N` backed by an inverted index built at
  rebuild time. Lazy suffix-array (OnceLock) gives O(log V·L + matches)
  substring lookup ("auth" matches "authentication" via suffix lookup).
- Split rebuild into two phases: metadata-only (projects + by_session_id
  published immediately), then a parallel background pass that parses
  every session body and populates the search index.
  `is_indexing_search` is exposed on /api/stats so the UI can render the
  in-between state.
- `parse_session_bodies_parallel` uses `std::thread::scope` with
  `available_parallelism()` workers (no new dependency).
- Search handler clones `Arc<SearchIndex>` under the RwLock, drops the
  guard, and runs in `spawn_blocking`. Concurrent reindex no longer waits
  for in-flight searches.
- `IndexedDoc` pre-computes the lowercased char vector so
  `build_snippets` no longer allocates two `Vec<char>` per call.
- `Session::indexable_text()` (was `search::extract_session_text`): text
  policy now lives with the parser domain model.

Security
- DNS-rebinding hardening: all `/api/*` routes require a loopback `Host`
  header; if `Origin` is present, it must also be loopback. WS routes
  keep their existing Origin check.
- Search query string is no longer recorded in tracing spans (only
  `term_count`) to avoid leaking secrets typed as queries.

Client
- DSL in the search bar: multi-term AND, quoted phrases, operators
  `project:`, `model:`, `has:tool|cache|model`, `before:YYYY-MM-DD`,
  `after:YYYY-MM-DD`. Lives in `static/search.mjs` as pure functions
  (`tokenize`, `parseQuery`, `matches`, `highlight`, `highlightSegments`,
  `segmentsFromRanges`, `escapeHtml`).
- `applySearchFilter` is now hybrid: operators filter locally, plain
  terms hit `/api/search`, and server snippet ranges drive highlighting
  directly (no second regex pass).
- A sequence counter prevents stale results from racing keystrokes.
- `ensureAllSessions` is a promise singleton; a `last_scan` change in
  /api/stats invalidates the cache reactively (no more stale data after
  a reindex).
- All `fetch()` calls have a 10 s `AbortController` timeout.
- Search query persists across reloads via `localStorage`.

Tests
- 19 new Rust unit tests in `search`, 3 in `parser`, 11 new integration
  tests in `http_integration` (DNS rebinding, search behaviour, clamp,
  quoted phrase, missing q).
- 64 JS unit tests via `node --test`, wrapped through a Rust integration
  test so `cargo test` runs the whole pipeline.

Bench
- `cargo run --release --example bench_search` reports rebuild + RSS +
  per-query p50/p99 for 100 / 1000 / 5000 synthetic sessions.

Docs
- README: new "search" section with DSL examples; updated "blazing fast"
  table with metadata-ready vs search-ready; new benchmark table.
- CHANGELOG: Unreleased section documents Added / Changed / Security /
  Performance.

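Reviewer note on the suffix lookup mentioned under "Server": the idea can
be sketched roughly as below. This is a simplified illustration, not the
code in `src/search.rs`; the names `build_suffixes` and
`tokens_containing` are hypothetical, and the sketch maps a query to token
ids only (the real index then resolves tokens to documents via postings).

```rust
/// Hypothetical helper: collect every suffix of every token, tagged with
/// the id (position) of the token it came from, then sort the pairs.
fn build_suffixes(tokens: &[&str]) -> Vec<(String, usize)> {
    let mut entries = Vec::new();
    for (id, tok) in tokens.iter().enumerate() {
        // `char_indices` yields byte offsets on char boundaries, so the
        // slice below is always valid UTF-8.
        for (start, _) in tok.char_indices() {
            entries.push((tok[start..].to_string(), id));
        }
    }
    entries.sort();
    entries
}

/// Hypothetical helper: ids of all tokens that contain `term` as a
/// substring. A token contains `term` exactly when one of its suffixes
/// starts with `term`, and such suffixes are contiguous in sorted order.
fn tokens_containing(entries: &[(String, usize)], term: &str) -> Vec<usize> {
    // Binary search for the first suffix that is >= term.
    let start = entries.partition_point(|(s, _)| s.as_str() < term);
    let mut ids: Vec<usize> = entries[start..]
        .iter()
        .take_while(|(s, _)| s.starts_with(term))
        .map(|(_, id)| *id)
        .collect();
    ids.sort_unstable();
    ids.dedup();
    ids
}
```

With the tokens `authentication`, `authorization`, `session` (ids 0, 1, 2),
looking up `auth` yields ids 0 and 1, and `ssi` yields id 2 because it is a
prefix of the suffix `ssion`. The binary search is where the logarithmic
term in the stated complexity comes from; the walk afterwards is
proportional to the number of matching suffixes.
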
Co-Authored-By: Claude Opus 4.7 (1M context) --- README.md | 61 +++++- examples/bench_search.rs | 168 +++++++++++++++ src/dto.rs | 3 + src/http.rs | 76 ++++++- src/index.rs | 120 ++++++++++- src/lib.rs | 1 + src/parser.rs | 60 ++++++ src/search.rs | 439 ++++++++++++++++++++++++++++++++++++++ static/app.js | 247 ++++++++++++++++++--- static/index.html | 8 +- static/search.mjs | 204 ++++++++++++++++++ static/styles.css | 40 ++++ tests/http_integration.rs | 180 ++++++++++++++++ tests/js_search.rs | 26 +++ tests/search.test.mjs | 415 +++++++++++++++++++++++++++++++++++ 15 files changed, 2003 insertions(+), 45 deletions(-) create mode 100644 examples/bench_search.rs create mode 100644 src/search.rs create mode 100644 static/search.mjs create mode 100644 tests/js_search.rs create mode 100644 tests/search.test.mjs diff --git a/README.md b/README.md index ea6fde2..45bc041 100644 --- a/README.md +++ b/README.md @@ -62,26 +62,65 @@ cchats --root /custom/path # override ~/.claude/projects | feature | what it does | | :-------------------------------- | :---------------------------------------------------------------------------------------- | | `every chat in one place` | reads every `~/.claude/projects//*.jsonl`, groups by project, sorts by recency | +| `full-text search across chats` | server-side inverted index + suffix array for substring lookup, snippet highlighting | +| `smart client search` | multi-term AND, operators (`project:`, `model:`, `has:`, `before:`, `after:`), quotes | | `live resume in the browser` | click, spawns `claude --resume ` inside a PTY, bridged through WebSocket to xterm.js | | `fork without scarring` | one-click `--fork-session` from any chat, original untouched | | `new chat from the rail` | start a fresh `claude` session in any indexed project's cwd | | `token accounting` | input / cache-create / cache-read / output buckets, per chat, per project, all-up | -| `single 3.2 MiB binary` | rust, no runtime, no node, no python, `rust-embed` ships 
every asset inside | -| `loopback-only, origin-checked` | binds 127.0.0.1, rejects non-loopback `Origin`, strict CSP, vendored CDN scripts | +| `single binary` | rust, no runtime, no node, no python, `rust-embed` ships every asset inside | +| `DNS-rebinding hardened` | binds 127.0.0.1, rejects non-loopback `Host` and `Origin`, strict CSP, vendored scripts | + +
+ +## search + +Type in the search bar. Supports a small DSL: + +``` +auth bug multi-term AND across title / content +"merge conflict" quoted phrase +project:web filter by project (matches display path) +model:opus has:tool model contains "opus" AND has tool calls +has:cache before:2026-04-01 has cached tokens, last activity before date +auth project:api after:2026-01-01 combine freely +``` + +Plain terms hit the server's inverted index and search **inside the message +bodies** — user text, assistant text, thinking blocks, tool inputs and tool +outputs. Operators are evaluated client-side against session metadata.
## blazing fast -| metric | value | -| :------------------ | -----------: | -| cold start | 5.6 ms | -| index ready (152) | 447 ms | -| RSS idle | 18.6 MiB | -| `/api/stats` p50 | 0.08 ms | -| `/api/projects` p50 | 0.14 ms | -| big session parse | 27 ms | -| reindex 234 MiB | 430 ms | +Split rebuild publishes projects/sessions metadata immediately and indexes +search bodies in a parallel background phase. + +| metric | value | +| :----------------------- | -----------: | +| cold start (server up) | 5.6 ms | +| metadata ready (152 ses) | ~3 ms | +| search ready (152 ses) | ~150 ms | +| RSS idle | 18.6 MiB | +| `/api/stats` p50 | 0.08 ms | +| `/api/projects` p50 | 0.14 ms | +| big session parse | 27 ms | +| reindex 234 MiB | 430 ms | + +### search benchmark + +Synthetic JSONL bodies (~4 KiB each, 30-word vocab repeated, 1 turn per +session), single-threaded x86_64. Real Claude sessions tend to have lower term +density so latency drops accordingly. + +| sessions | rebuild() total | RSS Δ (KiB) | search "auth" p50 | search "auth tool" p50 | +| -------: | --------------: | ----------: | ----------------: | ---------------------: | +| 100 | ~8 ms | ~2.5 K| ~0.9 ms | ~1.4 ms | +| 1000 | ~58 ms | ~17 K| ~8.9 ms | ~15 ms | +| 5000 | ~282 ms | ~76 K| ~46 ms | ~76 ms | + +Reproduce locally: `cargo run --release --example bench_search`. Single binary, no runtime, no warmup. It just starts. 
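Reviewer note on the search DSL described in the README hunk above: the
split between plain terms (sent to the server) and operators (evaluated
against metadata) might look roughly like the sketch below. The actual
implementation is JavaScript in `static/search.mjs`; this Rust version is
only an illustration, and `ParsedQuery`, `OPERATORS` and `parse_query` are
hypothetical names. Value validation (for example date formats) is left
out.

```rust
/// Hypothetical result type: plain terms and quoted phrases on one side,
/// (operator, value) filters on the other.
#[derive(Debug, Default, PartialEq)]
struct ParsedQuery {
    terms: Vec<String>,
    filters: Vec<(String, String)>,
}

/// Operator names taken from the README's DSL description.
const OPERATORS: &[&str] = &["project", "model", "has", "before", "after"];

fn parse_query(input: &str) -> ParsedQuery {
    let mut out = ParsedQuery::default();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next();
        } else if c == '"' {
            chars.next(); // consume the opening quote
            // Collect up to the closing quote; `take_while` also consumes it.
            let phrase: String = chars.by_ref().take_while(|&ch| ch != '"').collect();
            if !phrase.is_empty() {
                out.terms.push(phrase.to_lowercase());
            }
        } else {
            // Read one whitespace-delimited word.
            let mut word = String::new();
            while let Some(&ch) = chars.peek() {
                if ch.is_whitespace() {
                    break;
                }
                word.push(ch);
                chars.next();
            }
            match word.split_once(':') {
                // Known operator with a non-empty value becomes a filter.
                Some((op, val)) if OPERATORS.contains(&op) && !val.is_empty() => {
                    out.filters.push((op.to_string(), val.to_lowercase()));
                }
                // Anything else is treated as a plain search term.
                _ => out.terms.push(word.to_lowercase()),
            }
        }
    }
    out
}
```

For the input `auth "merge conflict" project:web has:tool`, this sketch
produces the terms `auth` and `merge conflict` plus the filters
`(project, web)` and `(has, tool)`. An unknown prefix such as `foo:bar`
falls through to a plain term, which is one possible policy and not
necessarily what the shipped client does.
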
diff --git a/examples/bench_search.rs b/examples/bench_search.rs new file mode 100644 index 0000000..3eff43a --- /dev/null +++ b/examples/bench_search.rs @@ -0,0 +1,168 @@ +use std::path::Path; +use std::time::Instant; + +use constellation::index::Index; +use tempfile::TempDir; + +const VOCAB: &[&str] = &[ + "authentication", + "authorization", + "session", + "database", + "migration", + "performance", + "logging", + "metrics", + "regex", + "parser", + "tokenizer", + "snapshot", + "rebuild", + "concurrent", + "websocket", + "router", + "middleware", + "handler", + "request", + "response", + "search", + "index", + "postings", + "suffix", + "binary", + "claude", + "anthropic", + "model", + "prompt", + "tool", +]; + +fn synth_body(seed: u64) -> String { + let mut out = String::with_capacity(4096); + for i in 0..200 { + let idx = ((seed.wrapping_mul(31).wrapping_add(i)) as usize) % VOCAB.len(); + out.push_str(VOCAB[idx]); + out.push(' '); + if i % 12 == 11 { + out.push('\n'); + } + } + out +} + +fn seed_session(project_dir: &Path, sid: &str, body: &str) { + std::fs::create_dir_all(project_dir).unwrap(); + let escaped: String = body + .chars() + .map(|c| match c { + '\n' => "\\n".to_string(), + '"' => "\\\"".to_string(), + '\\' => "\\\\".to_string(), + c => c.to_string(), + }) + .collect(); + let content = format!( + "{{\"type\":\"ai-title\",\"aiTitle\":\"bench\",\"sessionId\":\"{sid}\"}}\n\ +{{\"type\":\"user\",\"message\":{{\"role\":\"user\",\"content\":\"{escaped}\"}},\ +\"uuid\":\"u-1\",\"timestamp\":\"2026-05-25T11:00:00.000Z\",\"sessionId\":\"{sid}\",\"cwd\":\"/srv/x\"}}\n" + ); + std::fs::write(project_dir.join(format!("{sid}.jsonl")), content).unwrap(); +} + +fn rss_kib() -> Option { + let txt = std::fs::read_to_string("/proc/self/status").ok()?; + for line in txt.lines() { + if let Some(rest) = line.strip_prefix("VmRSS:") { + let n: u64 = rest.split_whitespace().next()?.parse().ok()?; + return Some(n); + } + } + None +} + +fn median_ns(samples: &mut [u128]) -> 
u128 { + samples.sort_unstable(); + samples[samples.len() / 2] +} + +fn p99_ns(samples: &mut [u128]) -> u128 { + samples.sort_unstable(); + samples[(samples.len() * 99 / 100).min(samples.len() - 1)] +} + +fn run(n_sessions: usize, queries: &[&str]) { + let tmp = TempDir::new().unwrap(); + + let seed_start = Instant::now(); + let n_projects = (n_sessions / 10).max(1); + for i in 0..n_sessions { + let proj = i % n_projects; + let proj_dir = tmp.path().join(format!("-bench-{proj:03}")); + seed_session(&proj_dir, &format!("sess-{i:05}"), &synth_body(i as u64)); + } + let seed_elapsed = seed_start.elapsed(); + + let rss_before = rss_kib(); + let idx = Index::new(tmp.path().to_owned()); + + let rebuild_start = Instant::now(); + idx.rebuild(); + let rebuild_total = rebuild_start.elapsed(); + let rss_after = rss_kib(); + + let snap = idx.read(); + let projects = snap.projects.len(); + let sessions = snap.by_session_id.len(); + let search_idx = snap.search_index.clone(); + drop(snap); + + println!(); + println!("=== N = {n_sessions} sessions / {n_projects} projects ==="); + println!( + "seed (synthetic JSONL write): {:>7} ms", + seed_elapsed.as_millis() + ); + println!( + "rebuild() total: {:>7} ms", + rebuild_total.as_millis() + ); + println!(" projects indexed: {projects}, sessions indexed: {sessions}"); + if let (Some(b), Some(a)) = (rss_before, rss_after) { + println!( + "RSS delta: {:>7} KiB ({} → {})", + a.saturating_sub(b), + b, + a + ); + } + + for q in queries { + let terms: Vec = q.split_whitespace().map(str::to_owned).collect(); + let mut samples = Vec::with_capacity(50); + for _ in 0..50 { + let t = Instant::now(); + let _hits = search_idx.search(&terms, 50); + samples.push(t.elapsed().as_nanos()); + } + let n = search_idx.search(&terms, 50).len(); + let med = median_ns(&mut samples); + let p99 = p99_ns(&mut samples); + println!( + "search {:<20} hits={:>5} p50 {:>6.2} µs p99 {:>6.2} µs", + format!("\"{q}\""), + n, + med as f64 / 1000.0, + p99 as f64 / 1000.0, + 
); + } +} + +fn main() { + println!("constellation-rs search benchmark"); + println!("================================="); + println!("Note: synthetic data, 1 turn per session, ~4 KiB body each."); + + run(100, &["auth", "session", "model", "auth tool"]); + run(1_000, &["auth", "session", "model", "auth tool"]); + run(5_000, &["auth", "session", "model", "auth tool"]); +} diff --git a/src/dto.rs b/src/dto.rs index 1b38884..1ecc8a0 100644 --- a/src/dto.rs +++ b/src/dto.rs @@ -10,6 +10,7 @@ pub struct IndexStats { pub sessions: usize, pub last_scan: Option>, pub scanning: bool, + pub indexing_search: bool, pub total_usage: Usage, } @@ -53,6 +54,7 @@ mod tests { sessions: 7, last_scan: None, scanning: false, + indexing_search: false, total_usage: Usage { input: 1, cache_creation: 2, @@ -68,6 +70,7 @@ mod tests { "sessions": 7, "last_scan": null, "scanning": false, + "indexing_search": false, "total_usage": { "input": 1, "cache_creation": 2, diff --git a/src/http.rs b/src/http.rs index ba141b1..45ff036 100644 --- a/src/http.rs +++ b/src/http.rs @@ -18,6 +18,7 @@ use crate::dto::{IndexStats, ProjectOut}; use crate::index::Index; use crate::parser::{SessionMeta, Usage, parse_session}; use crate::pty::{spawn_new_chat_bridge, spawn_resume_bridge}; +use crate::search::{SearchHit, tokenize_text}; #[derive(RustEmbed)] #[folder = "static/"] @@ -74,6 +75,17 @@ pub struct ResumeQuery { pub fork: bool, } +#[derive(Debug, Deserialize, Default)] +pub struct SearchQueryParams { + #[serde(default)] + pub q: String, + #[serde(default)] + pub limit: Option, +} + +const DEFAULT_SEARCH_LIMIT: usize = 50; +const MAX_SEARCH_LIMIT: usize = 200; + pub fn build_router(state: AppState) -> Router { let ws_routes = Router::new() .route("/api/projects/{sanitized_name}/new-chat", get(ws_new_chat)) @@ -88,7 +100,9 @@ pub fn build_router(state: AppState) -> Router { "/api/projects/{sanitized_name}/sessions", get(list_project_sessions), ) - .route("/api/sessions/{session_id}", get(get_session)); + 
.route("/api/sessions/{session_id}", get(get_session)) + .route("/api/search", get(get_search)) + .layer(middleware::from_fn(reject_dns_rebinding)); Router::new() .merge(api_routes) @@ -140,6 +154,40 @@ async fn reject_non_loopback_origin(req: axum::extract::Request, next: Next) -> next.run(req).await } +async fn reject_dns_rebinding(req: axum::extract::Request, next: Next) -> Response { + let headers = req.headers(); + if !host_is_loopback(headers) { + warn!( + host = ?headers.get(header::HOST), + "rejecting: non-loopback Host header (possible DNS rebinding)", + ); + return (StatusCode::FORBIDDEN, "host not allowed").into_response(); + } + if headers.get(header::ORIGIN).is_some() && !origin_is_loopback(headers) { + warn!( + origin = ?headers.get(header::ORIGIN), + "rejecting: non-loopback Origin header", + ); + return (StatusCode::FORBIDDEN, "origin not allowed").into_response(); + } + next.run(req).await +} + +fn host_is_loopback(headers: &HeaderMap) -> bool { + let Some(host) = headers.get(header::HOST).and_then(|v| v.to_str().ok()) else { + return false; + }; + let host_only = if let Some(stripped) = host.strip_prefix('[') { + match stripped.split_once(']') { + Some((h, _)) => h, + None => return false, + } + } else { + host.split_once(':').map(|(h, _)| h).unwrap_or(host) + }; + host_only == "127.0.0.1" || host_only == "localhost" || host_only == "::1" +} + fn origin_is_loopback(headers: &HeaderMap) -> bool { let Some(origin) = headers.get(header::ORIGIN).and_then(|v| v.to_str().ok()) else { return false; @@ -186,6 +234,7 @@ fn build_index_stats(state: &AppState) -> IndexStats { sessions: snap.session_count(), last_scan: snap.last_scan, scanning: state.index.is_scanning(), + indexing_search: state.index.is_indexing_search(), total_usage, } } @@ -258,6 +307,31 @@ async fn get_session( Ok(([(header::CONTENT_TYPE, "application/json")], json).into_response()) } +#[instrument(skip_all, fields(term_count, limit = ?q.limit))] +async fn get_search( + State(s): State, + 
Query(q): Query, +) -> Result>, StatusCode> { + let terms: Vec = tokenize_text(&q.q); + tracing::Span::current().record("term_count", terms.len()); + if terms.is_empty() { + return Ok(Json(Vec::new())); + } + let limit = q + .limit + .unwrap_or(DEFAULT_SEARCH_LIMIT) + .clamp(1, MAX_SEARCH_LIMIT); + + let index = s.index.read().search_index.clone(); + let hits = tokio::task::spawn_blocking(move || index.search(&terms, limit)) + .await + .map_err(|e| { + error!(error = %e, "search task panicked"); + StatusCode::INTERNAL_SERVER_ERROR + })?; + Ok(Json(hits)) +} + async fn ws_resume( State(s): State, AxumPath(session_id): AxumPath, diff --git a/src/index.rs b/src/index.rs index b5d4fc1..86309ed 100644 --- a/src/index.rs +++ b/src/index.rs @@ -7,8 +7,41 @@ use chrono::{DateTime, Utc}; use parking_lot::{Mutex, RwLock}; use tracing::error; -use crate::parser::SessionMeta; +use crate::parser::{SessionMeta, parse_session}; use crate::scanner::{ProjectInfo, default_root, scan_projects}; +use crate::search::SearchIndex; + +fn parse_session_bodies_parallel(refs: &[(String, std::path::PathBuf)]) -> Vec<(String, String)> { + if refs.is_empty() { + return Vec::new(); + } + let workers = std::thread::available_parallelism() + .map(|n| n.get()) + .unwrap_or(4) + .min(refs.len()); + let chunk_size = refs.len().div_ceil(workers); + std::thread::scope(|scope| { + let handles: Vec<_> = refs + .chunks(chunk_size) + .map(|chunk| { + scope.spawn(move || { + chunk + .iter() + .map(|(id, path)| { + let session = parse_session(path); + (id.clone(), session.indexable_text()) + }) + .collect::>() + }) + }) + .collect(); + let mut out = Vec::with_capacity(refs.len()); + for h in handles { + out.extend(h.join().expect("parse worker panicked")); + } + out + }) +} #[derive(Default)] pub struct Snapshot { @@ -16,6 +49,7 @@ pub struct Snapshot { pub by_project: HashMap, pub by_session_id: HashMap, pub last_scan: Option>, + pub search_index: Arc, } impl Snapshot { @@ -63,6 +97,7 @@ pub struct Index { 
root: PathBuf, state: Arc>, scanning: Arc>, + indexing_search: Arc>, rebuild_lock: Arc>, } @@ -73,6 +108,7 @@ impl Index { root, state: Arc::new(RwLock::new(Snapshot::default())), scanning: Arc::new(Mutex::new(false)), + indexing_search: Arc::new(Mutex::new(false)), rebuild_lock: Arc::new(Mutex::new(())), } } @@ -89,26 +125,79 @@ impl Index { *self.scanning.lock() } + pub fn is_indexing_search(&self) -> bool { + *self.indexing_search.lock() + } + + #[tracing::instrument(skip(self), fields(root = ?self.root))] pub fn rebuild(&self) { let _serialise = self.rebuild_lock.lock(); let _flag = ScanFlag::set(&self.scanning); + let started = std::time::Instant::now(); let projects = scan_projects(&self.root); let mut by_project = HashMap::with_capacity(projects.len()); let mut by_session = HashMap::new(); + let mut session_refs: Vec<(String, std::path::PathBuf)> = Vec::new(); + for (idx, p) in projects.iter().enumerate() { by_project.insert(p.sanitized_name.clone(), idx); for s in &p.sessions { by_session.insert(s.id.clone(), s.clone()); + session_refs.push((s.id.clone(), s.path.clone())); + } + } + + let project_count = projects.len(); + let session_count = by_session.len(); + + { + let mut state = self.state.write(); + *state = Snapshot { + projects, + by_project, + by_session_id: by_session, + last_scan: Some(Utc::now()), + search_index: Arc::new(SearchIndex::default()), + }; + } + let metadata_elapsed = started.elapsed(); + tracing::info!( + project_count, + session_count, + elapsed_ms = metadata_elapsed.as_millis() as u64, + "rebuild metadata-phase complete", + ); + + *self.indexing_search.lock() = true; + let search_started = std::time::Instant::now(); + let pairs = parse_session_bodies_parallel(&session_refs); + let mut search_index = SearchIndex::default(); + for (id, body) in pairs { + if !body.is_empty() { + search_index.add(id, body); } } - let new_snap = Snapshot { - projects, - by_project, - by_session_id: by_session, - last_scan: Some(Utc::now()), - }; - 
*self.state.write() = new_snap; + { + let mut state = self.state.write(); + state.search_index = Arc::new(search_index); + } + *self.indexing_search.lock() = false; + + tracing::info!( + project_count, + session_count, + metadata_ms = metadata_elapsed.as_millis() as u64, + search_ms = search_started.elapsed().as_millis() as u64, + total_ms = started.elapsed().as_millis() as u64, + "rebuild complete", + ); + if session_count == 0 && project_count > 0 { + tracing::warn!( + project_count, + "indexed projects but found zero sessions — root may contain only empty projects", + ); + } } pub async fn rebuild_async(&self) -> Result<(), tokio::task::JoinError> { @@ -194,6 +283,21 @@ mod tests { let idx = Index::new(tmp.path().to_owned()); idx.rebuild(); assert!(!idx.is_scanning()); + assert!(!idx.is_indexing_search()); + } + + #[test] + fn rebuild_populates_search_index_with_session_bodies() { + let tmp = TempDir::new().unwrap(); + seed(&tmp.path().join("-x"), "with_tools.jsonl", "tools-uuid"); + let idx = Index::new(tmp.path().to_owned()); + idx.rebuild(); + + let snap = idx.read(); + let hits = snap.search_index.search(&["readme".to_string()], 50); + assert_eq!(hits.len(), 1); + assert_eq!(hits[0].session_id, "tools-uuid"); + assert!(!hits[0].snippets.is_empty()); } #[test] diff --git a/src/lib.rs b/src/lib.rs index 989732c..91b2617 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -4,3 +4,4 @@ pub mod index; pub mod parser; pub mod pty; pub mod scanner; +pub mod search; diff --git a/src/parser.rs b/src/parser.rs index 85e4e1a..5dc9004 100644 --- a/src/parser.rs +++ b/src/parser.rs @@ -121,6 +121,47 @@ pub struct Session { pub turns: Vec, } +impl Session { + pub fn indexable_text(&self) -> String { + let mut buf = String::new(); + for turn in &self.turns { + for block in &turn.blocks { + match block { + Block::Text { text } | Block::Thinking { text } => { + if !text.is_empty() { + buf.push_str(text); + buf.push('\n'); + } + } + Block::ToolUse { + tool_name, + tool_input, + .. 
+ } => { + if !tool_name.is_empty() { + buf.push_str(tool_name); + buf.push('\n'); + } + if !tool_input.is_null() { + if let Ok(s) = serde_json::to_string(tool_input) { + buf.push_str(&s); + buf.push('\n'); + } + } + } + Block::ToolResult { tool_output, .. } => { + if !tool_output.is_empty() { + buf.push_str(tool_output); + buf.push('\n'); + } + } + } + } + } + buf + } +} + #[derive(Debug, Deserialize)] #[serde(rename_all = "camelCase")] struct RawAiTitle { @@ -565,6 +606,25 @@ mod tests { fixtures().join(name) } + #[test] + fn indexable_text_collects_all_block_kinds() { + let session = parse_session(&fp("with_tools.jsonl")); + let text = session.indexable_text(); + + assert!(text.contains("List the files"), "user text"); + assert!(text.contains("I should use ls"), "thinking"); + assert!(text.contains("project root"), "assistant text"); + assert!(text.contains("Bash"), "tool name"); + assert!(text.contains("ls -la"), "tool input"); + assert!(text.contains("README.md"), "tool output"); + } + + #[test] + fn indexable_text_empty_session() { + let session = parse_session(&fp("empty.jsonl")); + assert_eq!(session.indexable_text(), ""); + } + #[test] fn meta_minimal() { let m = parse_session_meta(&fp("minimal.jsonl")); diff --git a/src/search.rs b/src/search.rs new file mode 100644 index 0000000..56a34fa --- /dev/null +++ b/src/search.rs @@ -0,0 +1,439 @@ +use std::collections::{HashMap, HashSet}; +use std::sync::OnceLock; + +use serde::Serialize; + +const MIN_TOKEN_LEN: usize = 2; +const CLUSTER_GAP: usize = 80; +const SNIPPET_CONTEXT: usize = 80; +const MAX_SNIPPETS_PER_HIT: usize = 3; + +#[derive(Debug, Default, Clone, Serialize)] +pub struct Snippet { + pub text: String, + pub matches: Vec<(usize, usize)>, +} + +#[derive(Debug, Default, Clone, Serialize)] +pub struct SearchHit { + pub session_id: String, + pub score: u32, + pub snippets: Vec, +} + +#[derive(Debug, Clone)] +struct IndexedDoc { + session_id: String, + chars: Vec, + lowered: Vec, +} + +#[derive(Debug, 
Default)] +struct SuffixIndex { + entries: Vec<(String, u32)>, + tokens: Vec, +} + +#[derive(Debug, Default)] +pub struct SearchIndex { + postings: HashMap>, + docs: Vec, + suffix: OnceLock, +} + +impl Clone for SearchIndex { + fn clone(&self) -> Self { + Self { + postings: self.postings.clone(), + docs: self.docs.clone(), + suffix: OnceLock::new(), + } + } +} + +impl SearchIndex { + pub fn add(&mut self, session_id: impl Into, body: impl Into) { + let id = session_id.into(); + let body = body.into(); + let idx = self.docs.len() as u32; + + let mut seen = HashSet::new(); + for tok in tokenize_text(&body) { + if seen.insert(tok.clone()) { + self.postings.entry(tok).or_default().push(idx); + } + } + + let chars: Vec = body.chars().collect(); + let lowered: Vec = chars + .iter() + .map(|c| c.to_lowercase().next().unwrap_or(*c)) + .collect(); + self.docs.push(IndexedDoc { + session_id: id, + chars, + lowered, + }); + } + + pub fn search(&self, terms: &[String], limit: usize) -> Vec { + let normalized: Vec = terms + .iter() + .map(|t| t.to_lowercase()) + .filter(|t| t.chars().count() >= MIN_TOKEN_LEN) + .collect(); + if normalized.is_empty() { + return Vec::new(); + } + + let suffix = self.suffix.get_or_init(|| self.build_suffix_index()); + + let mut acc: Option> = None; + for term in &normalized { + let posts = lookup_by_substring(suffix, &self.postings, term); + acc = Some(match acc { + None => posts, + Some(prev) => prev.intersection(&posts).copied().collect(), + }); + } + let candidates = acc.unwrap_or_default(); + + let mut hits: Vec = candidates + .into_iter() + .map(|idx| { + let doc = &self.docs[idx as usize]; + let snippets = build_snippets_from_chars(&doc.chars, &doc.lowered, &normalized); + let score: u32 = snippets.iter().map(|s| s.matches.len() as u32).sum(); + SearchHit { + session_id: doc.session_id.clone(), + score, + snippets, + } + }) + .collect(); + + hits.sort_by(|a, b| { + b.score + .cmp(&a.score) + .then_with(|| a.session_id.cmp(&b.session_id)) + 
}); + hits.truncate(limit); + hits + } + + fn build_suffix_index(&self) -> SuffixIndex { + let tokens: Vec = self.postings.keys().cloned().collect(); + let mut entries: Vec<(String, u32)> = Vec::new(); + for (idx, tok) in tokens.iter().enumerate() { + let chars: Vec = tok.chars().collect(); + for start in 0..chars.len() { + let suffix: String = chars[start..].iter().collect(); + entries.push((suffix, idx as u32)); + } + } + entries.sort(); + SuffixIndex { entries, tokens } + } +} + +fn lookup_by_substring( + suffix: &SuffixIndex, + postings: &HashMap>, + term: &str, +) -> HashSet { + let start = suffix.entries.partition_point(|(s, _)| s.as_str() < term); + let mut token_ids: HashSet = HashSet::new(); + for (s, tid) in suffix.entries[start..].iter() { + if !s.starts_with(term) { + break; + } + token_ids.insert(*tid); + } + let mut docs: HashSet = HashSet::new(); + for tid in token_ids { + if let Some(post) = postings.get(&suffix.tokens[tid as usize]) { + docs.extend(post.iter().copied()); + } + } + docs +} + +fn build_snippets_from_chars(chars: &[char], lower: &[char], terms: &[String]) -> Vec { + if chars.is_empty() { + return Vec::new(); + } + + let mut hits: Vec<(usize, usize)> = Vec::new(); + for term in terms { + let needle: Vec = term.chars().collect(); + if needle.is_empty() || needle.len() > lower.len() { + continue; + } + for i in 0..=(lower.len() - needle.len()) { + if lower[i..i + needle.len()] == needle[..] { + hits.push((i, i + needle.len())); + } + } + } + if hits.is_empty() { + return Vec::new(); + } + + hits.sort(); + let mut merged: Vec<(usize, usize)> = vec![hits[0]]; + for &(s, e) in &hits[1..] 
{ + let last = merged.last_mut().unwrap(); + if s <= last.1 { + if e > last.1 { + last.1 = e; + } + } else { + merged.push((s, e)); + } + } + + let mut clusters: Vec> = Vec::new(); + for m in merged { + match clusters.last_mut() { + Some(c) if m.0.saturating_sub(c.last().unwrap().1) <= CLUSTER_GAP => { + c.push(m); + } + _ => clusters.push(vec![m]), + } + } + + clusters + .into_iter() + .take(MAX_SNIPPETS_PER_HIT) + .map(|cluster| { + let first = cluster.first().unwrap().0; + let last = cluster.last().unwrap().1; + let start = first.saturating_sub(SNIPPET_CONTEXT); + let end = (last + SNIPPET_CONTEXT).min(chars.len()); + let text: String = chars[start..end].iter().collect(); + let matches: Vec<(usize, usize)> = cluster + .iter() + .map(|&(s, e)| (s - start, e - start)) + .collect(); + Snippet { text, matches } + }) + .collect() +} + +pub fn tokenize_text(input: &str) -> Vec { + let mut out = Vec::new(); + let mut buf = String::new(); + for ch in input.chars() { + if ch.is_alphanumeric() { + for c in ch.to_lowercase() { + buf.push(c); + } + } else if buf.chars().count() >= MIN_TOKEN_LEN { + out.push(std::mem::take(&mut buf)); + } else { + buf.clear(); + } + } + if buf.chars().count() >= MIN_TOKEN_LEN { + out.push(buf); + } + out +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn tokenize_text_lowercases_and_splits_on_non_alnum() { + assert_eq!( + tokenize_text("Hello, World! 
it's-AUTH bug"), + vec!["hello", "world", "it", "auth", "bug"], + ); + } + + #[test] + fn tokenize_text_drops_single_char_tokens() { + assert_eq!(tokenize_text("a b cd e"), vec!["cd"]); + } + + #[test] + fn tokenize_text_handles_empty_and_whitespace() { + assert_eq!(tokenize_text(""), Vec::::new()); + assert_eq!(tokenize_text(" "), Vec::::new()); + assert_eq!(tokenize_text("!@#$%"), Vec::::new()); + } + + #[test] + fn tokenize_text_supports_unicode() { + assert_eq!(tokenize_text("Привет, МИР!"), vec!["привет", "мир"],); + } + + #[test] + fn tokenize_text_keeps_alphanumeric_words_with_digits() { + assert_eq!( + tokenize_text("file42 ver1.10"), + vec!["file42", "ver1", "10"] + ); + } + + #[test] + fn tokenize_text_drops_short_numeric_tails() { + assert_eq!(tokenize_text("v1.0"), vec!["v1"]); + } + + fn ids(hits: &[SearchHit]) -> Vec<&str> { + hits.iter().map(|h| h.session_id.as_str()).collect() + } + + #[test] + fn index_empty_returns_no_hits() { + let idx = SearchIndex::default(); + let q = ["auth".to_string()]; + assert!(idx.search(&q, 50).is_empty()); + } + + #[test] + fn index_single_session_single_term_hit() { + let mut idx = SearchIndex::default(); + idx.add("s1", "Authentication is broken"); + let hits = idx.search(&["auth".to_string()], 50); + assert_eq!(ids(&hits), vec!["s1"]); + } + + #[test] + fn index_multi_term_is_and_across_terms() { + let mut idx = SearchIndex::default(); + idx.add("s1", "Authentication bug fix"); + idx.add("s2", "Authentication only"); + idx.add("s3", "Just bug"); + let hits = idx.search(&["auth".to_string(), "bug".to_string()], 50); + let mut got = ids(&hits); + got.sort(); + assert_eq!(got, vec!["s1"]); + } + + #[test] + fn index_substring_of_token_matches() { + let mut idx = SearchIndex::default(); + idx.add("s1", "Authentication broken"); + idx.add("s2", "Authorization missing"); + let hits = idx.search(&["auth".to_string()], 50); + let mut got = ids(&hits); + got.sort(); + assert_eq!(got, vec!["s1", "s2"]); + } + + #[test] + fn 
index_case_insensitive() { + let mut idx = SearchIndex::default(); + idx.add("s1", "AUTH bug"); + let hits = idx.search(&["AuTh".to_string()], 50); + assert_eq!(ids(&hits), vec!["s1"]); + } + + #[test] + fn index_drops_short_query_terms_silently() { + let mut idx = SearchIndex::default(); + idx.add("s1", "auth bug"); + let hits = idx.search(&["a".to_string(), "auth".to_string()], 50); + assert_eq!(ids(&hits), vec!["s1"]); + } + + #[test] + fn search_hit_carries_a_snippet_with_match_offsets() { + let mut idx = SearchIndex::default(); + idx.add("s1", "We need to fix the auth bug today"); + let hits = idx.search(&["auth".to_string()], 50); + assert_eq!(hits.len(), 1); + let snip = hits[0].snippets.first().expect("at least one snippet"); + assert!(snip.text.to_lowercase().contains("auth")); + assert!(!snip.matches.is_empty()); + let (start, end) = snip.matches[0]; + let snip_chars: Vec = snip.text.chars().collect(); + assert!(end <= snip_chars.len()); + let matched: String = snip_chars[start..end].iter().collect(); + assert_eq!( + matched.to_lowercase(), + "auth", + "snippet.matches[0] should bracket the matched term, got {matched:?}", + ); + } + + #[test] + fn snippet_offset_at_document_start() { + let mut idx = SearchIndex::default(); + idx.add( + "s1", + "auth followed by lots of filler text here filling space", + ); + let hits = idx.search(&["auth".to_string()], 50); + let snip = &hits[0].snippets[0]; + let (start, end) = snip.matches[0]; + let chars: Vec = snip.text.chars().collect(); + let matched: String = chars[start..end].iter().collect(); + assert_eq!(matched.to_lowercase(), "auth"); + assert_eq!(start, 0, "term at position 0 should land at snippet start"); + } + + #[test] + fn snippet_offset_at_document_end() { + let mut idx = SearchIndex::default(); + idx.add( + "s1", + "this document is mostly filler and ends with the marker auth", + ); + let hits = idx.search(&["auth".to_string()], 50); + let snip = &hits[0].snippets[0]; + let (start, end) = 
snip.matches[0]; + let chars: Vec<char> = snip.text.chars().collect(); + let matched: String = chars[start..end].iter().collect(); + assert_eq!(matched.to_lowercase(), "auth"); + assert_eq!( + end, + chars.len(), + "term at document end should land at snippet end", + ); + } + + #[test] + fn search_results_sorted_by_score_desc() { + let mut idx = SearchIndex::default(); + idx.add("a", "auth auth auth bug"); + idx.add("b", "auth bug"); + let hits = idx.search(&["auth".to_string()], 50); + assert_eq!(hits[0].session_id, "a"); + assert!(hits[0].score >= hits[1].score); + } + + #[test] + fn search_respects_limit() { + let mut idx = SearchIndex::default(); + for i in 0..10 { + idx.add(format!("s{i}"), "auth bug"); + } + let hits = idx.search(&["auth".to_string()], 3); + assert_eq!(hits.len(), 3); + } + + fn fixture_path(name: &str) -> std::path::PathBuf { + std::path::PathBuf::from(env!("CARGO_MANIFEST_DIR")) + .join("tests/fixtures/sessions") + .join(name) + } + + #[test] + fn build_index_from_session_finds_internal_terms() { + let session = crate::parser::parse_session(&fixture_path("with_tools.jsonl")); + let mut idx = SearchIndex::default(); + idx.add(&session.meta.id, session.indexable_text()); + + let hits = idx.search(&["readme".to_string()], 50); + assert_eq!( + hits.len(), + 1, + "should match the README snippet inside tool output" + ); + assert_eq!(hits[0].session_id, session.meta.id); + } +} diff --git a/static/app.js b/static/app.js index a90878d..dec64a1 100644 --- a/static/app.js +++ b/static/app.js @@ -1,3 +1,15 @@ +import { + parseQuery, + matches, + highlightSegments, + segmentsFromRanges, +} from "./search.mjs"; + +const SEARCH_DEBOUNCE_MS = 120; +const SEARCH_STORAGE_KEY = "constellation.search"; +const DEFAULT_SEARCH_LIMIT = 50; +const SEARCH_FETCH_TIMEOUT_MS = 10000; + const state = { projects: [], activeProject: null, @@ -12,11 +24,17 @@ const state = { }; async function api(path, opts = {}) { - const res = await fetch(path, opts); - if (!res.ok) { - 
throw new Error(`${res.status} ${res.statusText} — ${path}`); + const ctrl = new AbortController(); + const timer = setTimeout(() => ctrl.abort(), opts.timeoutMs ?? SEARCH_FETCH_TIMEOUT_MS); + try { + const res = await fetch(path, { ...opts, signal: ctrl.signal }); + if (!res.ok) { + throw new Error(`${res.status} ${res.statusText} — ${path}`); + } + return await res.json(); + } finally { + clearTimeout(timer); } - return res.json(); } function wsUrl(path) { @@ -30,6 +48,8 @@ const API = { projects: () => api("/api/projects"), sessions: (proj) => api(`/api/projects/${encodeURIComponent(proj)}/sessions`), session: (id) => api(`/api/sessions/${encodeURIComponent(id)}`), + search: (q, limit = 200) => + api(`/api/search?q=${encodeURIComponent(q)}&limit=${limit}`), resumeWsUrl: (id, fork) => wsUrl(`/api/sessions/${encodeURIComponent(id)}/pty${fork ? "?fork=true" : ""}`), newChatWsUrl: (proj) => wsUrl(`/api/projects/${encodeURIComponent(proj)}/new-chat`), @@ -165,6 +185,72 @@ function escapeHtmlText(s) { return div.innerHTML; } +function debounce(fn, ms) { + let t; + return (...args) => { + clearTimeout(t); + t = setTimeout(() => fn(...args), ms); + }; +} + +let _allSessionsPromise = null; + +function _fetchAllSessions() { + const failures = []; + return Promise.all( + state.projects.map((p) => + API.sessions(p.sanitized_name) + .then((sessions) => + sessions.map((s) => ({ + ...s, + _project: { + sanitized_name: p.sanitized_name, + display_path: p.display_path, + cwd: p.cwd, + }, + })), + ) + .catch((err) => { + failures.push({ project: p.sanitized_name, err: String(err) }); + return []; + }), + ), + ).then((results) => { + const all = results.flat(); + all.sort((a, b) => new Date(b.last_at || 0) - new Date(a.last_at || 0)); + if (failures.length) { + console.warn("ensureAllSessions: failed projects", failures); + } + return { sessions: all, failures }; + }); +} + +function ensureAllSessions() { + if (!_allSessionsPromise) { + _allSessionsPromise = _fetchAllSessions(); + 
} + return _allSessionsPromise; +} + +function invalidateAllSessions() { + _allSessionsPromise = null; +} + +function persistSearch(q) { + try { + if (q) localStorage.setItem(SEARCH_STORAGE_KEY, q); + else localStorage.removeItem(SEARCH_STORAGE_KEY); + } catch {} +} + +function loadPersistedSearch() { + try { + return localStorage.getItem(SEARCH_STORAGE_KEY) || ""; + } catch { + return ""; + } +} + function renderRail() { const list = $("#proj-list"); list.innerHTML = ""; @@ -213,11 +299,14 @@ function appendPath(parent, path) { } } -function renderChatList(sessions) { +function renderChatList(sessions, options = {}) { const root = $("#chat-list"); root.innerHTML = ""; if (!sessions || sessions.length === 0) { - root.appendChild(el("div", { class: "placeholder" }, "no conversations")); + const msg = options.highlightTerms?.length + ? "no matches" + : "no conversations"; + root.appendChild(el("div", { class: "placeholder" }, msg)); return; } @@ -228,11 +317,35 @@ function renderChatList(sessions) { root.appendChild(el("div", { class: "day" }, label)); prevLabel = label; } - root.appendChild(renderChatRow(s)); + root.appendChild(renderChatRow(s, options)); } } -function renderChatRow(s) { +function segmentsToNode(segments) { + const node = document.createElement("span"); + for (const seg of segments) { + if (seg.match) { + const mark = document.createElement("mark"); + mark.textContent = seg.text; + node.appendChild(mark); + } else if (seg.text) { + node.appendChild(document.createTextNode(seg.text)); + } + } + return node; +} + +function withHighlight(text, terms) { + return segmentsToNode(highlightSegments(text || "", terms)); +} + +function withRangeHighlight(text, ranges) { + return segmentsToNode(segmentsFromRanges(text || "", ranges)); +} + +function renderChatRow(s, options = {}) { + const terms = options.highlightTerms || []; + const backendSnippets = options.snippetsById?.get(s.id) || null; const total = tokensTotal(s.usage); const metaChildren = [ 
el("span", {}, `${s.message_count} msgs`), @@ -252,6 +365,32 @@ function renderChatRow(s) { metaChildren.push(el("span", { class: "sep" }, "·")); metaChildren.push(el("span", {}, s.model)); } + + const titleNode = el("div", { class: "title", title: s.title }); + titleNode.appendChild(withHighlight(s.title || "(untitled)", terms)); + + const children = [titleNode]; + + if (backendSnippets?.length) { + for (const snip of backendSnippets.slice(0, 2)) { + const snipNode = el("div", { class: "snippet body" }); + snipNode.appendChild(withRangeHighlight(snip.text || "", snip.matches)); + children.push(snipNode); + } + } else { + const snippetNode = el("div", { class: "snippet" }); + snippetNode.appendChild(withHighlight(s.snippet || "—", terms)); + children.push(snippetNode); + } + + if (options.showProject && s._project?.display_path) { + const chip = el("div", { class: "proj-chip", title: s._project.cwd || "" }); + chip.appendChild(withHighlight(s._project.display_path, terms)); + children.push(chip); + } + + children.push(el("div", { class: "meta" }, ...metaChildren)); + return el( "article", { @@ -259,9 +398,7 @@ function renderChatRow(s) { "data-id": s.id, onclick: () => selectSession(s.id), }, - el("div", { class: "title", title: s.title }, s.title || "(untitled)"), - el("div", { class: "snippet" }, s.snippet || "—"), - el("div", { class: "meta" }, ...metaChildren), + ...children, ); } @@ -459,17 +596,59 @@ async function selectSession(id) { } } -function applySearchFilter() { - const q = state.searchQuery.trim().toLowerCase(); - let filtered = state.sessions; - if (q) { - filtered = state.sessions.filter(s => { - return (s.title || "").toLowerCase().includes(q) - || (s.snippet || "").toLowerCase().includes(q) - || (s.id || "").toLowerCase().includes(q); - }); +let _searchSeq = 0; + +async function applySearchFilter() { + const seq = ++_searchSeq; + const q = state.searchQuery.trim(); + if (!q) { + if (seq === _searchSeq) renderChatList(state.sessions); + return; } 
- renderChatList(filtered); + + const parsed = parseQuery(q); + const { sessions: pool, failures } = await ensureAllSessions(); + if (seq !== _searchSeq) return; + + const opsOnly = { ...parsed, terms: [] }; + const opFiltered = pool.filter((s) => matches(s, opsOnly)); + const byId = new Map(opFiltered.map((s) => [s.id, s])); + + let filtered; + const snippetsById = new Map(); + const statusBits = []; + if (failures.length) statusBits.push(`${failures.length} projects unavailable`); + + if (parsed.terms.length > 0) { + try { + const termsQuery = parsed.terms.join(" "); + const hits = await API.search(termsQuery, DEFAULT_SEARCH_LIMIT); + if (seq !== _searchSeq) return; + filtered = []; + for (const hit of hits) { + const session = byId.get(hit.session_id); + if (!session) continue; + filtered.push(session); + if (hit.snippets?.length) snippetsById.set(hit.session_id, hit.snippets); + } + } catch (e) { + if (seq !== _searchSeq) return; + console.warn("backend search failed, falling back to metadata only", e); + filtered = opFiltered.filter((s) => matches(s, parsed)); + statusBits.push("metadata only — backend unavailable"); + } + } else { + filtered = opFiltered; + } + + if (seq !== _searchSeq) return; + renderChatList(filtered, { + highlightTerms: parsed.terms, + showProject: true, + snippetsById, + }); + const suffix = statusBits.length ? 
` · (${statusBits.join("; ")})` : ""; + renderListHead(`search · ${q}${suffix}`, filtered); } function openTerminal(sessionId, { fork = false } = {}) { @@ -654,14 +833,25 @@ function toast(msg, kind = "info") { }, 2400); } +let _lastSeenScan = null; + async function refreshStats() { try { const s = await API.stats(); + if (_lastSeenScan && s.last_scan && s.last_scan !== _lastSeenScan) { + invalidateAllSessions(); + } + _lastSeenScan = s.last_scan; + const sysEl = $("#sys"); - sysEl.classList.toggle("scanning", s.scanning); + sysEl.classList.toggle("scanning", s.scanning || s.indexing_search); sysEl.classList.remove("error"); const totalTok = tokensTotal(s.total_usage); - const main = s.scanning ? "scanning…" : `${s.sessions} indexed`; + const main = s.scanning + ? "scanning…" + : s.indexing_search + ? "indexing search…" + : `${s.sessions} indexed`; const tokPart = totalTok > 0 ? ` · ${fmtTok(totalTok)} tok` : ""; $("#sys-text").textContent = main + tokPart; const u = s.total_usage || {}; @@ -694,9 +884,11 @@ async function loadProjects(preferProject = null) { } function bindUi() { + const debouncedFilter = debounce(() => applySearchFilter(), SEARCH_DEBOUNCE_MS); $("#search").addEventListener("input", (e) => { state.searchQuery = e.target.value; - applySearchFilter(); + persistSearch(state.searchQuery); + debouncedFilter(); }); document.addEventListener("keydown", (e) => { @@ -730,6 +922,7 @@ function bindUi() { btn.textContent = "↻ scanning…"; try { await API.reindex(); + invalidateAllSessions(); await refreshStats(); await loadProjects(state.activeProject); } finally { @@ -745,8 +938,14 @@ function bindUi() { async function main() { bindUi(); + const persisted = loadPersistedSearch(); + if (persisted) { + state.searchQuery = persisted; + $("#search").value = persisted; + } await refreshStats(); await loadProjects(); + if (state.searchQuery) await applySearchFilter(); setInterval(refreshStats, 15000); } diff --git a/static/index.html b/static/index.html index 
8499011..4cf7e4a 100644 --- a/static/index.html +++ b/static/index.html @@ -45,7 +45,13 @@
diff --git a/static/search.mjs b/static/search.mjs new file mode 100644 index 0000000..6a4e901 --- /dev/null +++ b/static/search.mjs @@ -0,0 +1,204 @@ +const LIST_OPERATORS = new Set(["project", "model", "has"]); +const DATE_OPERATORS = new Set(["before", "after"]); + +export function tokenize(input) { + const out = []; + if (!input) return out; + let buf = ""; + let inQuote = false; + for (const ch of input) { + if (ch === '"') { + inQuote = !inQuote; + continue; + } + if (!inQuote && /\s/.test(ch)) { + if (buf) { + out.push(buf); + buf = ""; + } + continue; + } + buf += ch; + } + if (buf) out.push(buf); + return out; +} + +export function parseQuery(raw) { + const result = { + terms: [], + operators: { project: [], model: [], has: [], before: null, after: null }, + }; + if (typeof raw !== "string" || raw.length === 0) return result; + + for (const tok of tokenize(raw)) { + const colon = tok.indexOf(":"); + if (colon > 0) { + const key = tok.slice(0, colon).toLowerCase(); + const value = tok.slice(colon + 1); + if (value && LIST_OPERATORS.has(key)) { + result.operators[key].push(value.toLowerCase()); + continue; + } + if (value && DATE_OPERATORS.has(key)) { + const d = new Date(value); + if (!Number.isNaN(d.getTime())) { + result.operators[key] = d; + continue; + } + } + } + result.terms.push(tok.toLowerCase()); + } + return result; +} + +function lc(value) { + return typeof value === "string" ? 
value.toLowerCase() : ""; +} + +function sessionHasFlag(session, flag) { + const usage = session.usage || {}; + switch (flag) { + case "tool": + return (session.tool_count || 0) > 0; + case "cache": + return (usage.cache_read || 0) + (usage.cache_creation || 0) > 0; + case "model": + return Boolean(session.model); + case "title": + return Boolean(session.title); + default: + return true; + } +} + +export function matches(session, parsed) { + if (!session || !parsed) return true; + + const project = session._project || {}; + const fields = [ + lc(session.title), + lc(session.snippet), + lc(session.id), + lc(session.cwd), + lc(session.model), + lc(project.display_path), + lc(project.sanitized_name), + ]; + + for (const term of parsed.terms) { + if (!fields.some((f) => f.includes(term))) return false; + } + + const projectFields = [lc(project.display_path), lc(project.sanitized_name)]; + if (parsed.operators.project.length > 0) { + const ok = parsed.operators.project.some((v) => + projectFields.some((f) => f.includes(v)), + ); + if (!ok) return false; + } + + if (parsed.operators.model.length > 0) { + const model = lc(session.model); + if (!parsed.operators.model.some((v) => model.includes(v))) return false; + } + + for (const flag of parsed.operators.has) { + if (!sessionHasFlag(session, flag)) return false; + } + + if (parsed.operators.before) { + if (!session.last_at) return false; + const ts = new Date(session.last_at); + if (Number.isNaN(ts.getTime()) || ts >= parsed.operators.before) return false; + } + if (parsed.operators.after) { + if (!session.last_at) return false; + const ts = new Date(session.last_at); + if (Number.isNaN(ts.getTime()) || ts <= parsed.operators.after) return false; + } + + return true; +} + +const HTML_ENTITIES = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" }; + +export function escapeHtml(s) { + if (s == null) return ""; + return String(s).replace(/[&<>"']/g, (ch) => HTML_ENTITIES[ch]); +} + +function escapeRegex(s) { + return 
s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); +} + +export function highlightSegments(text, terms) { + if (text == null) return [{ text: "", match: false }]; + const source = String(text); + const valid = (terms || []).filter((t) => typeof t === "string" && t.length > 0); + if (valid.length === 0 || source.length === 0) { + return [{ text: source, match: false }]; + } + const pattern = valid.map(escapeRegex).join("|"); + const re = new RegExp(pattern, "gi"); + + const segments = []; + let last = 0; + for (const m of source.matchAll(re)) { + if (m[0].length === 0) continue; + if (m.index > last) { + segments.push({ text: source.slice(last, m.index), match: false }); + } + segments.push({ text: m[0], match: true }); + last = m.index + m[0].length; + } + if (last < source.length) { + segments.push({ text: source.slice(last), match: false }); + } + if (segments.length === 0) { + return [{ text: source, match: false }]; + } + return segments; +} + +export function highlight(text, terms) { + if (text == null || text === "") return ""; + return highlightSegments(String(text), terms) + .map((seg) => + seg.match + ? 
"<mark>" + escapeHtml(seg.text) + "</mark>" + : escapeHtml(seg.text), + ) + .join(""); +} + +export function segmentsFromRanges(text, ranges) { + if (text == null) return [{ text: "", match: false }]; + const source = String(text); + const chars = Array.from(source); + if (chars.length === 0 || !ranges || ranges.length === 0) { + return [{ text: source, match: false }]; + } + const sorted = [...ranges] + .map(([s, e]) => [Math.max(0, s), Math.min(chars.length, e)]) + .filter(([s, e]) => s < e && s < chars.length) + .sort((a, b) => a[0] - b[0]); + + if (sorted.length === 0) return [{ text: source, match: false }]; + + const segments = []; + let last = 0; + for (const [s, e] of sorted) { + if (s < last) continue; + if (s > last) { + segments.push({ text: chars.slice(last, s).join(""), match: false }); + } + segments.push({ text: chars.slice(s, e).join(""), match: true }); + last = e; + } + if (last < chars.length) { + segments.push({ text: chars.slice(last).join(""), match: false }); + } + return segments; +} diff --git a/static/styles.css b/static/styles.css index cc643ec..9543655 100644 --- a/static/styles.css +++ b/static/styles.css @@ -320,6 +320,46 @@ a:hover { text-decoration: underline; } font-size: 0.875rem; } +.chat .proj-chip { + display: inline-block; + max-width: 100%; + margin-bottom: 8px; + padding: 2px 8px; + border: 1px solid var(--line); + border-radius: 10px; + background: var(--bg-0); + font-family: var(--font-mono); + font-size: 0.6875rem; + color: var(--fg-dim); + overflow: hidden; + text-overflow: ellipsis; + white-space: nowrap; +} + +mark { + background: var(--accent-soft); + color: var(--accent); + padding: 0 2px; + border-radius: 2px; + font-weight: 600; +} +.chat mark { + color: inherit; +} + +.chat .snippet.body { + font-family: var(--font-mono); + font-size: 0.75rem; + line-height: 1.5; + color: var(--fg-mute); + border-left: 2px solid var(--accent-soft); + padding: 2px 0 2px 8px; + margin-bottom: 4px; + -webkit-line-clamp: 2; + white-space: pre-wrap; 
+ word-break: break-word; +} + .stage { diff --git a/tests/http_integration.rs b/tests/http_integration.rs index 350711a..08be637 100644 --- a/tests/http_integration.rs +++ b/tests/http_integration.rs @@ -22,6 +22,32 @@ fn seed(project_dir: &Path, fixture: &str, session_name: &str) { .unwrap(); } +fn seed_inline(project_dir: &Path, session_id: &str, content_term: &str) { + std::fs::create_dir_all(project_dir).unwrap(); + let escaped = content_term.replace('"', "\\\""); + let content = format!( + "{{\"type\":\"ai-title\",\"aiTitle\":\"test\",\"sessionId\":\"{session_id}\"}}\n\ +{{\"type\":\"user\",\"message\":{{\"role\":\"user\",\"content\":\"{escaped}\"}},\ +\"uuid\":\"u-1\",\"timestamp\":\"2026-05-25T11:00:00.000Z\",\"sessionId\":\"{session_id}\",\"cwd\":\"/srv/x\"}}\n" + ); + std::fs::write(project_dir.join(format!("{session_id}.jsonl")), content).unwrap(); +} + +fn ready_router_with_n_matches(n: usize, term: &str) -> (axum::Router, TempDir) { + let tmp = TempDir::new().unwrap(); + for i in 0..n { + seed_inline( + &tmp.path().join(format!("-proj-{i}")), + &format!("sess-{i:03}"), + term, + ); + } + let index = Index::new(tmp.path().to_owned()); + index.rebuild(); + let state = AppState::new(index); + (build_router(state), tmp) +} + fn ready_router() -> (axum::Router, TempDir) { let tmp = TempDir::new().unwrap(); seed( @@ -55,6 +81,7 @@ fn req_get(path: &str) -> Request<Body> { Request::builder() .method(Method::GET) .uri(path) + .header(header::HOST, "127.0.0.1:6767") .body(Body::empty()) .unwrap() } @@ -63,6 +90,26 @@ fn req_post(path: &str) -> Request<Body> { Request::builder() .method(Method::POST) .uri(path) + .header(header::HOST, "127.0.0.1:6767") + .body(Body::empty()) + .unwrap() +} + +fn req_get_with_host(path: &str, host: &str) -> Request<Body> { + Request::builder() + .method(Method::GET) + .uri(path) + .header(header::HOST, host) + .body(Body::empty()) + .unwrap() +} + +fn req_get_with_origin(path: &str, origin: &str) -> Request<Body> { + Request::builder() + 
.method(Method::GET) + .uri(path) + .header(header::HOST, "127.0.0.1:6767") + .header(header::ORIGIN, origin) .body(Body::empty()) .unwrap() } @@ -89,6 +136,139 @@ async fn get_stats_returns_populated_index_shape() { assert_eq!(v["total_usage"]["output"], 300); } +#[tokio::test] +async fn api_search_rejects_non_loopback_host() { + let (app, _tmp) = ready_router(); + let resp = app + .oneshot(req_get_with_host( + "/api/search?q=readme", + "evil.example.com", + )) + .await + .unwrap(); + assert_eq!(resp.status(), StatusCode::FORBIDDEN); +} + +#[tokio::test] +async fn api_stats_rejects_non_loopback_host() { + let (app, _tmp) = ready_router(); + let resp = app + .oneshot(req_get_with_host("/api/stats", "attacker.example.com")) + .await + .unwrap(); + assert_eq!(resp.status(), StatusCode::FORBIDDEN); +} + +#[tokio::test] +async fn api_rejects_external_origin_even_when_host_is_loopback() { + let (app, _tmp) = ready_router(); + let resp = app + .oneshot(req_get_with_origin( + "/api/stats", + "https://evil.example.com", + )) + .await + .unwrap(); + assert_eq!(resp.status(), StatusCode::FORBIDDEN); +} + +#[tokio::test] +async fn api_accepts_loopback_host_with_no_origin() { + let (app, _tmp) = ready_router(); + let resp = app.oneshot(req_get("/api/stats")).await.unwrap(); + assert_eq!(resp.status(), StatusCode::OK); +} + +#[tokio::test] +async fn api_accepts_loopback_host_with_loopback_origin() { + let (app, _tmp) = ready_router(); + let resp = app + .oneshot(req_get_with_origin("/api/stats", "http://127.0.0.1:6767")) + .await + .unwrap(); + assert_eq!(resp.status(), StatusCode::OK); +} + +#[tokio::test] +async fn get_search_finds_term_inside_session_body() { + let (app, _tmp) = ready_router(); + let resp = app.oneshot(req_get("/api/search?q=readme")).await.unwrap(); + assert_eq!(resp.status(), StatusCode::OK); + let v = body_to_json(resp.into_body()).await; + let arr = v.as_array().expect("search returns array"); + assert!(!arr.is_empty(), "expected at least one hit"); + 
assert_eq!(arr[0]["session_id"], "tools-uuid"); + let snippets = arr[0]["snippets"].as_array().expect("snippets array"); + assert!(!snippets.is_empty()); + let first = &snippets[0]; + assert!(first["text"].as_str().is_some()); + assert!(first["matches"].as_array().is_some()); +} + +#[tokio::test] +async fn get_search_empty_query_returns_empty() { + let (app, _tmp) = ready_router(); + let resp = app.oneshot(req_get("/api/search?q=")).await.unwrap(); + assert_eq!(resp.status(), StatusCode::OK); + let v = body_to_json(resp.into_body()).await; + assert_eq!(v.as_array().unwrap().len(), 0); +} + +#[tokio::test] +async fn get_search_clamps_limit_max_actually_caps_results() { + let (app, _tmp) = ready_router_with_n_matches(5, "alpha"); + let resp = app + .oneshot(req_get("/api/search?q=alpha&limit=2")) + .await + .unwrap(); + assert_eq!(resp.status(), StatusCode::OK); + let v = body_to_json(resp.into_body()).await; + let arr = v.as_array().unwrap(); + assert_eq!( + arr.len(), + 2, + "limit=2 should cap at 2 hits despite 5 matching sessions", + ); +} + +#[tokio::test] +async fn get_search_clamps_limit_zero_to_min() { + let (app, _tmp) = ready_router_with_n_matches(3, "alpha"); + let resp = app + .oneshot(req_get("/api/search?q=alpha&limit=0")) + .await + .unwrap(); + assert_eq!(resp.status(), StatusCode::OK); + let v = body_to_json(resp.into_body()).await; + let arr = v.as_array().unwrap(); + assert_eq!(arr.len(), 1, "limit=0 should clamp up to MIN=1"); +} + +#[tokio::test] +async fn get_search_quoted_phrase_is_anded_by_server_tokenize() { + let (app, _tmp) = ready_router_with_n_matches(1, "Authentication module"); + let resp = app + .oneshot(req_get("/api/search?q=%22Authentication+module%22")) + .await + .unwrap(); + assert_eq!(resp.status(), StatusCode::OK); + let v = body_to_json(resp.into_body()).await; + let arr = v.as_array().unwrap(); + assert!( + !arr.is_empty(), + "server should tokenize the quoted phrase into AND'd terms and find the session", + ); +} + 
+#[tokio::test] +async fn get_search_no_q_param_returns_empty_array() { + let (app, _tmp) = ready_router(); + let resp = app.oneshot(req_get("/api/search")).await.unwrap(); + assert_eq!(resp.status(), StatusCode::OK); + let v = body_to_json(resp.into_body()).await; + assert_eq!(v.as_array().unwrap().len(), 0); +} + #[tokio::test] async fn post_reindex_returns_fresh_stats_with_scanning_false() { let (app, _tmp) = ready_router(); diff --git a/tests/js_search.rs b/tests/js_search.rs new file mode 100644 index 0000000..4cd936e --- /dev/null +++ b/tests/js_search.rs @@ -0,0 +1,26 @@ +use std::process::Command; + +#[test] +fn js_search_tests_pass() { + let probe = Command::new("node").arg("--version").output(); + let Ok(out) = probe else { + eprintln!("node not found on PATH — skipping JS search tests"); + return; + }; + if !out.status.success() { + eprintln!("`node --version` failed — skipping JS search tests"); + return; + } + + let manifest = env!("CARGO_MANIFEST_DIR"); + let status = Command::new("node") + .args(["--test", "tests/search.test.mjs"]) + .current_dir(manifest) + .status() + .expect("failed to spawn node for JS tests"); + + assert!( + status.success(), + "JS search tests failed (run `node --test tests/search.test.mjs` to see details)", + ); +} diff --git a/tests/search.test.mjs b/tests/search.test.mjs new file mode 100644 index 0000000..c95521d --- /dev/null +++ b/tests/search.test.mjs @@ -0,0 +1,415 @@ +import { test } from "node:test"; +import assert from "node:assert/strict"; + +import { + tokenize, + parseQuery, + matches, + highlight, + highlightSegments, + segmentsFromRanges, +} from "../static/search.mjs"; + +test("tokenize splits by whitespace", () => { + assert.deepEqual(tokenize("foo bar baz"), ["foo", "bar", "baz"]); +}); + +test("tokenize keeps quoted strings as one token", () => { + assert.deepEqual( + tokenize('foo "merge conflict" bar'), + ["foo", "merge conflict", "bar"], + ); +}); + +test("tokenize collapses runs of whitespace", () => { + 
assert.deepEqual(tokenize("foo bar"), ["foo", "bar"]); +}); + +test("tokenize trims leading and trailing whitespace", () => { + assert.deepEqual(tokenize(" foo "), ["foo"]); +}); + +test("tokenize returns empty array for empty / blank input", () => { + assert.deepEqual(tokenize(""), []); + assert.deepEqual(tokenize(" "), []); +}); + +test("tokenize treats unterminated quote as a single trailing token", () => { + assert.deepEqual(tokenize('foo "bar baz'), ["foo", "bar baz"]); +}); + +test("tokenize keeps colon-form operators intact", () => { + assert.deepEqual(tokenize("project:auth bug"), ["project:auth", "bug"]); +}); + +test("tokenize allows quotes inside an operator value", () => { + assert.deepEqual( + tokenize('project:"my project" foo'), + ["project:my project", "foo"], + ); +}); + +test("parseQuery: empty input returns empty defaults", () => { + const r = parseQuery(""); + assert.deepEqual(r.terms, []); + assert.deepEqual(r.operators.project, []); + assert.deepEqual(r.operators.model, []); + assert.deepEqual(r.operators.has, []); + assert.equal(r.operators.before, null); + assert.equal(r.operators.after, null); +}); + +test("parseQuery: plain term goes to terms, lowercased", () => { + assert.deepEqual(parseQuery("Auth").terms, ["auth"]); +}); + +test("parseQuery: multiple plain terms preserved (AND set)", () => { + assert.deepEqual(parseQuery("auth bug").terms, ["auth", "bug"]); +}); + +test("parseQuery: project: operator captured", () => { + const r = parseQuery("project:Web"); + assert.deepEqual(r.terms, []); + assert.deepEqual(r.operators.project, ["web"]); +}); + +test("parseQuery: model: and has: operators captured", () => { + const r = parseQuery("model:OPUS has:tool"); + assert.deepEqual(r.operators.model, ["opus"]); + assert.deepEqual(r.operators.has, ["tool"]); +}); + +test("parseQuery: before: parses ISO date", () => { + const r = parseQuery("before:2026-01-15"); + assert.ok(r.operators.before instanceof Date); + 
assert.equal(r.operators.before.toISOString().slice(0, 10), "2026-01-15"); +}); + +test("parseQuery: after: parses ISO date", () => { + const r = parseQuery("after:2026-01-15"); + assert.ok(r.operators.after instanceof Date); +}); + +test("parseQuery: mixed terms and operators", () => { + const r = parseQuery("auth project:web has:tool"); + assert.deepEqual(r.terms, ["auth"]); + assert.deepEqual(r.operators.project, ["web"]); + assert.deepEqual(r.operators.has, ["tool"]); +}); + +test("parseQuery: invalid date falls through to a term", () => { + const r = parseQuery("before:not-a-date"); + assert.deepEqual(r.terms, ["before:not-a-date"]); + assert.equal(r.operators.before, null); +}); + +test("parseQuery: unknown operator stays as a plain term", () => { + assert.deepEqual(parseQuery("foo:bar").terms, ["foo:bar"]); +}); + +test("parseQuery: repeated same operator accumulates", () => { + assert.deepEqual( + parseQuery("project:web project:api").operators.project, + ["web", "api"], + ); +}); + +test("parseQuery: quoted operator value preserves spaces", () => { + assert.deepEqual( + parseQuery('project:"my project"').operators.project, + ["my project"], + ); +}); + +test("parseQuery: leading-colon token treated as term", () => { + assert.deepEqual(parseQuery(":foo").terms, [":foo"]); +}); + +test("parseQuery: operator with empty value falls through to term", () => { + assert.deepEqual(parseQuery("project:").terms, ["project:"]); +}); + +test("matches: empty query matches anything", () => { + assert.equal(matches({ title: "x" }, parseQuery("")), true); + assert.equal(matches({}, parseQuery("")), true); +}); + +test("matches: term in title (case-insensitive)", () => { + assert.equal(matches({ title: "Auth Bug Fix" }, parseQuery("auth")), true); + assert.equal(matches({ title: "" }, parseQuery("auth")), false); +}); + +test("matches: term in snippet, id, cwd, model", () => { + assert.equal(matches({ snippet: "fixing AUTH flow" }, parseQuery("auth")), true); + 
assert.equal(matches({ id: "abc-123-def" }, parseQuery("123")), true); + assert.equal(matches({ cwd: "/home/x/web-app" }, parseQuery("web")), true); + assert.equal(matches({ model: "claude-opus-4-7" }, parseQuery("opus")), true); +}); + +test("matches: term in project display_path or sanitized_name", () => { + const s = { _project: { display_path: "~/code/web-app", sanitized_name: "-home-x-web-app" } }; + assert.equal(matches(s, parseQuery("web")), true); + assert.equal(matches({ _project: { sanitized_name: "my-thing" } }, parseQuery("thing")), true); +}); + +test("matches: multi-term AND across fields", () => { + const s = { title: "Auth fix", snippet: "patches a critical bug" }; + assert.equal(matches(s, parseQuery("auth bug")), true); + assert.equal(matches(s, parseQuery("auth zzz")), false); +}); + +test("matches: project: operator", () => { + const s = { _project: { display_path: "~/web", sanitized_name: "x-web" } }; + assert.equal(matches(s, parseQuery("project:web")), true); + assert.equal(matches(s, parseQuery("project:api")), false); +}); + +test("matches: project: OR within same operator", () => { + const s = { _project: { display_path: "~/web" } }; + assert.equal(matches(s, parseQuery("project:api project:web")), true); +}); + +test("matches: model: operator", () => { + const s = { model: "claude-opus-4-7" }; + assert.equal(matches(s, parseQuery("model:opus")), true); + assert.equal(matches(s, parseQuery("model:sonnet")), false); +}); + +test("matches: has:tool", () => { + assert.equal(matches({ tool_count: 5 }, parseQuery("has:tool")), true); + assert.equal(matches({ tool_count: 0 }, parseQuery("has:tool")), false); + assert.equal(matches({}, parseQuery("has:tool")), false); +}); + +test("matches: has:cache", () => { + assert.equal(matches({ usage: { cache_read: 1000 } }, parseQuery("has:cache")), true); + assert.equal(matches({ usage: { cache_creation: 1 } }, parseQuery("has:cache")), true); + assert.equal(matches({ usage: {} }, 
parseQuery("has:cache")), false); + assert.equal(matches({}, parseQuery("has:cache")), false); +}); + +test("matches: has:model", () => { + assert.equal(matches({ model: "x" }, parseQuery("has:model")), true); + assert.equal(matches({ model: "" }, parseQuery("has:model")), false); +}); + +test("matches: has: AND across different flags", () => { + const s1 = { tool_count: 5, usage: { cache_read: 1 } }; + assert.equal(matches(s1, parseQuery("has:tool has:cache")), true); + const s2 = { tool_count: 0, usage: { cache_read: 1 } }; + assert.equal(matches(s2, parseQuery("has:tool has:cache")), false); +}); + +test("matches: unknown has-flag is permissive (does not filter)", () => { + assert.equal(matches({ title: "x" }, parseQuery("has:flying-cars")), true); +}); + +test("matches: before: filters by last_at", () => { + const s = { last_at: "2026-01-10T00:00:00Z" }; + assert.equal(matches(s, parseQuery("before:2026-01-15")), true); + assert.equal(matches(s, parseQuery("before:2026-01-01")), false); +}); + +test("matches: after: filters by last_at", () => { + const s = { last_at: "2026-02-01T00:00:00Z" }; + assert.equal(matches(s, parseQuery("after:2026-01-15")), true); + assert.equal(matches(s, parseQuery("after:2026-03-01")), false); +}); + +test("matches: session without last_at fails any date filter", () => { + assert.equal(matches({}, parseQuery("before:2026-01-15")), false); + assert.equal(matches({}, parseQuery("after:2026-01-15")), false); +}); + +test("matches: AND across different operators", () => { + const s = { + title: "auth fix", + model: "opus", + _project: { display_path: "~/web" }, + }; + assert.equal(matches(s, parseQuery("auth model:opus project:web")), true); + assert.equal(matches(s, parseQuery("auth model:opus project:api")), false); +}); + +test("matches: graceful on missing fields", () => { + assert.equal(matches({}, parseQuery("foo")), false); +}); + +test("highlight: empty text returns empty string", () => { + assert.equal(highlight("", ["foo"]), 
""); + assert.equal(highlight(null, ["foo"]), ""); + assert.equal(highlight(undefined, ["foo"]), ""); +}); + +test("highlight: empty terms returns escaped text", () => { + assert.equal(highlight("hello & <world>", []), "hello &amp; &lt;world&gt;"); + assert.equal(highlight("plain", null), "plain"); +}); + +test("highlight: wraps single term", () => { + assert.equal(highlight("auth fix", ["auth"]), "<mark>auth</mark> fix"); +}); + +test("highlight: case-insensitive but preserves original case", () => { + assert.equal(highlight("Auth Bug", ["auth"]), "<mark>Auth</mark> Bug"); +}); + +test("highlight: wraps multiple terms", () => { + assert.equal( + highlight("auth and bug", ["auth", "bug"]), + "<mark>auth</mark> and <mark>bug</mark>", + ); +}); + +test("highlight: escapes HTML in non-matched parts", () => { + assert.equal( + highlight("foo