diff --git a/docs/testing/TEST_COVERAGE.md b/docs/testing/TEST_COVERAGE.md
index 9ff65d52..c066c211 100644
--- a/docs/testing/TEST_COVERAGE.md
+++ b/docs/testing/TEST_COVERAGE.md
@@ -51,6 +51,8 @@
 - Type conversion behavior
 - Type precedence in operations
 
+**Nested Structure Coverage**: For operators that collect or pass through document values (e.g. `$push`, `$addToSet`, `$mergeObjects`), test deeply nested arrays of objects that themselves contain arrays (e.g. `{"data": {"users": [{"profile": {"name": "Alice", "scores": [85, 90]}}]}}`) to verify the full structure is preserved without flattening or truncation. Also test array traversal via field paths (e.g. `$a.b` on `{"a": [{"b": 1}, {"b": 2}]}`, which should resolve to `[1, 2]`) to verify that values are collected correctly from arrays of objects.
+
 ---
 
 ### 2. Arithmetic Operator Coverage
@@ -98,6 +100,7 @@ Note: Distinguish between fractional doubles (2.5) and whole-number doubles (3.0
 - **Underflow**: `INT32_MIN` - 1 → long, `INT64_MIN` - 1 → double
 - **Sign handling**: positive, negative, zero
 - **Negative zero**: `DOUBLE_NEGATIVE_ZERO` → verify behavior (some operators normalize to `0.0`, others preserve `-0.0`); `DECIMAL128_NEGATIVE_ZERO` → verify; `NumberDecimal("-0E+N")` and `NumberDecimal("-0E-N")` → verify exponent preservation
+- **Special float values mixed with non-numeric types**: For operators that collect or pass through values (e.g. `$push`, `$addToSet`), test special float values (NaN, Infinity, -Infinity, -0.0) alongside non-numeric types (string, boolean, null, object, array) in the same group to verify all values are preserved without coercion or loss.
 - **Special values**: MinKey, MaxKey combinations
 - **Two's complement asymmetry (single-input operators)**: `INT32_MIN` has no positive int counterpart → must promote to long; `INT64_MIN` has no positive long counterpart → verify overflow/error behavior
 - **Double precision boundaries**: `DOUBLE_NEAR_MAX`, `DOUBLE_MIN_SUBNORMAL`, `DOUBLE_MIN_NEGATIVE_SUBNORMAL`, `DOUBLE_NEAR_MIN`, `DOUBLE_NEGATIVE_ZERO` → use `NUMERIC_DOUBLE` from `test_constants.py`
@@ -362,6 +365,7 @@ For each invalid_type in [string, object, array, ...]:
 - Primary operation on basic input
 - Empty input and non-existent collection both produce correct output without error
 - Works as the sole pipeline stage
+- For `$group`: test a compound `_id` (grouping by multiple fields, e.g. `{"_id": {"region": "$region", "status": "$status"}}`), a nested field path `_id` (e.g. `"_id": "$user.dept"`), and large-scale multi-group aggregation (e.g. 10,000 documents across 100+ groups). Together these verify group partitioning beyond the single-group `_id: null` case, including at scale.
 
 **Parameter Validation**:
 - Test every BSON type against the parameter. Numeric stages (`$limit`, `$skip`, `$sample`) accept int32, int64, whole-number double, whole-number Decimal128. Document stages (`$match`, `$project`, `$group`, `$set`) reject non-documents. String stages (`$count`, `$unwind`) reject non-strings.
@@ -378,6 +382,7 @@ For each invalid_type in [string, object, array, ...]:
 - Multi-stage interaction tests belong in the parent `stages/` directory, not in individual stage folders. Per `FOLDER_STRUCTURE.md`, interactions between same-level features go in the parent folder (e.g., `stages/test_stages_combination_sort.py`, `stages/test_stages_position_match.py`).
 - Test interactions where ordering affects results or where adjacent stages compose non-obviously (e.g., optimization coalescence, count-modifying vs non-count-modifying intervening stages, additive vs min-taking consecutive stages)
 - Cover common multi-stage usage patterns for the stage under test
+- Test multiple `$group` stages in a single pipeline (e.g. `$group` → `$sort` → `$group`), where each stage performs actual grouping (not just `_id: null`). The second `$group` should re-aggregate the output of the first, verifying that accumulated results (arrays, counts, etc.) can be correctly consumed as input to another `$group`.
 
 **Out of Scope**:
 - Cross-cutting concerns (views, capped collections, timeseries) belong in their own directories
@@ -435,6 +440,10 @@ For each invalid_type in [string, object, array, ...]:
   The expression-form of dual-form operators (`$max`, `$min`, `$sum`, `$avg`, `$first`, `$last`, etc.) is tested separately under `tests/core/operator/expressions/accumulator/$op/` and is out of scope for the
   accumulator-form file.
 
+  **Actual Grouping Requirement**: Accumulator tests must not rely exclusively on `_id: null` (single-group). Each test category (null/missing, special values, core behavior, etc.) should include at least one test with actual multi-group grouping (e.g. `_id: "$cat"`) to verify accumulators reset correctly across group boundaries.
+
+  **Multiple Accumulators in Single `$group`**: Test multiple accumulators of the same type in a single `$group` (e.g. `{"$group": {"_id": "$cat", "xs": {"$push": "$x"}, "ys": {"$push": "$y"}}}`) with actual grouping, not just `_id: null`. Each accumulator should collect only from its own field and handle that field's null/missing values without affecting its siblings. This is distinct from sibling-accumulator interaction tests, which combine different accumulator types (e.g. `$push` + `$sum`).
+
   **Expression Error Propagation**:
   Errors raised during sub-expression evaluation propagate through the accumulator without being caught:
   - `{$op: {$divide: [1, "$v"]}}` → `CONVERSION_FAILURE_ERROR` (field violation)
@@ -449,6 +458,10 @@ For each invalid_type in [string, object, array, ...]:
 
   - For order-dependent accumulators, tests asserting a specific result must include a preceding `$sort`. Tests without `$sort` are flaky. (e.g. $first)
   - For order-independent accumulators, the result must be the same regardless of input order. Verify this by running the same input twice with different `$sort` directions and asserting identical results.
+  - Sort coverage must include compound sorts with mixed directions (e.g. `{"$sort": {"priority": 1, "status": -1, "timestamp": 1}}`) and sorts on nested field paths (e.g. `{"$sort": {"user.dept": 1}}`).
+
+  **Large-Scale Result Verification**:
+  When testing accumulators at scale (1000+ documents), verify the actual content of the result, not just its size. Prefer building the expected result programmatically in the test (e.g. `expected=[{"_id": None, "result": list(range(10_000))}]`) over relying on server-side count checks such as `$size` or `$count`. A count-only check can pass even if values are duplicated, dropped, or corrupted. When a full element-by-element expected list is impractical (e.g. multi-group aggregation), fall back to server-side content checks (`$sum`, `$min`, `$max` on the pushed array).
 
   **Tested in Other Folders** (in scope, but add under a different folder):
   - **Host-stage compatibility** — when adding a new accumulator, add one smoke case for each host stage that supports it (`$group`, `$bucket`, `$bucketAuto`, `$setWindowFields`) under that stage's folder