survos/data-bundle centralizes dataset filesystem conventions for
dataset-driven Symfony applications.
Despite the historical name, this bundle is not the owner of shared semantic metadata contracts. It manages where dataset files, provider metadata, Pixie databases, run artifacts, cache files, and related JSONL outputs live.
For shared vocabulary and typed metadata contracts, use
survos/data-contracts.
This bundle provides:
- `DataPaths`: root-level path resolution under `APP_DATA_DIR`
- `DatasetPaths`: dataset-scoped path helpers
- dataset metadata loading and ensuring
- `DatasetInfo` / `Provider` registry entities
- provider snapshot encoding
- dataset context helpers for console/import workflows
- commands for browsing, diagnosing, and resolving dataset paths
This bundle does not provide:
- Dublin Core vocabulary constants
- collection-object DTO contracts
- metadata claim storage
- AI workflow execution
- media upload, IIIF, or mediary publishing
- import/normalize/profile logic
Package boundaries:
- `survos/data-contracts`: shared metadata vocabulary and DTO contracts.
- `survos/data-bundle`: dataset paths, provider storage, and dataset registry.
- `survos/import-bundle`: import/convert workflows that may ask this bundle for dataset paths.
- `survos/ai-workflow-bundle`: task execution in apps that own subject context.
- claims bundle: tracked metadata assertions with provenance and confidence.
- `survos/media-bundle`: media identity and mediary publishing.
The dependency direction should stay honest: packages should require
survos/data-contracts directly when they only need DcTerms, ContentType,
or metadata DTOs. Do not require this bundle just to get vocabulary classes.
All dataset work lives under a single root directory:

    APP_DATA_DIR=/absolute/path/to/data/root

The bundle avoids repository-relative paths and gives services and commands one place to ask for canonical locations.
Placement is decided by one rule: can I regenerate it from another tier + code?
| Tier | Holds | Backed up / shipped? |
|---|---|---|
| `vault/` | acquired source + AI claims + `_vocab/` reference | yes: durable, mirror of HF/S3 |
| `cache/` | bulky re-fetchable materializations (clones, firehose, unzipped) | no |
| `work/` | pipeline output, disposable (`rm -rf work/<p>/<c>` is always safe) | no |
| `folio/` | built `.folio` databases | no (rebuilt from work) |

    $APP_DATA_DIR/
      vault/
        <provider>/<code>/          # acquired source files
          ai/claims.jsonl           # AI claims (expensive → durable, never in work)
        _vocab/                     # global reference vocab (non-provider → _)
      cache/<provider>/...          # disposable, re-fetchable
      work/<provider>/<code>/
        _meta/dataset.json          # config (portal from code)
        _raw/                       # source view (portal → vault; often a symlink)
        norm/                       # normalized cores + term/termSet + link/linkType
        voc/                        # extracted vocab (feeds AI content_type mapping)
        trans/                      # translations
        _folio/                     # assembled folio-input (portal → folio tier)
      folio/<provider>/<code>.folio
Work-tree stage directory names have no numeric prefixes and sort in pipeline
order; _-prefixed dirs are tier portals (config / vault / folio), often symlinks,
not computed stages.
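
The tier layout above amounts to plain path composition under one absolute root. The following is a minimal sketch of that idea, not the bundle's API: the `TierLayout` class and its method names are hypothetical, `/srv/data` is a made-up root, and the absolute-path check assumes POSIX-style paths.

```php
<?php
declare(strict_types=1);

// Hypothetical helper for illustration only; not part of survos/data-bundle.
final class TierLayout
{
    private readonly string $root;

    public function __construct(string $root)
    {
        // Assumption: like APP_DATA_DIR, the root must be an absolute POSIX path.
        if (!str_starts_with($root, '/')) {
            throw new \InvalidArgumentException('Root must be an absolute path.');
        }
        $this->root = rtrim($root, '/');
    }

    // Durable tier: acquired source files for one dataset.
    public function vaultDir(string $provider, string $code): string
    {
        return sprintf('%s/vault/%s/%s', $this->root, $provider, $code);
    }

    // Disposable tier: pipeline output for one dataset.
    public function workDir(string $provider, string $code): string
    {
        return sprintf('%s/work/%s/%s', $this->root, $provider, $code);
    }

    // Built tier: the .folio database rebuilt from work.
    public function folioFile(string $provider, string $code): string
    {
        return sprintf('%s/folio/%s/%s.folio', $this->root, $provider, $code);
    }
}

$layout = new TierLayout('/srv/data');
echo $layout->workDir('dc', 'tb09jw350'), PHP_EOL;   // /srv/data/work/dc/tb09jw350
echo $layout->folioFile('dc', 'tb09jw350'), PHP_EOL; // /srv/data/folio/dc/tb09jw350.folio
```

In an application, the real `DataPaths` service should be used instead, so that directory names stay defined in one place.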
Survos\DatasetBundle\Enum\Stage owns stage identity and dir names. The backed value
is the stable semantic key (events, import:convert --stage); Stage::dir() is the only
place directory names live; Stage::fromKey() is the fail-loud string boundary (unknown →
throws). Reference Stage cases in code — do not pass raw stage strings.
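
As a rough illustration of the pattern just described (a backed enum whose value is the semantic key, a single `dir()` mapping, and a fail-loud `fromKey()`), consider the sketch below. It is not the bundle's actual `Stage` enum: the name `StageSketch`, the cases `Vocab` and `Translate`, and all three backed values are assumptions. Only the directory names `norm`, `voc`, and `trans` come from the layout above.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the Stage enum; cases and backed values are assumed.
enum StageSketch: string
{
    case Normalize = 'normalize';
    case Vocab = 'vocab';
    case Translate = 'translate';

    // The only place directory names are defined.
    public function dir(): string
    {
        return match ($this) {
            self::Normalize => 'norm',
            self::Vocab => 'voc',
            self::Translate => 'trans',
        };
    }

    // String boundary for CLI options or event payloads: unknown keys throw.
    public static function fromKey(string $key): self
    {
        return self::tryFrom($key)
            ?? throw new \InvalidArgumentException(sprintf('Unknown stage key "%s".', $key));
    }
}

echo StageSketch::fromKey('normalize')->dir(), PHP_EOL; // norm
```

Keeping the backed value separate from the directory name means a directory can be renamed without changing the keys used in events or command options.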
For example, resolving the directory for the normalize stage:

    $paths->stageDir('dc/tb09jw350', Stage::Normalize); // .../norm

Install the bundle:

    composer require survos/data-bundle

Set the root directory:

    export APP_DATA_DIR=/absolute/path/to/data/root

Inject `DataPaths` for root and dataset path resolution:

    use Survos\DataBundle\Service\DataPaths;

    final class SomeService
    {
        public function __construct(
            private readonly DataPaths $paths,
        ) {
        }
    }

Common dataset paths:

    $paths->datasetDir('dc/tb09jw350');
    $paths->extractDir('dc/tb09jw350');
    $paths->extractFile('dc/tb09jw350');
    $paths->normalizeDir('dc/tb09jw350');
    $paths->normalizeFile('dc/tb09jw350');
    $paths->profileDir('dc/tb09jw350');
    $paths->profileFile('dc/tb09jw350');
    $paths->termsDir('dc/tb09jw350');

Pixie paths:

    $paths->pixieTenantDb('larco');

Operational directories:

    $paths->runsDir;
    $paths->cacheDir;

Current command names retain the historical `data:*` prefix:

    bin/console data:path dc/tb09jw350 20_normalize
    bin/console data:head dc/tb09jw350 20_normalize --limit=5
    bin/console data:diag dc/tb09jw350
    bin/console data:browse
    bin/console data:scan-datasets

These may eventually move to `dataset:*` aliases when the bundle is renamed.

Ensure global roots exist:

    $paths->ensureRootDirs();

Ensure standard dataset stage directories exist:

    $paths->ensureDatasetDirs('dc/tb09jw350');

For small metadata files:

    $paths->atomicWrite($path, $contents);

The write uses a temporary file in the same directory followed by an atomic rename.
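
The temp-file-then-rename technique can be sketched as follows. This is an illustration of the general approach, not the bundle's implementation: `atomicWriteSketch` is a hypothetical function, and the atomicity of `rename()` is assumed for a POSIX filesystem where the temporary file and the target share a directory.

```php
<?php
declare(strict_types=1);

// Illustrative only; the bundle's atomicWrite() may be implemented differently.
function atomicWriteSketch(string $path, string $contents): void
{
    $dir = dirname($path);

    // Create the temp file next to the target so rename() stays on one filesystem.
    $tmp = tempnam($dir, '.tmp-');
    if ($tmp === false) {
        throw new \RuntimeException(sprintf('Cannot create a temporary file in "%s".', $dir));
    }

    try {
        if (file_put_contents($tmp, $contents) === false) {
            throw new \RuntimeException(sprintf('Cannot write to "%s".', $tmp));
        }
        // Readers see either the old file or the new one, never a partial write.
        if (!rename($tmp, $path)) {
            throw new \RuntimeException(sprintf('Cannot rename "%s" to "%s".', $tmp, $path));
        }
    } finally {
        // If the rename did not happen, remove the leftover temp file.
        if (is_file($tmp)) {
            @unlink($tmp);
        }
    }
}

$target = sys_get_temp_dir() . '/dataset-sketch.json';
atomicWriteSketch($target, '{"code":"tb09jw350"}');
echo file_get_contents($target), PHP_EOL; // {"code":"tb09jw350"}
```

Note that `tempnam()` creates the file with mode 0600, so an implementation that needs group or world readability would have to `chmod()` the temporary file before renaming it.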
Guiding principles:
- Dataset path conventions are centralized.
- Paths are semantic, not stringly typed.
- Dataset/provider storage concerns stay separate from semantic metadata contracts.
- Import, AI workflow, claims, and media publishing remain in their own packages.
- The bundle should stay boring and infrastructure-focused.
The better long-term name is survos/dataset-bundle. See
docs/rename-to-dataset-bundle.md.