[Ops]: Add Chrysalis ingestion wrapper#169

Draft
tomvothecoder wants to merge 6 commits into E3SM-Project:main from tomvothecoder:feature/154-ingestion-sites

@tomvothecoder

Collaborator

Description

This adds a scheduler-agnostic HPC archive ingestor entrypoint and a thin Chrysalis site wrapper for existing Jenkins-driven metadata ingestion.

Checklist

  • Code follows project style guidelines
  • Self-reviewed code
  • No new warnings
  • Tests added or updated (if needed)
  • All tests pass (locally and CI/CD)
  • Documentation/comments updated (if needed)
  • Breaking change noted (if applicable)

Deployment Notes (if any)

No special deployment steps.

Local validation is currently blocked because PostgreSQL was unavailable at 127.0.0.1, so make backend-test and the targeted ingestion test file could not complete in this environment.

@tomvothecoder tomvothecoder requested a review from TonyB9000 May 6, 2026 22:33
@TonyB9000

Collaborator

@tomvothecoder Once I am clear on the boundaries of the term "NERSC ingestion wrapper", I should be able to comprehend "Chrysalis ingestion wrapper". Does the term "scheduler-agnostic" refer to Jenkins? (I always considered cron to be universal...).

@TonyB9000 TonyB9000 left a comment

Collaborator

Configures a call to the hpc_archive_ingestor. Understandable.

What process sets "SIMBOARD_API_BASE_URL" and "SIMBOARD_API_TOKEN"?

@TonyB9000

Collaborator

Hmmmm. The "Tom Requested your review" took me to the page with 7 files to examine, each with a "submit-review" option. As soon as I completed the first one, all 7 vanished...

@tomvothecoder

tomvothecoder commented May 13, 2026

Collaborator Author

> Configures a call to the hpc_archive_ingestor. Understandable.
>
> What process sets "SIMBOARD_API_BASE_URL" and "SIMBOARD_API_TOKEN"?

> Hmmmm. The "Tom Requested your review" took me to the page with 7 files to examine, each with a "submit-review" option. As soon as I completed the first one, all 7 vanished...

Accidentally tagged you for review. I meant to assign this PR to you. It is fixed now.

@tomvothecoder tomvothecoder added the type: enhancement New feature or request label May 13, 2026
@TonyB9000

Collaborator

@tomvothecoder "Accidentally tagged you for review". OK (I think colleges should offer a master's program in GitHub).

@tomvothecoder

Collaborator Author

Chrysalis and other non-NERSC sites require upload-based ingestion rather than path-based ingestion, so follow-up work is tracked in #207 for a state-first HPC upload flow with DB-backed dedupe parity.

@TonyB9000

Collaborator

Using the "upload-based" vs. "path-based" terminology, my thought was that when the NERSC upload-receiving system is delivered an upload from a non-NERSC system, it could open it in the existing NERSC PA directory under (say) "From_chrysalis/<new_exec_ids>" and then process it with the existing "path-based" code, assuming PACE would not interfere with it (and vice versa). On second thought, to avoid any crossover with PACE, it would be best to open it in a separate "PACE-unaware" directory.

@TonyB9000

TonyB9000 commented Jun 4, 2026

Collaborator

@tomvothecoder I am preparing to exercise "hpc_upload_archive_ingestor.py" on Chrysalis, to see the logs and flow (in dry-run) in action, discover parameter faults, etc.

QUESTION: Although the backend ingestion on NERSC is "path-based" (it returns paths for ingestion), it could in principle run the "https-transfer-based" code just as easily. I might try a dry run on NERSC/Perlmutter first, since that configuration is already a known item. Then differences in behavior on Chrysalis would stand out. Does that make sense?

@tomvothecoder tomvothecoder force-pushed the feature/154-ingestion-sites branch from 22a1a88 to feae197 Compare June 4, 2026 20:26
@TonyB9000

TonyB9000 commented Jun 4, 2026

Collaborator

@tomvothecoder Apologies if I'm doing this wrong.

I attempted to test the "hpc_upload" on NERSC, thinking "--help" might be helpful. To get started, I needed an environment where I could install things, so:

After

    python3.11 -m venv ~/envs/test_simboard
    source ~/envs/test_simboard/bin/activate
    python3.11 -m pip install --upgrade pip

    python3.11 -m pip install python-dateutil
    pip install pydantic
    pip install fastapi_users
 
The (bash script) commands:

    REPO_ROOT="/global/homes/t/tonyb/gitrepo/simboard/backend"
    SCRIPT="$REPO_ROOT/app/scripts/ingestion/hpc_upload_archive_ingestor.py"

    # Export PYTHONPATH so the Python process can import the backend package.
    export PYTHONPATH="$REPO_ROOT"
    python3.11 "$SCRIPT" --help

Produces the following output:

2026-06-04 15:02:26,464 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-04T22:02:26.464377+00:00 event=run_started archive_root=/performance_archive mode=ingest
2026-06-04 15:02:26,464 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-04T22:02:26.464638+00:00 event=startup_configuration_begin
2026-06-04 15:02:26,464 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-04T22:02:26.464749+00:00 event=summary_table row_count=10 rows="api.api_base_url=http://backend:8000 | api.endpoint_url=http://backend:8000/api/v1/ingestions/from-hpc-upload | api.state_endpoint_url=http://backend:8000/api/v1/ingestions/state | paths.archive_root=/performance_archive | runtime.machine_name=perlmutter | runtime.dry_run=false | runtime.max_cases_per_run=null | runtime.max_attempts=3 | runtime.request_timeout_seconds=60 | auth.has_api_token=false" title=startup_configuration
2026-06-04 15:02:26,464 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-04T22:02:26.464819+00:00 event=startup_configuration_end
2026-06-04 15:02:26,464 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-04T22:02:26.464876+00:00 event=archive_root_missing archive_root=/performance_archive
2026-06-04 15:02:26,464 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-04T22:02:26.464929+00:00 event=run_finished duration_seconds=0.001 exit_code=1 mode=ingest

I now see there is no command-line parsing. I need to set “dry_run” as an environment variable so that the auto-generated config will pick it up. I must have missed where the docs explain setting the environment variables. I assume I can set them in my “run_script”.

Comment thread backend/app/scripts/ingestion/sites/chrysalis.sh Outdated
@tomvothecoder

tomvothecoder commented Jun 5, 2026

Collaborator Author

Hey Tony, happy to help and no apologies needed.

> I attempted to test the "hpc_upload" on NERSC, thinking "--help" might be helpful. To get started, I needed an environment where I could install things, so:
>
> After
>
>     python3.11 -m venv ~/envs/test_simboard
>     source ~/envs/test_simboard/bin/activate
>     python3.11 -m pip install --upgrade pip
>
>     python3.11 -m pip install python-dateutil
>     pip install pydantic
>     pip install fastapi_users

SimBoard defines the Python backend dependencies in pyproject.toml and uses uv for dependency management.

You can run make backend-install if you only need a Python env (source).

> The (bash script) commands:
>
>     REPO_ROOT="/global/homes/t/tonyb/gitrepo/simboard/backend"
>     SCRIPT="$REPO_ROOT/app/scripts/ingestion/hpc_upload_archive_ingestor.py"
>
>     export PYTHONPATH="$REPO_ROOT"
>     python3.11 "$SCRIPT" --help

> I now see there is no command-line parsing. I need to set “dry_run” as an environment variable so that the auto-generated config will pick it up. I must have missed where the docs explain setting the environment variables. I assume I can set them in my “run_script”.

I'd check out this branch now that I've rebased it on the latest main commit.

The chrysalis.sh bash script exports environment variables and wraps hpc_upload_archive_ingestor.py. You can try experimenting with that script. More info here: https://github.com/tomvothecoder/simboard/tree/feature/154-ingestion-sites/backend/app/scripts#hpc-upload-archive-ingestor.
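
To illustrate the environment-driven configuration described above, here is a minimal Python sketch. The variable names are the ones that appear in this thread, and the defaults mirror the values seen in the startup log; the `IngestorConfig` class, the `parse_bool` helper, and `build_config` itself are assumptions for illustration and may not match the real `_build_config` function.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class IngestorConfig:  # hypothetical container, not the real config class
    api_base_url: str
    api_token: Optional[str]
    machine_name: str
    archive_root: str
    dry_run: bool


def parse_bool(value: str) -> bool:
    """Interpret common truthy spellings such as 'true', 'True', or '1'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def build_config(env: Mapping[str, str] = os.environ) -> IngestorConfig:
    """Build a config from environment variables (defaults are assumptions)."""
    return IngestorConfig(
        api_base_url=env.get("SIMBOARD_API_BASE_URL", "http://backend:8000").strip(),
        api_token=env.get("SIMBOARD_API_TOKEN"),
        machine_name=env.get("MACHINE_NAME", "perlmutter"),
        archive_root=env.get("PERF_ARCHIVE_ROOT", "/performance_archive"),
        dry_run=parse_bool(env.get("DRY_RUN", "false")),
    )
```

With this pattern, `--help` or any other flag is simply ignored, which matches the behavior in the log: the run starts with defaults unless the variables are exported first.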

@TonyB9000

Collaborator

@tomvothecoder I will get the latest stuff, but I have made progress. My latest run_script (NERSC dry_run test) says:

REPO_ROOT="/global/homes/t/tonyb/gitrepo/simboard/backend"
WORKDIR="/global/homes/t/tonyb/test/simboard"
SCRIPT="$REPO_ROOT/app/scripts/ingestion/hpc_upload_archive_ingestor.py"

export PYTHONPATH="$REPO_ROOT"
export DRY_RUN=True
python3.11 "$SCRIPT"

The output indicates that I am missing "archive_root" and "has_api_token".

By examining the "nersc" "_build_config" function, I can see what variables exist to push into the environment.

I'll checkout branch #169 on both NERSC and Chrysalis to do comparisons in outputs.

@TonyB9000

Collaborator

@tomvothecoder git gets me again:

You wrote: "I'd checkout this branch now that I've rebased it on the latest main commit."

is "this branch" off of main, as you had advised? Or is it off of a fork??

((test_simboard) ) (base) [ac.bartoletti1@chrlogin1 simboard]$ git branch -a

    * main
      remotes/origin/HEAD -> origin/main
      remotes/origin/copilot/analyze-simboard-devops-issues
      remotes/origin/copilot/check-copilot-agent-tokens
      remotes/origin/copilot/enhance-simulation-details-page
      remotes/origin/copilot/fix-hpc-filepaths-issue
      remotes/origin/dev-ai
      remotes/origin/fix/181-archive-path-substitution
      remotes/origin/main

When I get too confused, I do a clean "git clone". Then I can do one of these:

To pull a remote branch down from remote:

    git fetch --all --prune
    git checkout -b newbranchname origin/newbranchname

To check out a remote branch that was pushed but not merged to main/master:

    git fetch origin <the_remote_branch_name>
    git checkout -b <any_new_local_name> origin/<the_remote_branch_name>

To fetch a branch from a remote fork (example):

    git remote add tomvothecoder https://github.com/tomvothecoder/simboard.git
    git fetch tomvothecoder
    git checkout -b feature/154-ingestion-sites tomvothecoder/feature/154-ingestion-sites

Which is appropriate in this case?

@TonyB9000

Collaborator

@tomvothecoder If I pull down a branch off of someone's fork, am I in that fork, or can I pull that into a new branch of my local main? The persistence of branches and forks, between local and remote, is a bit of a mystery.

@tomvothecoder

Collaborator Author

> @tomvothecoder If I pull down a branch off of someone's fork, am I in that fork, or can I pull that into a new branch of my local main? The persistence of branches and forks, between local and remote, is a bit of a mystery.

  • Upstream -> E3SM-Project/simboard
  • Fork -> tomvothecoder/simboard

This branch (tomvothecoder:feature/154-ingestion-sites) is on my fork (tomvothecoder/simboard), not on upstream (E3SM-Project/simboard). You need to add my fork as a git remote before you can check out branches from it.

Something like this (I did not verify correctness):

    git remote add tomvothecoder https://github.com/tomvothecoder/simboard
    git fetch tomvothecoder
    git checkout -b feature/154-ingestion-sites tomvothecoder/feature/154-ingestion-sites

I usually work directly on upstream and not on a fork when possible, but in this case I use a fork for separate testing purposes.

@TonyB9000

Collaborator

@tomvothecoder Sorry, I guess I must pull from your fork.

Quick test: I cd to "/home/ac.bartoletti1/gitrepo/simboard/backend" and issue

python3.12 -m app.scripts.ingestion.nersc_archive_ingestor --api-base-url http://backend:8000 --machine-name chrysalis

The result:

2026-06-05 16:23:29,964 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-05T21:23:29.964677+00:00 event=run_started archive_root=/performance_archive mode=ingest
2026-06-05 16:23:29,964 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-05T21:23:29.964921+00:00 event=startup_configuration_begin
2026-06-05 16:23:29,965 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-05T21:23:29.965032+00:00 event=summary_table row_count=10 rows="api.api_base_url=_fake_url_ | api.endpoint_url=_fake_url_/api/v1/ingestions/from-path | api.state_endpoint_url=_fake_url_/api/v1/ingestions/state | paths.archive_root=/performance_archive | runtime.machine_name=perlmutter | runtime.dry_run=false | runtime.max_cases_per_run=null | runtime.max_attempts=3 | runtime.request_timeout_seconds=60 | auth.has_api_token=true" title=startup_configuration
2026-06-05 16:23:29,965 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-05T21:23:29.965082+00:00 event=startup_configuration_end
2026-06-05 16:23:29,965 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-05T21:23:29.965118+00:00 event=archive_root_missing archive_root=/performance_archive
2026-06-05 16:23:29,965 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-05T21:23:29.965168+00:00 event=run_finished duration_seconds=0.0 exit_code=1 mode=ingest

@TonyB9000

Collaborator

@tomvothecoder The line in "chrysalis.sh"

script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"; echo "$script_dir"
(prints "/home/ac.bartoletti1/test/simboard")

clearly won't work for defining "backend_root" as backend_root="$(cd "${script_dir}/../../../.." && pwd)"

I will modify chrysalis.sh to provide a "backend_root" that does not depend on the user's location, at least for test purposes.
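
To make the failure concrete, here is a small Python sketch of the same "four levels up" arithmetic. It assumes the wrapper normally lives at backend/app/scripts/ingestion/sites/chrysalis.sh and that the run above used a copy placed in ~/test/simboard; `backend_root_from` is a hypothetical helper, and it ignores symlinks.

```python
from pathlib import PurePosixPath


def backend_root_from(script_path: str) -> str:
    """Mirror backend_root="$(cd "${script_dir}/../../../.." && pwd)" for an absolute path."""
    script_dir = PurePosixPath(script_path).parent
    # Four levels up from sites/: ingestion -> scripts -> app -> backend
    return str(script_dir.parents[3])


in_repo = "/home/ac.bartoletti1/gitrepo/simboard/backend/app/scripts/ingestion/sites/chrysalis.sh"
copied = "/home/ac.bartoletti1/test/simboard/chrysalis.sh"  # assumed location of the copy

print(backend_root_from(in_repo))  # resolves to the backend directory
print(backend_root_from(copied))   # walks up to the filesystem root instead
```

The relative lookup only works while the script stays at its original depth inside the repository, which is why a hard-coded or overridable root is handy for testing.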

@TonyB9000

Collaborator

@tomvothecoder Works better now that it can find "backend/app".

When I use this for "chrysalis.sh":

GITREPO="/home/ac.bartoletti1/gitrepo"

: "${SIMBOARD_API_BASE_URL:?SIMBOARD_API_BASE_URL is required}"
: "${SIMBOARD_API_TOKEN:?SIMBOARD_API_TOKEN is required}"

export MACHINE_NAME="${MACHINE_NAME:-chrysalis}"
export PERF_ARCHIVE_ROOT="${PERF_ARCHIVE_ROOT:-/lcrc/group/e3sm/PERF_Chrysalis/performance_archive}"
export STATE_PATH="${STATE_PATH:-${PERF_ARCHIVE_ROOT}/../simboard-ingestion-state.json}"
export DRY_RUN="${DRY_RUN:-true}"

backend_root="$GITREPO/simboard/backend"
python_bin="${PYTHON_BIN:-python}"

cd "${backend_root}"
exec "${python_bin}" -m app.scripts.ingestion.hpc_upload_archive_ingestor

and issue these exports:

export SIMBOARD_API_BASE_URL=" http://backend:8000"
export SIMBOARD_API_TOKEN="_fake_token_"

I get:

2026-06-05 17:17:50,763 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-05T22:17:50.763649+00:00 event=run_started archive_root=/lcrc/group/e3sm/PERF_Chrysalis/performance_archive mode=dry-run
2026-06-05 17:17:50,763 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-05T22:17:50.763967+00:00 event=startup_configuration_begin
2026-06-05 17:17:50,764 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-05T22:17:50.764094+00:00 event=summary_table row_count=10 rows="api.api_base_url=\" http://backend:8000\" | api.endpoint_url=\" http://backend:8000/api/v1/ingestions/from-hpc-upload\" | api.state_endpoint_url=\" http://backend:8000/api/v1/ingestions/state\" | paths.archive_root=/lcrc/group/e3sm/PERF_Chrysalis/performance_archive | runtime.machine_name=chrysalis | runtime.dry_run=true | runtime.max_cases_per_run=null | runtime.max_attempts=3 | runtime.request_timeout_seconds=60 | auth.has_api_token=true" title=startup_configuration
2026-06-05 17:17:50,764 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-05T22:17:50.764155+00:00 event=startup_configuration_end
2026-06-05 17:17:50,816 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-05T22:17:50.816097+00:00 event=state_fetch_failed error="URL error: [Errno -2] Name or service not known" machine_name=chrysalis status_code=null
2026-06-05 17:17:50,816 [INFO]: nersc_archive_ingestor.py(_log_event:1344) >> ts=2026-06-05T22:17:50.816211+00:00 event=run_finished duration_seconds=0.053 exit_code=1 mode=dry-run

I guess, even "dry_run" requires real URLs and API_tokens. That is because we need "state" up front.
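
One detail in the log above: "[Errno -2] Name or service not known" means the host name could not be resolved, and the logged base URL also carries a leading space from the export. A minimal Python sketch of a pre-flight check follows; `normalize_base_url` is a hypothetical helper, not something the ingestor is known to provide.

```python
from urllib.parse import urlparse


def normalize_base_url(raw: str) -> str:
    """Trim stray whitespace and a trailing slash, then require an http(s) URL with a host."""
    cleaned = raw.strip().rstrip("/")
    parsed = urlparse(cleaned)
    if parsed.scheme not in {"http", "https"} or not parsed.netloc:
        raise ValueError(f"not a usable base URL: {raw!r}")
    return cleaned


# The value exported in this test run, including its leading space.
print(normalize_base_url(" http://backend:8000"))
```

This only catches malformed values. A well-formed URL whose host is not resolvable from the login node would still fail at request time.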

@TonyB9000

TonyB9000 commented Jun 5, 2026

Collaborator

Hi @tomvothecoder The document also says:

One-case-per-request rule:

  • Each upload request contains exactly one case directory.
  • case_path is sent alongside the archive and becomes the stable dedupe key in the ingestion audit table.
  • Browser/manual uploads still use /api/v1/ingestions/from-upload; this runner does not call that endpoint.

The term "alongside the archive" is a bit ambiguous. Would this be accurate?

  • Each upload request contains exactly one case directory, and one or more newly-completed jlid archives.
  • case_path is sent alongside the archives, and (case_id + jlid) becomes the stable dedupe key in the ingestion audit table.
  • Browser/manual uploads still use /api/v1/ingestions/from-upload; this runner does not call that endpoint.

Or am I misunderstanding the intent?

@tomvothecoder

tomvothecoder commented Jun 9, 2026

Collaborator Author

> I guess, even "dry_run" requires real URLs and API_tokens. That is because we need "state" up front.

Great to see the progress!

Yes, the dry run needs to query the SimBoard database via the REST API. I will send the API_TOKEN over encrypted email to you.

> Hi @tomvothecoder The document also says:
>
> One-case-per-request rule:
>
>   • Each upload request contains exactly one case directory.
>   • case_path is sent alongside the archive and becomes the stable dedupe key in the ingestion audit table.
>   • Browser/manual uploads still use /api/v1/ingestions/from-upload; this runner does not call that endpoint.
>
> The term "alongside the archive" is a bit ambiguous. Would this be accurate?
>
>   • Each upload request contains exactly one case directory, and one or more newly-completed jlid archives.
>   • case_path is sent alongside the archives, and (case_id + jlid) becomes the stable dedupe key in the ingestion audit table.
>   • Browser/manual uploads still use /api/v1/ingestions/from-upload; this runner does not call that endpoint.
>
> Or am I misunderstanding the intent?

Your info sounds more accurate, thanks for the suggestion. Can you point me to the source document with this info? I will update it.

@TonyB9000

TonyB9000 commented Jun 9, 2026

Collaborator

@tomvothecoder Running "chrysalis.sh" with the full (DRY_RUN) parameters yielded the following summary (folded for readability):

event=summary_table
    row_count=9
    rows="mode=dry-run 
        | discovered_cases=746 
        | candidate_cases=746 
        | execution_dirs_scanned=3155 
        | execution_dirs_accepted=1649 
        | skipped_incomplete=1506
        | skipped_invalid=0 
        | candidate_logs_emitted=20 
        | candidate_logs_suppressed=726" 
    title=dry_run_summary
event=run_finished
    duration_seconds=374.342
    exit_code=0
    mode=dry-run

Questions that arise:

  • What distinguishes "discovered cases" from "candidate cases"?
  • What distinguishes "skipped_incomplete" from "skipped_invalid"?
  • Where is "skipped_already_accepted = 0"? Perhaps this test is unrealistic, as no "state" of previously accepted submissions exists.
  • Why is there no count of "state" returned from the database? Was the query restricted to Chrysalis-only? Why is the DB query not indicated?
  • What is "candidate_logs_emitted/suppressed"?

Observation: The bulk of the work getting to this point involved setting the right environment variables and creating an environment where miscellaneous modules like "dateutil" could be installed. On Chrysalis, I performed

    python3.12 -m venv ~/envs/test_simboard
    source ~/envs/test_simboard/bin/activate
    python3.12 -m pip install --upgrade pip

    python3.12 -m pip install python-dateutil
    pip install pydantic
    pip install fastapi_users

On NERSC/Perlmutter, I simply replaced "python3.12" with "python3.11". I intend to perform the same test on Perlmutter, just to exercise the mechanisms of networking.

@TonyB9000

Collaborator

@tomvothecoder For comparison, running the equivalent commands on Perlmutter (swapping out parameters where necessary), we obtain the summary:

event=summary_table
    row_count=9 
    rows="mode=dry-run 
        | discovered_cases=1289 
        | candidate_cases=1289 
        | execution_dirs_scanned=2514 
        | execution_dirs_accepted=1648 
        | skipped_incomplete=866 
        | skipped_invalid=0
        | candidate_logs_emitted=20
        | candidate_logs_suppressed=1269"
    title=dry_run_summary
event=run_finished
    duration_seconds=23.689
    exit_code=0
    mode=dry-run
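
The two folded summaries can also be compared programmatically. This is a minimal Python sketch, assuming the `rows` value keeps the `key=value | key=value` layout shown above; `parse_rows` and `is_consistent` are hypothetical helpers, not part of the ingestor, and the sums they check are simply the ones both reported runs happen to satisfy.

```python
def parse_rows(rows: str) -> dict:
    """Split 'key=value | key=value' text into a dict, converting digit-only values to int."""
    parsed = {}
    for field in rows.split("|"):
        key, _, value = field.strip().partition("=")
        parsed[key] = int(value) if value.isdigit() else value
    return parsed


# Values copied from the two dry-run summaries in this thread.
chrysalis = parse_rows(
    "mode=dry-run | discovered_cases=746 | candidate_cases=746 | "
    "execution_dirs_scanned=3155 | execution_dirs_accepted=1649 | "
    "skipped_incomplete=1506 | skipped_invalid=0 | "
    "candidate_logs_emitted=20 | candidate_logs_suppressed=726"
)
perlmutter = parse_rows(
    "mode=dry-run | discovered_cases=1289 | candidate_cases=1289 | "
    "execution_dirs_scanned=2514 | execution_dirs_accepted=1648 | "
    "skipped_incomplete=866 | skipped_invalid=0 | "
    "candidate_logs_emitted=20 | candidate_logs_suppressed=1269"
)


def is_consistent(summary: dict) -> bool:
    """Check that the counters add up the way both reported runs do."""
    dirs_ok = summary["execution_dirs_scanned"] == (
        summary["execution_dirs_accepted"]
        + summary["skipped_incomplete"]
        + summary["skipped_invalid"]
    )
    logs_ok = summary["candidate_cases"] == (
        summary["candidate_logs_emitted"] + summary["candidate_logs_suppressed"]
    )
    return dirs_ok and logs_ok


for name, summary in (("chrysalis", chrysalis), ("perlmutter", perlmutter)):
    print(name, is_consistent(summary))
```

Both runs pass: scanned directories equal accepted plus skipped (3155 = 1649 + 1506 + 0 and 2514 = 1648 + 866 + 0), and candidate cases equal emitted plus suppressed log entries (746 = 20 + 726 and 1289 = 20 + 1269).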

I suppose I should re-run the chrysalis test, using "OLD_PERF" as the root_PA directory. It is HUGE.

@tomvothecoder

tomvothecoder commented Jun 9, 2026

Collaborator Author

> What distinguishes "discovered cases" from "candidate cases"?

discovered_cases are everything the archive scan finds that looks like a case.
candidate_cases are the subset that SimBoard does not already know about and may ingest.

Since this is a first-time dry-run on the Chrysalis performance_archive directory, it is expected that discovered_cases and candidate_cases are the same.

> What distinguishes "skipped_incomplete" from "skipped_invalid"?

skipped_incomplete means required metadata was missing.
skipped_invalid means the metadata or path looked wrong, unreadable, or unusable.

> Where is "skipped_already_accepted = 0"? Perhaps this test is unrealistic, as no "state" of previous accepted submissions exists.

That exact counter is not in nersc_archive_ingestor.py. The script checks existing SimBoard ingestion state and filters out already-known execution IDs, but it does not use the term “accepted” or expose a skipped_already_accepted count.

So yes: a test expecting that exact field is probably unrealistic or stale.

> Why is there no count of "state" returned from the database? Was the query restricted to Chrysalis-only? Why is the DB query not indicated?

The ingestor script only asks SimBoard for enough existing ingestion state to decide which archive cases and their executions are new and may be candidates for ingestion. It does not fetch, return, or summarize the full database state. It also does not show the database query because the query is behind the SimBoard API, not inside the ingestor script. So this is not a Chrysalis-specific DB query in the ingestor. It is an API request filtered by the configured machine_name.

If more detail is needed, the API response or ingestor summary would need to be expanded to include counts like total known cases, known execution IDs, skipped known cases, and machine filter used.
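
As a rough illustration of that filtering and of the extra counter being discussed, here is a Python sketch. It assumes the API state boils down to a set of known execution IDs for the configured machine; `select_candidates` and the `skipped_known` count are hypothetical and do not exist in the ingestor.

```python
from typing import Dict, List, Set, Tuple


def select_candidates(
    discovered: Dict[str, List[str]],  # case_path -> execution IDs found on disk
    known_execution_ids: Set[str],     # IDs already recorded for this machine
) -> Tuple[Dict[str, List[str]], int]:
    """Keep only executions that are not already known, and count the ones skipped."""
    candidates: Dict[str, List[str]] = {}
    skipped_known = 0
    for case_path, execution_ids in discovered.items():
        new_ids = [eid for eid in execution_ids if eid not in known_execution_ids]
        skipped_known += len(execution_ids) - len(new_ids)
        if new_ids:
            candidates[case_path] = new_ids
    return candidates, skipped_known
```

With an empty known set, as in a first-time dry run, every discovered case remains a candidate and the skipped count is zero.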

Happy for you to open a new GitHub issue to expand logging in https://github.com/E3SM-Project/simboard/blob/main/backend/app/scripts/ingestion/nersc_archive_ingestor.py and https://github.com/E3SM-Project/simboard/blob/main/backend/app/scripts/ingestion/hpc_upload_archive_ingestor.py.

> What is "candidate_logs_emitted/suppressed"?

They are dry-run logging counters.

candidate_logs_emitted = how many candidate case details were actually printed to the log.

candidate_logs_suppressed = how many candidate case details were not printed because the script hit its logging limit.

The point is to avoid massive logs when many candidate cases are found. It does not change which cases are candidates or which cases would be ingested.
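
A minimal Python sketch of that kind of log cap is below. The limit of 20 is inferred from the two dry-run summaries in this thread, and the function name and `event=dry_run_candidate` label are made up for illustration.

```python
from typing import Callable, Sequence, Tuple


def log_candidates(
    candidates: Sequence[str],
    limit: int = 20,  # assumed cap, inferred from candidate_logs_emitted=20
    emit: Callable[[str], None] = print,
) -> Tuple[int, int]:
    """Log at most `limit` candidates and return (emitted, suppressed) counts."""
    shown = candidates[:limit]
    for case_path in shown:
        emit(f"event=dry_run_candidate case_path={case_path}")  # hypothetical event name
    return len(shown), len(candidates) - len(shown)
```

The full candidate list is untouched; only the number of log lines is bounded.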

@tomvothecoder

Collaborator Author

> I suppose I should re-run the chrysalis test, using "OLD_PERF" as the root_PA directory. It is HUGE.

I don't think this is going to work yet as the directory structure of "OLD_PERF" is different from "performance_archive".
We need to expand ingestion support for "OLD_PERF" in #209.

We might also want to be targeted in what we ingest from "OLD_PERF". This will require guidance from Rob/Jill.

@TonyB9000

Collaborator

@tomvothecoder I could do a more thorough assessment, but "OLD_PERF" seems rather sparse. It contains a subdirectory for every year and month, and each has "performance_archive" subdirectories (approximately one per day), but they appear to contain no "exec_id" material, only logs:

((test_simboard) ) (base) [ac.bartoletti1@chrlogin1 simboard]$ ll /lcrc/group/e3sm/OLD_PERF/2026-01
total 23
drwxrwsr-x 3 e3smtest E3SM 4096 Jan  1 00:21 performance_archive_anvil_e3sm_2026_01_01_00_19_55
drwxrwsr-x 2 e3smtest E3SM 4096 Jan  2 00:17 performance_archive_anvil_e3sm_2026_01_02_00_17_24
drwxrwsr-x 2 e3smtest E3SM 4096 Jan  3 00:17 performance_archive_anvil_e3sm_2026_01_03_00_17_43
drwxrwsr-x 2 e3smtest E3SM 4096 Jan  4 00:20 performance_archive_anvil_e3sm_2026_01_04_00_20_49
drwxrwsr-x 2 e3smtest E3SM 4096 Jan  5 00:23 performance_archive_anvil_e3sm_2026_01_05_00_22_56
drwxrwsr-x 2 e3smtest E3SM 4096 Jan  6 00:19 performance_archive_anvil_e3sm_2026_01_06_00_19_00
drwxrwsr-x 2 e3smtest E3SM 4096 Jan  7 00:18 performance_archive_anvil_e3sm_2026_01_07_00_17_47
drwxrwsr-x 3 e3smtest E3SM 4096 Jan  8 00:20 performance_archive_anvil_e3sm_2026_01_08_00_18_23
drwxrwsr-x 2 e3smtest E3SM 4096 Jan  9 00:17 performance_archive_anvil_e3sm_2026_01_09_00_17_32
drwxrwsr-x 2 e3smtest E3SM 4096 Jan 10 00:18 performance_archive_anvil_e3sm_2026_01_10_00_18_46
drwxrwsr-x 2 e3smtest E3SM 4096 Jan 11 00:19 performance_archive_anvil_e3sm_2026_01_11_00_19_34
drwxrwsr-x 2 e3smtest E3SM 4096 Jan 12 00:17 performance_archive_anvil_e3sm_2026_01_12_00_17_13
drwxrwsr-x 2 e3smtest E3SM 4096 Jan 21 05:17 performance_archive_anvil_e3sm_2026_01_21_05_17_46
drwxrwsr-x 3 e3smtest E3SM 4096 Jan 22 00:21 performance_archive_anvil_e3sm_2026_01_22_00_20_40
drwxrwsr-x 2 e3smtest E3SM 4096 Jan 23 00:22 performance_archive_anvil_e3sm_2026_01_23_00_22_20
drwxrwsr-x 3 e3smtest E3SM 4096 Jan 24 00:21 performance_archive_anvil_e3sm_2026_01_24_00_21_28
drwxrwsr-x 3 e3smtest E3SM 4096 Jan 25 00:17 performance_archive_anvil_e3sm_2026_01_25_00_17_10
drwxrwsr-x 3 e3smtest E3SM 4096 Jan 26 00:18 performance_archive_anvil_e3sm_2026_01_26_00_17_46
drwxrwsr-x 3 e3smtest E3SM 4096 Jan 27 00:23 performance_archive_anvil_e3sm_2026_01_27_00_22_54
drwxrwsr-x 3 e3smtest E3SM 4096 Jan 28 00:18 performance_archive_anvil_e3sm_2026_01_28_00_18_06
drwxrwsr-x 3 e3smtest E3SM 4096 Jan 29 00:19 performance_archive_anvil_e3sm_2026_01_29_00_18_23
drwxrwsr-x 2 e3smtest E3SM 4096 Jan 30 00:23 performance_archive_anvil_e3sm_2026_01_30_00_22_48
drwxrwsr-x 2 e3smtest E3SM 4096 Jan 31 00:17 performance_archive_anvil_e3sm_2026_01_31_00_16_50
((test_simboard) ) (base) [ac.bartoletti1@chrlogin1 simboard]$ ll /lcrc/group/e3sm/OLD_PERF/2026-01/performance_archive_anvil_e3sm_2026_01_30_00_22_48
total 32
-rw-rw-r-- 1 e3smtest E3SM 22748 Jan 30 00:22 e3sm_perf_archive_anvil_2026_01_30_00_22_48_out.txt
-rw-rw-r-- 1 e3smtest E3SM     0 Jan 30 00:21 large-files-removed.txt
((test_simboard) ) (base) [ac.bartoletti1@chrlogin1 simboard]$ ll /lcrc/group/e3sm/OLD_PERF/2026-01/performance_archive_anvil_e3sm_2026_01_06_00_19_00
total 32
-rw-rw-r-- 1 e3smtest E3SM 22748 Jan  6 00:19 e3sm_perf_archive_anvil_2026_01_06_00_19_00_out.txt
-rw-rw-r-- 1 e3smtest E3SM     0 Jan  6 00:17 large-files-removed.txt

@TonyB9000

Collaborator

@tomvothecoder More importantly, I think we might want a "DRY_RUN_1" and "DRY_RUN_2", the latter wherein we exercise the "tar.gz" generation. (Even a "DRY_RUN_0" could stub the remote state test, and obviate the need to exercise the URL and API TOKEN, while still exercising the directory traversal and other "accept/reject" logic.)

As far as logging, it would be nice to simply know how many local entries were rejected due to redundancy (already accepted), or some confirmation that the remote state query had actually succeeded.
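
The tiered dry-run idea above could be sketched as follows in Python. Everything here is hypothetical: the level names, the step names, and the ordering (state is fetched first because the current dry run needs it up front) are assumptions meant only to make the proposal concrete.

```python
from enum import IntEnum
from typing import List


class DryRunLevel(IntEnum):
    """Hypothetical tiers matching the DRY_RUN_0/1/2 proposal."""
    LOCAL_ONLY = 0     # DRY_RUN_0: stub the remote state, no URL or token needed
    WITH_STATE = 1     # DRY_RUN_1: also fetch remote state through the API
    WITH_ARCHIVES = 2  # DRY_RUN_2: also build the tar.gz archives, without uploading


def planned_steps(level: DryRunLevel) -> List[str]:
    """List the steps a run at the given level would exercise."""
    steps = ["scan_archive", "accept_reject"]
    if level >= DryRunLevel.WITH_STATE:
        steps.insert(0, "fetch_remote_state")
    if level >= DryRunLevel.WITH_ARCHIVES:
        steps.append("build_tar_gz")
    return steps
```

A level-0 run would then need neither SIMBOARD_API_BASE_URL nor SIMBOARD_API_TOKEN, which is the main appeal for first-time site testing.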

@tomvothecoder tomvothecoder added type: ops Operation and Deployment tasks for DOE sites. and removed type: enhancement New feature or request labels Jun 10, 2026
@tomvothecoder tomvothecoder changed the title Add Chrysalis ingestion wrapper [Ops]: Add Chrysalis ingestion wrapper Jun 11, 2026
Labels

type: ops Operation and Deployment tasks for DOE sites.

Development

Successfully merging this pull request may close these issues.

[Ops]: Scope automated metadata ingestion at other sites (prioritize Chrysalis first)

2 participants