
Feat: Native Python FpML CodeList Validation and Resource Distribution (Resolves DSL I-150)#209

Open
jserrano-spec wants to merge 1 commit into
finos:mainfrom
jserrano-spec:feature/issue-150-codelist-loading

@jserrano-spec

Hi @dschwartznyc, @plamen-neykov,

Following up on your request in CDM I-3904 to consolidate our approach, our team decided to contribute our internal sandbox implementation directly to the test suite to establish a scalable pattern for FpML CodeList validation in Python.

1. What This PR Solves

This PR provides a complete, end-to-end solution for Issue DSL I-150 by enabling FpML code list validation natively in Python for in-memory objects. It achieves this by:

  • Establishing a clean extension directory: Creating src/main/python/runtime_extensions to house custom native Python implementations.
  • Native Bundling: Bundling the required JSON codelist resources and the Python runtime extensions directly into the generated finos-cdm wheel artifact.
  • Custom Data Loader: Providing a loader (load_codelist.py) that uses importlib and @lru_cache to efficiently read and deserialize those JSON files at runtime.
  • Runtime Initialization: Introducing a Python runtime initializer (cdm_runtime.py) that acts similarly to Java's CdmRuntimeModule. This seamlessly injects our custom loader into the natively generated LoadCodeList API using the assign binding mechanism.

2. How to Validate This Implementation

While a standard mvn clean install will build the Java generator, it does not execute the Python-specific test suite. To test this implementation locally and execute test_codelist_validation.py, you must use the Python test runner scripts.

Step 1: Clean Stale Artifacts (Important)

To ensure the script uses the latest compiled Java generator rather than a cached version, delete any existing snapshot JAR first:

rm target/python-0.0.0.main-SNAPSHOT.jar

Step 2: Initial Build & Test

Run the CDM test script with the -k flag to keep the virtual environment alive after the tests finish. The script automatically rebuilds the Java JAR via Maven, generates the Python code, bundles the JSON files and runtime extensions, and installs the wheel into a fresh .pyenv.

./python-test/cdm-tests/run_cdm_tests.sh -k

Step 3: Fast Iteration (Direct Pytest Execution)

Once the initial build from Step 2 completes, you do not need to use the bash scripts again. For fast test execution, simply activate the virtual environment and run pytest directly:

# Unix / Bash
source .pyenv/bin/activate
python -m pytest python-test/cdm-tests/test_codelist_validation.py

# Windows / PowerShell
. .\.pyenv\bin\Activate.ps1
python -m pytest python-test\cdm-tests\test_codelist_validation.py

3. Architecture & Data Flow Reference

Below is the consolidated data flow requested in CDM I-3904 to provide a comprehensive view of how this implementation works without modifying the core generator logic:

  • CDM Python Build Process & Distribution: We modified get_cdm.sh to pull rosetta-source/src/main/resources/codelist/json/* via sparse checkout. We then modified build_cdm.sh to copy these JSON files into src/finos/resources and copy the Python extensions into src/finos/runtime_extensions. Finally, we dynamically inject a MANIFEST.in configuration. This instructs the pip wheel builder to package these assets natively inside the .whl artifact.

  • Data Loading & Runtime Integration: Our custom loader (load_codelist.py) utilizes Python's importlib.resources to locate the JSON files inside the installed package. It uses a regex match on the requested domain parameter to find the correct file, reads the payload, and deserializes it using CodeList.rune_deserialize. To prevent redundant disk I/O, it is wrapped in an @lru_cache.

  • Generator & Validation Flow: We did not need to make any changes to the rune-python-generator. Because the logic for ValidateFpMLCodingSchemeDomain is now natively generated, our bootstrap script (finos.runtime_extensions.cdm_runtime) simply binds our loader to the LoadCodeList proxy API. The natively generated validation function executes flawlessly behind the scenes.

  • CDM's Serialization Format: Our current contribution is strictly focused on enabling runtime data validation for in-memory objects. Consequently, we have not explored modifications to the CDM serialization format to append external list version references when saving documents. We consider this outside the scope of this initial runtime-loading implementation.
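The data-loading step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: it reads from a plain directory instead of `importlib.resources`, uses `json.loads` as a stand-in for `CodeList.rune_deserialize`, and the file-naming rule (`<domain>-<version>.json`) and the names `load_codelist` and `codelist_dir` are hypothetical. It also shows one possible failure path when no file matches.

```python
import json
import re
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=15)
def load_codelist(domain: str, codelist_dir: Path) -> dict:
    """Load the JSON code list for `domain` from `codelist_dir` (cached per argument pair)."""
    # Assumed naming rule: "<domain>.json" or "<domain>-<digits and dashes>.json".
    pattern = re.compile(rf"^{re.escape(domain)}(-[\d-]+)?\.json$")
    matches = sorted(p for p in codelist_dir.iterdir() if pattern.match(p.name))
    if not matches:
        # Failure path: raise a clear error instead of returning an empty list.
        raise FileNotFoundError(f"No code list for domain {domain!r} in {codelist_dir}")
    # Stand-in for the model-specific deserializer used in the PR.
    return json.loads(matches[0].read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    codelist_dir = Path(tmp)
    sample = {"codes": ["GBLO", "USNY"]}
    (codelist_dir / "business-center-9-2.json").write_text(json.dumps(sample), encoding="utf-8")

    first = load_codelist("business-center", codelist_dir)   # reads from disk
    second = load_codelist("business-center", codelist_dir)  # served from the cache
    cache_hits = load_codelist.cache_info().hits

    try:
        load_codelist("no-such-domain", codelist_dir)
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

Note that `lru_cache` returns the same object on a cache hit, so callers that mutate the result would affect later lookups; exceptions are not cached, so a missing file is retried on the next call.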

This commit provides a complete, end-to-end solution for natively
validating FpML FpMLCodingSchemeDomain objects in Python by bundling
required JSON resources and injecting custom runtime extensions.

* Build & Distribution: Updated `get_cdm.sh` to fetch JSON codelists
  via sparse checkout. Updated `build_cdm.sh` to package these JSON
  files and Python extensions into the `finos-cdm` wheel artifact
  using dynamic MANIFEST.in injection.
* Runtime Extensions: Established `src/main/python/runtime-extensions/`
  to cleanly separate custom Python logic from the Java generator source.
* Data Loader: Created `load_codelist.py` leveraging `importlib.resources`
  and `@lru_cache` for dynamic, disk-efficient JSON deserialization.
* Bootstrap Injection: Introduced `cdm_runtime.py` to act as the Python
  Dependency Injection container, automatically binding the custom loader
  to the generated `LoadCodeList` facade.
* Testing: Added `test_codelist_validation.py` to verify the pipeline
  end-to-end, and updated `.gitignore` for the new resource paths.

Resolves: finos#150
@linux-foundation-easycla

linux-foundation-easycla Bot commented May 22, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ login: jserrano-spec / name: Javier Serrano Palazon (496235d)

@jserrano-spec jserrano-spec changed the title from "Feat: Native Python FpML CodeList Validation and Resource Distribution (Resolves [DSL I-150](https://github.com/finos/rune-python-generator/issues/150))" to "Feat: Native Python FpML CodeList Validation and Resource Distribution (Resolves DSL I-150)" on May 22, 2026

# Inject LoadCodeList
if hasattr(cdm_load_codelist_facade, "LoadCodeList"):
    cdm_load_codelist_facade.LoadCodeList.__assign__(impl_load_codelist) # type: ignore
Contributor

@plamen-neykov plamen-neykov May 22, 2026


Just use

cdm_load_codelist_facade.LoadCodeList = impl_load_codelist

And also replace the if with try/except.

Author

@jserrano-spec jserrano-spec May 22, 2026


Thanks for the suggestion to use try/except instead of hasattr.

Regarding using the standard assignment operator (=), I actually tried that first, but it causes the validation tests to fail with a NotImplementedError (I just re-verified it locally - you will find the error log attached at the end of this answer).

From my debugging, this happens because LoadCodeList is generated as a standalone file, and the Java generator currently only appends the create_module_attr_guardian to _bundle.py. Because the standalone file lacks the module guardian, the = operator just overwrites the attribute in the module's local namespace instead of triggering __assign__ on the underlying FuncProxy.

Since ValidateFpMLCodingSchemeDomain imports it early via from ... import LoadCodeList, it ends up holding onto the original, empty proxy object. Calling __assign__ directly bypasses this by forcing the existing proxy to mutate its internal state, which makes the early-bound references work perfectly.

I can update my code to use the try/except block with __assign__ for now. Let me know if you'd prefer to handle this differently, or if there's a plan to add the module guardian to standalone generated files!

___________________________________________ test_valid_business_center_code ____________________________________________

    def test_valid_business_center_code():
>       is_valid = cdm_api.ValidateFpMLCodingSchemeDomain("GBLO", "business-center")
        
# ... [internal framework proxy/validation trace] ...

    def rune_execute_native(function_name: str, *args, **kwargs) -> Any:
        if function := _NATIVE_REGISTRY.get(function_name):
            return function(*args, **kwargs)
        available = ", ".join(sorted(_NATIVE_REGISTRY))
>       raise NotImplementedError(
            f"Function {function_name} doesn't have an implementation! "
            f"Available: {available or '<none>'}"
        )
E       NotImplementedError: Function cdm.base.staticdata.codelist.functions.LoadCodeList doesn't have an implementation! Available: <none>
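The early-binding behaviour described in this comment can be reproduced with a toy stand-in. The sketch below is an illustration only: `ToyProxy`, its `assign` method, and the `facade` module are hypothetical and are not the rune runtime's `FuncProxy` or the generated CDM code. It shows that rebinding a module attribute does not reach a reference taken earlier with `from ... import`, while mutating the proxy object does.

```python
import types
from typing import Callable, Optional


class ToyProxy:
    """Hypothetical stand-in for a function proxy; not the rune runtime's FuncProxy."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._impl: Optional[Callable[..., object]] = None

    def assign(self, impl: Callable[..., object]) -> None:
        # Mutates this object in place, so every existing reference sees the change.
        self._impl = impl

    def __call__(self, *args, **kwargs):
        if self._impl is None:
            raise NotImplementedError(f"Function {self._name} doesn't have an implementation!")
        return self._impl(*args, **kwargs)


# Stands in for the generated standalone module that exposes the proxy.
facade = types.ModuleType("facade")
facade.LoadCodeList = ToyProxy("LoadCodeList")

# Equivalent of a consumer running `from facade import LoadCodeList` at import time.
early_bound = facade.LoadCodeList


def impl(domain: str) -> str:
    return f"codes for {domain}"


# Plain assignment only rebinds the module attribute; the early reference is untouched.
facade.LoadCodeList = impl
try:
    early_bound("business-center")
    rebinding_reached_consumer = True
except NotImplementedError:
    rebinding_reached_consumer = False

# Mutating the original proxy is visible through the early-bound reference.
early_bound.assign(impl)
result_after_mutation = early_bound("business-center")
```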

# Add the LRU Cache.
# maxsize=15 means it will keep the 15 most recently requested domains in memory, evicting the least recently used (LRU) ones when the limit is exceeded.
@lru_cache(maxsize=15)
def load_codelist(domain: str) -> CodeList:
Contributor


This function has to be in the runtime, otherwise codelists can't be used in any other model but cdm.

Author

@jserrano-spec jserrano-spec May 22, 2026


I completely agree with this architectural goal: this utility should definitely be generic so other models can use it. To make it truly generic, I'll also need to parameterize the deserialization step (since it currently hardcodes finos.cdm.base...CodeList).

Question on logistics: Because this requires modifying both repositories, how would you like to coordinate this? Should I open a separate PR in rune-python-runtime first to add the generic utility, wait for that to be merged/released, and then update this PR to consume it?

logger.setLevel(logging.INFO)

# Dynamically locate the 'finos' CDM package in the environment. Then navigate to the nested resources folder
codelist_dir = resource_loader.files("finos").joinpath("resources", "codelist", "json")
Contributor


Make this a parameter of load_codelist and create a partial during runtime initialisation.

Author


Great call. Hardcoding "finos" inside the loader limits its reusability. I will refactor load_codelist to accept codelist_dir as an argument, and update cdm_runtime.py to use functools.partial to inject the specific path during bootstrap. Thanks for your feedback.

Author

@jserrano-spec jserrano-spec May 22, 2026


Hi again, @plamen-neykov,

I really liked the idea of using functools.partial, but I ran into a strictness issue with FuncProxy when I implemented it.

Because functools.partial does not remove keyword-bound arguments from the signature reported by inspect.signature (it turns them into keyword-only parameters with a default value), the signature of the partial object evaluates to (domain, *, codelist_dir=...).

When __assign__ runs, it hits func_proxy.py line 81:

curr_params = inspect.signature(self._func).parameters
new_params = inspect.signature(func).parameters
if curr_params.keys() != new_params.keys(): # This line
    raise ValueError(
        'Replacement function parameter list do not match the current '
        f'parameter list of {str(self._func)}'
    )

Because ['domain'] != ['domain', 'codelist_dir'], the framework rejects the assignment with a ValueError.

To ensure FuncProxy gets the exact ['domain'] signature it expects while we wait to move the deserializer into the runtime, I used a standard closure (wrapper function) in cdm_runtime.py to bind the directory instead:

# Dynamically load resources inside CDM package in the environment
codelist_dir = resource_loader.files("finos").joinpath("resources", "codelist", "json")

# Partial approach
# from functools import partial
# bound_load_codelist = partial(impl_load_codelist, codelist_dir=codelist_dir)

# Use a standard wrapper instead of 'partial' so the signature is perfectly clean for FuncProxy
def bound_load_codelist(domain: str):
    return impl_load_codelist(domain, codelist_dir)
    
# Inject custom implementations into the CDM Facades. This is the Python equivalent of Guice bindings in Java.
try:
    cdm_load_codelist_facade.LoadCodeList.__assign__(bound_load_codelist) # type: ignore
    logger.debug("Successfully bound LoadCodeList.")
except AttributeError:
    logger.error("Bootstrap Failed: Could not find LoadCodeList in CDM Facade.")

Let me know if this works for you, or if you'd prefer I open a PR in rune-python-runtime to make FuncProxy compatible with partial objects!
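The signature difference discussed in this comment can be checked in isolation with the standard library. The sketch below uses hypothetical stand-in functions (`impl_load_codelist`, `impl_dir_first`, and the `/tmp/codelists` path are illustrative, not the PR's code). It compares a keyword-bound `partial`, a closure, and one possible variant in which the directory is the first parameter and is bound positionally, which `inspect.signature` reports without the bound parameter.

```python
import inspect
from functools import partial


def impl_load_codelist(domain: str, codelist_dir: str) -> str:
    # Stand-in body: only the parameter list matters for this comparison.
    return f"{codelist_dir}/{domain}.json"


# Keyword binding keeps the parameter in the signature as keyword-only with a default.
by_keyword = partial(impl_load_codelist, codelist_dir="/tmp/codelists")
keyword_keys = list(inspect.signature(by_keyword).parameters)


# A closure exposes exactly the parameters it declares.
def bound_load_codelist(domain: str) -> str:
    return impl_load_codelist(domain, "/tmp/codelists")


closure_keys = list(inspect.signature(bound_load_codelist).parameters)


# Hypothetical variant: directory first, so it can be bound positionally.
def impl_dir_first(codelist_dir: str, domain: str) -> str:
    return f"{codelist_dir}/{domain}.json"


# Positionally bound parameters are dropped from the reported signature.
by_position = partial(impl_dir_first, "/tmp/codelists")
positional_keys = list(inspect.signature(by_position).parameters)
```

Whether a positionally bound `partial` would satisfy the rest of `FuncProxy` is not something this sketch verifies; it only shows that the parameter-name comparison quoted above would see `['domain']` in that case.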

)

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
Contributor


Do not do that! The user should be selecting the logging level.

Author

@jserrano-spec jserrano-spec May 22, 2026


You are absolutely right, library code shouldn't hijack the application's logging configuration! I will remove logging.basicConfig() and setLevel() so the end-user retains full control over their log output. I'll include this fix in the next commit.
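A common library-side pattern consistent with this feedback is sketched below, assuming a hypothetical logger name and a hypothetical `bind_loader` function. The library creates a logger and attaches a `NullHandler`, leaving the level and real handlers to the application.

```python
import logging

# Library module: create a logger but leave level and handlers to the application.
# The logger name is hypothetical.
logger = logging.getLogger("finos.runtime_extensions.example")
logger.addHandler(logging.NullHandler())


def bind_loader() -> None:
    # Emitted only if the application has enabled DEBUG for this logger or an ancestor.
    logger.debug("Successfully bound LoadCodeList.")


if __name__ == "__main__":
    # Application code, not the library, chooses the verbosity.
    logging.basicConfig(level=logging.DEBUG)
    bind_loader()
```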


# Copy CDM resources to the target 'common-domain-model/resources' folder
mkdir -p "${RESOURCES_DIR}/common-domain-model/codelist/json"
cp -r rosetta-source/src/main/resources/codelist/json/* "${RESOURCES_DIR}/common-domain-model/codelist/json/"
Contributor

@plamen-neykov plamen-neykov May 22, 2026


These won't be packaged in the wheel, so a cdm package installed with pip install won't support codelists.

Author

@jserrano-spec jserrano-spec May 22, 2026


Actually, they do get packaged into the wheel!

While get_cdm.sh just downloads them locally to the workspace, the actual packaging happens over in build_cdm.sh. If you look at lines 108-114 in build_cdm.sh in this PR, I added a step that dynamically writes a MANIFEST.in file containing recursive-include src/finos/resources * right before the wheel is built.

This successfully forces pip wheel to bundle the JSON files natively inside the .whl artifact, which is why importlib.resources is able to find them at runtime. Let me know if you'd prefer a different packaging mechanism, but this approach has been working perfectly in our sandbox!
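One way to confirm what actually lands in a built wheel is to list the archive contents, since a .whl file is a zip archive. The sketch below is an illustration under assumptions: the helper name `list_packaged_codelists` is hypothetical, the `finos/resources/` prefix assumes the `src/finos/resources` directory maps to `finos/resources` inside the wheel, and the demonstration runs against a throwaway archive rather than the real artifact. Whether MANIFEST.in entries reach the wheel can also depend on the build backend's package-data settings, which this check makes visible either way.

```python
import tempfile
import zipfile
from pathlib import Path


def list_packaged_codelists(wheel_path: Path, prefix: str = "finos/resources/") -> list[str]:
    """Return the .json archive members found under `prefix` in the given wheel."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(
            name for name in wheel.namelist()
            if name.startswith(prefix) and name.endswith(".json")
        )


# Demonstration against a throwaway archive standing in for a real wheel.
with tempfile.TemporaryDirectory() as tmp:
    fake_wheel = Path(tmp) / "example-0.0.0-py3-none-any.whl"
    with zipfile.ZipFile(fake_wheel, "w") as archive:
        archive.writestr("finos/__init__.py", "")
        archive.writestr("finos/resources/codelist/json/business-center-9-2.json", "{}")
    packaged = list_packaged_codelists(fake_wheel)
```

An empty result for the real wheel would indicate that the JSON files were not bundled.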

Contributor


@jserrano-spec A few suggestions that would help us more effectively review the PR.

Start with a design document. Although there's an outline, the details of your approach are still unclear. Please clarify the data flows in a design document that we can review and comment on before further code changes. That design should align to the three main pieces involved in creating a Python version of CDM:

  • the Runtime, which provides core Python functionality used by generated code
  • the Generator, which interprets a Rune model and emits Python that depends on the Runtime
  • the CDM build and deployment process, which uses the Generator to produce Python from CDM's data and functional model

Completeness. Your proposal should call out the impact on all three components and describe how the changes together realize the goals set out in the issue. The proposal should also address the central requirement that code lists can be independently updated without requiring the creation of a new version of the CDM Python artifact.

Generic Technology

  • Changes to the Runtime and Generator need to reflect that both are generic technology that can be applied to any Rune codebase.
  • Include generic tests: Point #4 of the contribution process requires that any contribution include tests confirming the change works as expected. These become part of the test suite that runs on every PR, which protects the rest of the codebase from regressions. For reference, the CDM tests live in the repo to support development but are not part of this suite. The proposed change must therefore include generic (non-CDM) tests that demonstrate it works as expected.

Cover both success and failure paths. As a rule, the design should address both the success path ("on flow") and failure paths ("off flow"). For example, what happens if the JSON code lists aren't found when the Python library loads?

Use a feature branch for work in progress. PRs against main are expected to fully implement the proposed change, including the tests mentioned above. For works in progress, please use a feature branch and feel free to solicit input from the maintainers along the way. That keeps the eventual review against main focused on a complete, tested change.
