Add a schema registry by TeresiaOlsson · Pull Request #264 · python-accelerator-middle-layer/pyaml

TeresiaOlsson · 2026-05-20T08:50:13Z

This PR adds a schema registry that can be used for validation and generating json schemas for dynamic nested pydantic models.

Motivation:

Better separation of concerns. Validation of the configuration is moved into classes separate from the ones responsible for storing and using the configuration, resulting in a distinct validation layer.
Validation can be made optional. The entire validation layer can be skipped if the user wishes, for example when they know that the input data has already been validated once and has not changed.
Looser dependency on pydantic. Pydantic is only used at the edge of the core and not everywhere, in line with what has been discussed for how to manage the dependency on pint.
Solving two problems with pydantic:
1. It is not good at handling nested models where the available types are only known at runtime.
2. It is not good at generating JSON schemas for arbitrary types and including all available subclasses of a model.

Features:

A schema registry which maps a class path to the schema that should be used for validation of the input data to the class.
A decorator to automatically register schemas in the schema registry.
A schema validator which validates nested dynamic models with the use of the registry.
A custom json schema generator which generates json schemas including all available subclasses with the use of the registry.
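
The first two features can be pictured with a minimal sketch. It is not the implementation in this PR: the names `SchemaRegistry`, `register_schema`, `class_paths` and the `example.magnets.*` class paths are hypothetical, and a plain class stands in for a pydantic schema so the sketch has no dependencies. Returning `None` from `get` for a missing entry follows the commit message further down.

```python
from typing import Callable, Optional


class SchemaRegistry:
    """Maps a fully qualified class path to the schema validating its input."""

    def __init__(self) -> None:
        self._schemas: dict[str, type] = {}

    def register(self, class_path: str, schema: type) -> None:
        if class_path in self._schemas:
            raise KeyError(f"A schema is already registered for {class_path!r}")
        self._schemas[class_path] = schema

    def get(self, class_path: str) -> Optional[type]:
        # Return None instead of raising when no schema is registered.
        return self._schemas.get(class_path)

    def class_paths(self) -> list[str]:
        return sorted(self._schemas)


registry = SchemaRegistry()


def register_schema(*class_paths: str) -> Callable[[type], type]:
    """Class decorator: register the decorated schema for the given classes."""

    def decorator(schema: type) -> type:
        for path in class_paths:
            registry.register(path, schema)
        return schema

    return decorator


# One schema can serve several domain classes that take the same input.
@register_schema("example.magnets.Quadrupole", "example.magnets.Octupole")
class MagnetSchema:
    """Stand-in for a pydantic schema shared by simple magnet classes."""
```

A validator or JSON schema generator would then look up `registry.get(class_path)` for each nested entry instead of relying on types known at import time.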

Major changes:

All ConfigModels have been separated from the domain classes and turned into schema classes whose only purpose is to define the schema used for validation. This is a major refactoring since the ConfigModels are spread out in the codebase. A temporary legacy handler had to be implemented to handle the packages for the bindings. All domain classes also had to be changed since they now need to explicitly declare their attributes.
A base class ConfigurationSchema has been added that all schemas must inherit from. This ensures that all schemas registered in the registry have the minimum required fields and the same behaviour.
The yaml file format has been changed so that the type is defined by the class instead of by the module. This makes it compatible with external packages where a certain module structure can't be enforced, and allows several similar classes to be grouped in the same module. A temporary function to translate between the yaml file formats has been added.
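
As an illustration of what such a translation step could look like, here is a small sketch operating on already loaded yaml data. The key names (`type` for the old module path, `class` for the new class path), the lookup table and the `example.*` paths are assumptions made for this example and are not taken from the PR.

```python
from typing import Any

# Hypothetical lookup table: old module path -> new class path.
MODULE_TO_CLASS = {
    "example.magnets.quadrupole": "example.magnets.Quadrupole",
    "example.bpm": "example.bpm.BPM",
}


def convert_legacy(node: Any) -> Any:
    """Recursively rewrite {'type': <module>} into {'class': <class path>}."""
    if isinstance(node, list):
        return [convert_legacy(item) for item in node]
    if not isinstance(node, dict):
        return node  # scalars are returned unchanged
    converted = {}
    for key, value in node.items():
        if key == "type" and isinstance(value, str) and value in MODULE_TO_CLASS:
            converted["class"] = MODULE_TO_CLASS[value]
        else:
            converted[key] = convert_legacy(value)
    return converted


legacy = {
    "type": "example.accelerator",  # not in the table, so left untouched
    "devices": [
        {"type": "example.magnets.quadrupole", "name": "Q1"},
        {"type": "example.bpm", "name": "BPM1"},
    ],
}
converted = convert_legacy(legacy)
```

The function builds a new structure and leaves the input untouched, so the old and new forms can be compared during migration.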

Example usage:
Examples of the usage are available in https://github.com/python-accelerator-middle-layer/pyaml/tree/schema-registry/examples/validation.

The JSON schema can be visualized and tested with MetaConfigurator to more easily understand what it includes. You can import the schema directly from here: https://github.com/python-accelerator-middle-layer/pyaml/blob/schema-registry/pyaml/validation/schema.json

JeanLucPons · 2026-05-28T12:29:26Z

I do not understand why some external modules appear in the committed JSON files.
Are these files generated?

TeresiaOlsson · 2026-05-28T12:37:10Z

I will explain it better at the meeting tomorrow. It's a temporary, quite ugly fix I added to be able to register the schemas for the binding packages and convert the format of the yaml files. There is also one related ugly thing in the pyproject.toml at the moment. I didn't manage to figure out a less ugly way to handle it.

My plan is that all of that should go away when people have converted the yaml file format and I have updated the packages for the bindings to instead use entry points to find the schemas in there.

JeanLucPons · 2026-05-28T12:43:50Z

OK thanks

TeresiaOlsson · 2026-05-28T12:44:57Z

For the JSON schema I committed the full schema including the external packages to be able to provide a link which people can directly load into MetaConfigurator for testing without having to generate it themselves.

But I'm not sure if there should be a "standard" schema available in the repository in the future or not. I kind of like the idea of having a pregenerated schema including the most common external packages so new users can get started without having to generate the schema themselves. Then you only need to generate your own schema if you have facility-specific packages. But if there should be such a basic, standard schema, maybe it should be in its own repository. In that repository we could then also have different files for different parts of the schema if we want.

JeanLucPons · 2026-05-28T12:52:10Z

OK so it would be nice to provide a way to generate this schema with the modules you want to use.
Using catalog it should be possible to configure your backend using only simple strings.

TeresiaOlsson · 2026-05-28T13:03:34Z

That I think already works. As long as you have everything you want to include registered in the schema registry, it will show up in the JSON schema. The key is to get external packages to automatically register in the schema registry and make that compatible with what the catalog needs. Those parts don't work automatically yet and might require some discussion to figure out the best way, since I'm still working on understanding the details of the catalog to be able to make it compatible. But at least you can always manually register everything you want to include.

JeanLucPons · 2026-05-29T07:52:30Z

I'm starting to look at your implementation.
If I understand your strategy correctly, your plan is to remove pydantic validation from the Factory and let the factory expand the dictionary to construct objects?
If I'm right, please think about how to handle the error line number in validation (see Factory.handle_validation_error).
(Your validation mechanism should handle the __location__ and __fieldlocation__ private fields coming from the loader.)
Thanks.

JeanLucPons · 2026-05-29T08:19:45Z

For me the purpose of Adapted_ControlSystem is a bit unclear. Would it be possible to have one schema per magnet type to avoid those unwanted layers (item1->item1)?

TeresiaOlsson · 2026-05-29T09:04:16Z

> I'm starting to look at your implementation. If I understand your strategy correctly, your plan is to remove pydantic validation from the Factory and let the factory expand the dictionary to construct objects? If I'm right, please think about how to handle the error line number in validation (see Factory.handle_validation_error). (Your validation mechanism should handle the __location__ and __fieldlocation__ private fields coming from the loader.) Thanks.

Yes, my plan is to make the validation a completely separate step done before the factory so the factory is only responsible for building objects. I'm aware of the error line number and also the function to reformat the validation errors from pydantic so they become easier to understand. I want to move that functionality into the SchemaValidator as part of refactoring the factory.
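
To make the idea concrete, here is a sketch of a validation step that keeps the source position in its error messages. The key name `__location__` comes from the discussion above, but its value format (file name, line, column), the `CorrectorSchema` model and the function name are assumptions for this example; the actual loader and `SchemaValidator` may differ.

```python
from pydantic import BaseModel, ValidationError

LOCATION_KEY = "__location__"  # key from the discussion; value format assumed


class CorrectorSchema(BaseModel):  # hypothetical schema
    name: str
    max_current: float


def validate_with_location(schema: type[BaseModel], data: dict) -> dict:
    """Validate `data`; prefix any error with the loader-provided position."""
    payload = dict(data)
    # Assumed format: (file name, line, column) attached by the yaml loader.
    location = payload.pop(LOCATION_KEY, None)
    try:
        return schema.model_validate(payload).model_dump()
    except ValidationError as exc:
        where = "{}:{}:{}".format(*location) if location else "<unknown location>"
        details = "; ".join(
            f"{'.'.join(str(part) for part in err['loc'])}: {err['msg']}"
            for err in exc.errors()
        )
        raise ValueError(f"{where}: {details}") from exc


validated = validate_with_location(
    CorrectorSchema,
    {"name": "C1", "max_current": 2.0, LOCATION_KEY: ("ring.yaml", 12, 3)},
)
```

The factory would only ever receive the returned plain dictionary, so it no longer needs to know about pydantic or about the location fields.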

JeanLucPons · 2026-05-29T09:15:59Z

> Yes, my plan is to make the validation a completely separate step done before the factory so the factory is only responsible for building objects. I'm aware of the error line number and also the function to reformat the validation errors from pydantic so they become easier to understand. I want to move that functionality into the SchemaValidator as part of refactoring the factory.

OK. The refurbishment of the factory should not be a big deal. You only need to use the new class field to construct the object and expand the dictionary. What I dislike here is the duplication of the constructor signature (as discussed a long time ago), which can be a source of errors that are difficult to debug. But there is no other way to allow construction by code without having to construct the schema first.

TeresiaOlsson · 2026-05-29T09:28:13Z

> For me the purpose of Adapted_ControlSystem is a bit unclear. Would it be possible to have one schema per magnet type to avoid those unwanted layers (item1->item1)?

The Adapted_ControlSystem is part of the ugly fix to be able to include the ConfigModel from the binding packages even though they don't follow the new format or inherit from ConfigurationSchema, which is a requirement for being added to the SchemaRegistry. It will go away as soon as I refactor the schemas in those packages. They are just called Adapted_ControlSystem at the moment to indicate that it is part of the legacy fix.

For the magnets I also don't like the two layers of item1->item1 in the GUI, but I found no way around it. Before, there was a separate schema for each magnet type, but it just meant I had to choose OctupoleSchema in the first box and still choose Octupole as the class, since the GUI doesn't prefill the class even if there is only one option. In the end I concluded that these are issues related to the MetaConfigurator GUI, which can be fixed in the future either by writing our own GUI application or by making feature requests to MetaConfigurator. Instead I decided to follow the principle of only adding a new schema if there is a field which makes it different. My hope was that this would eventually allow us to have oneOf instead of anyOf in the JSON schema, which would also make the GUI work nicer. But it turns out that oneOf is a very strict requirement: it isn't sufficient to have two schemas with different fields if the fields that make them different are all optional. They need to have mandatory fields which are different, and that might be too strict a requirement for us. But I'm still thinking about that.
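
The oneOf problem described above can be reproduced with a toy model. `ToySchema` is a deliberately reduced stand-in for an object schema that forbids unknown keys, and the schema and field names are invented for the example; real JSON Schema validation has many more rules.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToySchema:
    """Reduced stand-in for an object schema without extra properties."""

    name: str
    required: frozenset
    optional: frozenset = field(default_factory=frozenset)

    def accepts(self, data: dict) -> bool:
        keys = set(data)
        # All required keys present, and no key outside required + optional.
        return self.required <= keys <= (self.required | self.optional)


def matches(schemas, data):
    """Names of all schemas that accept `data`."""
    return [s.name for s in schemas if s.accepts(data)]


# Two hypothetical schemas that differ only in an optional field.
schema_a = ToySchema("SchemaA", frozenset({"class", "name"}), frozenset({"model"}))
schema_b = ToySchema("SchemaB", frozenset({"class", "name"}), frozenset({"mapping"}))

shared_only = {"class": "Octupole", "name": "O1"}
hits = matches([schema_a, schema_b], shared_only)
any_of_ok = len(hits) >= 1  # anyOf: at least one branch matches
one_of_ok = len(hits) == 1  # oneOf: exactly one branch matches

# Making the distinguishing field mandatory removes the ambiguity.
schema_b_strict = ToySchema("SchemaB", frozenset({"class", "name", "mapping"}))
strict_hits = matches([schema_a, schema_b_strict], shared_only)
```

An input that only uses the shared fields satisfies both branches, so it passes anyOf but fails oneOf. Only when the distinguishing field is mandatory does exactly one branch match.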

TeresiaOlsson · 2026-05-29T09:51:31Z

> > Yes, my plan is to make the validation a completely separate step done before the factory so the factory is only responsible for building objects. I'm aware of the error line number and also the function to reformat the validation errors from pydantic so they become easier to understand. I want to move that functionality into the SchemaValidator as part of refactoring the factory.
>
> OK. The refurbishment of the factory should not be a big deal. You only need to use the new class field to construct the object and expand the dictionary. What I dislike here is the duplication of the constructor signature (as discussed a long time ago), which can be a source of errors that are difficult to debug. But there is no other way to allow construction by code without having to construct the schema first.

I also disliked the duplication of the constructor signature initially. It was annoying to have to write the same thing twice. I tried some ideas to avoid it but found none that I liked and which was easier to use than copy-paste. But after a while I actually started to like it because it made it explicit what the attributes are for building the domain objects.

I found several cases where there were attributes in the ConfigModel which were never used for anything because they were hidden inside the _cfg and never exposed as public attributes in the domain class. Having to write all the attributes in the constructor forced me to think about which attributes should be private or public, which need properties because they should be read-only or need some extra step when setting, etc. For new users/developers I imagine that this is beneficial because it forces them to think about what the public interface of the class should be. Then, when they are done figuring that out, they can add the schema or register an already existing schema for their class.

I also liked how it forced me to think about the scope of the class. In a pydantic BaseModel it is very easy to add 10 attributes, but when you need to add all of them again to __init__ you start to realise that maybe they are too many. I think many attributes can be an indication of a class having too large a scope and that it should be separated into smaller classes to get a better separation of concerns.

JeanLucPons · 2026-05-29T10:28:15Z

My main problem is that this duplication can be a source of errors: if you forget one parameter or enter them in the wrong order, you may face issues that are tricky to solve. Duplication is generally not a good idea.
Today _cfg is supposed to be private, which means that it should only be accessed from the pyaml core and not by the user. For dynamic configuration changes I would prefer to add some public methods. Anyway, with your schema, additional methods for changing the config will also be required.
For example:

Today:

    # Set up the chromaticity monitor (override config settings)
    CM = SR.get_chromaticity_monitor("CHROMATICITY_MONITOR")
    CM._cfg.n_step = 3      # 3 points for chroma fit
    CM._cfg.n_avg_meas = 1  # No averaging

should be:

    # Set up the chromaticity monitor (override config settings)
    CM = SR.get_chromaticity_monitor("CHROMATICITY_MONITOR")
    CM.set_n_step(3)      # 3 points for chroma fit
    CM.set_n_avg_meas(1)  # No averaging

TeresiaOlsson · 2026-05-29T10:49:48Z

> My main problem is that this duplication can be a source of errors: if you forget one parameter or enter them in the wrong order, you may face issues that are tricky to solve. Duplication is generally not a good idea. Today _cfg is supposed to be private, which means that it should only be accessed from the pyaml core and not by the user. For dynamic configuration changes I would prefer to add some public methods. Anyway, with your schema, additional methods for changing the config will also be required. For example:
>
> Today:
>
>     # Set up the chromaticity monitor (override config settings)
>     CM = SR.get_chromaticity_monitor("CHROMATICITY_MONITOR")
>     CM._cfg.n_step = 3      # 3 points for chroma fit
>     CM._cfg.n_avg_meas = 1  # No averaging
>
> should be:
>
>     # Set up the chromaticity monitor (override config settings)
>     CM = SR.get_chromaticity_monitor("CHROMATICITY_MONITOR")
>     CM.set_n_step(3)      # 3 points for chroma fit
>     CM.set_n_avg_meas(1)  # No averaging

I'm not sure I understand. In the schema there should be no method to change the configuration. The purpose of the schema classes is purely to define which fields are required for validation. You would then use the SchemaValidator to validate the data, but the BaseModel that is returned will not be used for anything other than extracting a dictionary with the validated data to be input to other classes. It is not meant to be stored. The part which covers how to store and dynamically change the configuration is outside the scope of this PR.

However, if people like the idea, it is possible to later make an implementation where dynamic changes of the configuration also call the SchemaValidator to validate the data again before it is changed. I have started to think about that, but the PR became too large so I decided to split it and first make one which is only focused on the validation step.
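
The flow described in this comment (validate, extract a dictionary, hand it to a domain class that knows nothing about pydantic) can be sketched as follows. The schema, the domain class, their fields and defaults, and the `build` helper are hypothetical and only loosely modelled on the chromaticity monitor mentioned above.

```python
from pydantic import BaseModel, Field


class ChromaticityMonitorSchema(BaseModel):  # hypothetical schema
    name: str
    n_step: int = Field(default=5, ge=2)
    n_avg_meas: int = Field(default=1, ge=1)


class ChromaticityMonitor:  # hypothetical domain class, no pydantic inside
    def __init__(self, name: str, n_step: int = 5, n_avg_meas: int = 1) -> None:
        self.name = name
        self.n_step = n_step
        self.n_avg_meas = n_avg_meas


def build(raw: dict, validate: bool = True) -> ChromaticityMonitor:
    """Optionally validate `raw`, then construct the domain object from a dict."""
    if validate:
        # The model is only used to produce validated data and is discarded.
        data = ChromaticityMonitorSchema.model_validate(raw).model_dump()
    else:
        data = raw  # caller states that the data was validated earlier
    return ChromaticityMonitor(**data)


monitor = build({"name": "CHROMATICITY_MONITOR", "n_step": "3"})
trusted = build({"name": "CHROMATICITY_MONITOR", "n_step": 7}, validate=False)
```

Because the domain class receives keyword arguments rather than positional ones, a field missing from its constructor fails immediately with a `TypeError` instead of silently shifting values.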

JeanLucPons · 2026-05-29T11:14:08Z

My remark was rather about the consequences of your validation model (not the schema itself) and the fact that fields have to be duplicated at the object level, which means that you in fact have 3 duplications:

* model schema
* object constructor signature
* object fields to access the configuration

In an ideal world, to follow the pyaml coding style, I would like to be able to write:

    # Set up the chromaticity monitor (override config settings)
    CM = SR.get_chromaticity_monitor("CHROMATICITY_MONITOR")
    CM.n_step.set(3)      # 3 points for chroma fit
    CM.n_avg_meas.set(1)  # No averaging

and to have a mechanism (using a decorator or dynamic code generation) that automatically maps schema fields to object getters/setters, with an optional callback which allows the object to be informed of field updates. This would require being able to select at the schema level whether a field is R or RW.
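
One way such a mechanism could look is sketched below. Everything in it is hypothetical: `ConfigField`, `expose_fields`, the `_field_changed` callback name, and the `spec` dictionary, which stands in for information that would be read from the schema (field name, default, and whether it is R or RW).

```python
class ConfigField:
    """Holds one configuration value and exposes get()/set()."""

    def __init__(self, name, value, writable=True, on_change=None):
        self.name = name
        self._value = value
        self._writable = writable
        self._on_change = on_change

    def get(self):
        return self._value

    def set(self, value):
        if not self._writable:
            raise AttributeError(f"{self.name} is read-only")
        self._value = value
        if self._on_change is not None:
            self._on_change(self.name, value)  # inform the owning object


def expose_fields(spec):
    """Class decorator; `spec` maps field name -> (default, writable)."""

    def decorator(cls):
        original_init = cls.__init__

        def __init__(self, *args, **kwargs):
            callback = getattr(self, "_field_changed", None)
            for name, (default, writable) in spec.items():
                value = kwargs.pop(name, default)
                setattr(self, name, ConfigField(name, value, writable, callback))
            original_init(self, *args, **kwargs)

        cls.__init__ = __init__
        return cls

    return decorator


@expose_fields({"n_step": (5, True), "n_avg_meas": (1, True), "name": ("", False)})
class ChromaticityMonitor:
    def __init__(self):
        self.changes = []

    def _field_changed(self, name, value):
        self.changes.append((name, value))


CM = ChromaticityMonitor(name="CHROMATICITY_MONITOR")
CM.n_step.set(3)      # 3 points for chroma fit
CM.n_avg_meas.set(1)  # No averaging
```

The sketch removes the third duplication (object fields) but not the one between schema and constructor, and it performs no validation in `set`.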

TeresiaOlsson · 2026-05-29T12:16:26Z

> My remark was rather about the consequences of your validation model (not the schema itself) and the fact that fields have to be duplicated at the object level, which means that you in fact have 3 duplications:
>
> * model schema
> * object constructor signature
> * object fields to access the configuration
>
> In an ideal world, to follow the pyaml coding style, I would like to be able to write:
>
>     # Set up the chromaticity monitor (override config settings)
>     CM = SR.get_chromaticity_monitor("CHROMATICITY_MONITOR")
>     CM.n_step.set(3)      # 3 points for chroma fit
>     CM.n_avg_meas.set(1)  # No averaging
>
> and to have a mechanism (using a decorator or dynamic code generation) that automatically maps schema fields to object getters/setters, with an optional callback which allows the object to be informed of field updates. This would require being able to select at the schema level whether a field is R or RW.

If you want that, I think it should be implemented in a way that makes the object constructor signature the source of truth. There is a way in pydantic to do that using TypeAdapter, but when I tried it I didn't really like it. In the end my conclusion was that the duplication was the more robust way to implement it, especially after I concluded that there are many classes which require the exact same input, so there is no need to have separate schemas.

JeanLucPons · 2026-05-29T13:01:15Z

OK. For me it is a bit heavy to have all these duplications and also to lose the repr from pydantic.

For instance this will not work:

    arrays: list[ArraySchema] = Field(default_factory=list, repr=False)
    devices: list[ElementSchema] = Field(default_factory=list, repr=False)

Anyway, this refurbishment is heavy and we need to stabilize the implementation ASAP.

TeresiaOlsson · 2026-06-01T21:25:35Z

The only option I have found to avoid the duplication which I slightly believe in is discussed here: https://stackoverflow.com/questions/65888153/creating-a-pydantic-model-dynamically-from-a-python-dataclass

That would define dataclasses in the business logic which would be dynamically translated into Pydantic basemodels for the validation.
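
For reference, that approach can be sketched with `pydantic.create_model`. The dataclass `BPMConfig` and the helper `model_from_dataclass` are hypothetical names, and the sketch only handles simple field types, defaults and default factories.

```python
import dataclasses

from pydantic import BaseModel, Field, ValidationError, create_model


@dataclasses.dataclass
class BPMConfig:  # hypothetical domain-side dataclass
    name: str
    x_offset: float = 0.0
    y_offset: float = 0.0


def model_from_dataclass(dc: type) -> type[BaseModel]:
    """Build a pydantic model with the same field names, types and defaults."""
    definitions = {}
    for f in dataclasses.fields(dc):
        if f.default is not dataclasses.MISSING:
            default = f.default
        elif f.default_factory is not dataclasses.MISSING:
            default = Field(default_factory=f.default_factory)
        else:
            default = ...  # Ellipsis marks the field as required
        definitions[f.name] = (f.type, default)
    return create_model(f"{dc.__name__}Schema", **definitions)


BPMSchema = model_from_dataclass(BPMConfig)
validated = BPMSchema.model_validate({"name": "BPM1", "x_offset": "0.25"})
config = BPMConfig(**validated.model_dump())
```

The generated model mirrors the dataclass exactly, which is also its limitation: the validation schema cannot differ from the internal model without extra machinery.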

But in the end my conclusion after considering different options is the same as the one someone else reaches in the comments there: it's better to just write it twice, because over time the internal model and the validation schema might start to differ.

I think we have already started to see that. Validation and generation of the JSON schema don't work as well when arbitrary types are allowed, but it wouldn't make sense to restrict the domain classes to not take custom classes as input. For the majority of the schema classes I have introduced, there is duplication of the attribute names but not of their types, for exactly this reason. The validation schema and the internal model work better when they are different because they have different purposes and therefore different requirements.

TeresiaOlsson added 27 commits May 20, 2026 10:08

Change to schema instead of configmodel for quadrupole connected classes.

e50679c

Refactor config models for arrays.

e51ae66

Refactor BPM config models.

a5116d1

Refactor config modules for curves and matrices.

a0c640e

Refactor config models for diagnostics.

a931493

Refactor config modules for RF.

afee235

Refactor config models for lattice.

ffa50cd

Refactoring of magnet config models.

af678a7

Modifications of lattice element linker config model.

ecdc505

Refactor tuning tools config models.

d73d4b6

Moved __pyaml_repr__ to a separate module since is a generic utility function.

609ac1f

Cleanup of the element module.

575a421

Cleanup TuneMonitor.

ef19ade

Cleanup of attributes for RF.

02786ab

Refactor accelerator config model.

d1f3af8

Add schema for controlsystem.

f4e55aa

Add first implementation of schema registry.

5e4aaa6

First implementation of schema registry.

ef74bd5

Modifications to the schema registry to handle legacy yaml files.

b65922b

Add functionality to convert from old yaml file format to new one.

51cb4d8

Update the legacy registry.

ab6a632

Modifications to get validation to work including dependencies from bindings packages.

d8b80fa

All BESSY2 examples can be validated.

8a06071

Starting to update tests to new format.

4832e61

Cleanup of the schema registry + add docstrings.

91ba0df

Started to added tests for the schema registry.

42205d1

Remove __setitem__ and make a separate update method instead.

4e7cac4

TeresiaOlsson marked this pull request as draft May 20, 2026 08:50

TeresiaOlsson changed the title ~~Add a Schema registry~~ Add a schema registry May 20, 2026

Change get method in schema registry to always return None if missing.

0bb0e36

TeresiaOlsson added 8 commits May 27, 2026 10:00

Fix wrong schema for Tango control system.

9a71b2a

Change configurationschemas from validation_alias to alias.

e3c4dd3

Change simple magnets to all use magnetschema.

2a265e3

Fix errors in SimulatorSchema.

be42520

Separate schemas for BPMSimpleModel and BPMTiltOffsetModel.

0d6a61b

Change so acceleratorschema has empty lists instead of None.

c7bb750

Changes to the schema generator to include possible class strings as literals.

32595b7

Use anyOf instead of oneOf in all schemas.

001e065

JeanLucPons previously approved these changes May 28, 2026

TeresiaOlsson self-assigned this May 28, 2026

Add json schema.

5f7f5d2

TeresiaOlsson dismissed JeanLucPons’s stale review via 5f7f5d2 May 28, 2026 11:51
