
Add a schema registry #264

Draft
TeresiaOlsson wants to merge 52 commits into main from schema-registry

Conversation

@TeresiaOlsson
Contributor

@TeresiaOlsson TeresiaOlsson commented May 20, 2026

This PR adds a schema registry that can be used for validation and generating json schemas for dynamic nested pydantic models.

Motivation:

  • Better separation of concerns. Validation of the configuration is separated into different classes than the ones responsible for storing and using the configuration, resulting in a separate validation layer.

  • It is possible to make validation optional. The entire validation layer can be skipped if the user wishes, for example if they know that the input data has already been validated once and has not changed.

  • Looser dependency on pydantic. Pydantic is only used at the edge of the core and not everywhere, in line with what has been discussed for how to manage the dependency on pint.

  • Solving two problems with pydantic:

    1. It is not good at handling nested models where the available types are only known at runtime.
    2. It is not good at generating json schemas for arbitrary types and including all available subclasses of a model.

Features:

  • A schema registry which maps a class path to the schema that should be used for validation of the input data to the class.

  • A decorator to automatically register schemas in the schema registry.

  • A schema validator which validates nested dynamic models with the use of the registry.

  • A custom json schema generator which generates json schemas including all available subclasses with the use of the registry.
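A minimal sketch of how a registry and a registering decorator can fit together is shown here. It is not the code from this PR: the names `SchemaRegistry`, `register_schema`, `QuadrupoleSchema` and the class path `example_pkg.magnets.Quadrupole` are hypothetical, and plain classes are used instead of pydantic models so that the example has no dependencies.

```python
from typing import Callable


class SchemaRegistry:
    """Maps a fully qualified class path to the schema used to validate its input."""

    def __init__(self) -> None:
        self._schemas: dict[str, type] = {}

    def register(self, class_path: str, schema: type) -> None:
        # Refuse silent overwrites so that two packages cannot claim the same class.
        if class_path in self._schemas:
            raise ValueError(f"a schema is already registered for {class_path!r}")
        self._schemas[class_path] = schema

    def get(self, class_path: str) -> type:
        try:
            return self._schemas[class_path]
        except KeyError:
            raise KeyError(f"no schema registered for {class_path!r}") from None

    def class_paths(self) -> list[str]:
        return sorted(self._schemas)


# Hypothetical module-level default registry.
REGISTRY = SchemaRegistry()


def register_schema(
    class_path: str, registry: SchemaRegistry = REGISTRY
) -> Callable[[type], type]:
    """Class decorator that registers the decorated schema under class_path."""

    def decorator(schema: type) -> type:
        registry.register(class_path, schema)
        return schema

    return decorator


# Hypothetical schema; in the PR this would inherit from ConfigurationSchema.
@register_schema("example_pkg.magnets.Quadrupole")
class QuadrupoleSchema:
    required = ("name",)


print(REGISTRY.class_paths())
```

A validator or json schema generator would then only need the registry to find every schema that is available at runtime.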

Major changes:

  • All ConfigModels have been separated from the domain classes and turned into schema classes whose only purpose is to define the schema used for validation. This is a major refactoring since the ConfigModels are spread out across the codebase. A temporary legacy handler had to be implemented to handle the packages for the bindings. All domain classes also had to be changed since they now need to explicitly declare their attributes.

  • A base class ConfigurationSchema has been added that all schemas must inherit from. This ensures that all schemas registered in the registry have the minimum required fields and the same behaviour.

  • The yaml file format has been changed so that the type is defined by the class instead of by the module. This is to be compatible with external packages where a certain module structure can't be enforced, and to allow several similar classes to be grouped in the same module. A temporary function to translate between the yaml file formats has been added.

Example usage:
Usage examples are available in https://github.com/python-accelerator-middle-layer/pyaml/tree/schema-registry/examples/validation.

The json schema can be visualized and tested with MetaConfigurator to make it easier to understand what it includes. You can import the schema directly from here: https://github.com/python-accelerator-middle-layer/pyaml/blob/schema-registry/pyaml/validation/schema.json

@TeresiaOlsson TeresiaOlsson marked this pull request as draft May 20, 2026 08:50
@TeresiaOlsson TeresiaOlsson changed the title Add a Schema registry Add a schema registry May 20, 2026
JeanLucPons previously approved these changes May 28, 2026
@TeresiaOlsson TeresiaOlsson self-assigned this May 28, 2026
@JeanLucPons
Contributor

I do not understand why some external modules appear in the committed json files.
Are these files generated?

@TeresiaOlsson
Contributor Author

I will explain it better at the meeting tomorrow. It's a temporary, quite ugly fix I added to be able to register the schemas for the binding packages and convert the format of the yaml files. There is also one related ugly thing in the pyproject.toml at the moment. I didn't manage to figure out a less ugly way to handle it.

My plan is that all of that should go away when people have converted the yaml file format and I have updated the packages for the bindings to instead use entry points to find the schemas in there.

@JeanLucPons
Contributor

OK thanks

@TeresiaOlsson
Contributor Author

For the json schema I committed the full schema including the external packages to be able to provide a link which people can directly load in the metaconfigurator for testing without having to generate it themselves.

But I'm not sure if there should be a "standard" schema available in the repository in the future or not. I kind of like the idea of having a pregenerated schema including the most common external packages so new users can get started without having to generate the schema themselves. Then you only need to generate your own schema if you have facility specific packages. But if there should be such a basic, standard schema, maybe it should be in its own repository. Then in that repository we could also have different files for different parts of the schema if we want.

@JeanLucPons
Contributor

OK, so it would be nice to provide a way to generate this schema with the modules you want to use.
Using the catalog, it should be possible to configure your backend using only simple strings.

@TeresiaOlsson
Contributor Author

I think that already works. As long as you have everything you want to include registered in the schema registry, it will show up in the json schema. The key is to get external packages to register automatically in the schema registry and to make that compatible with what the catalog needs. Those parts don't work automatically yet and might require some discussion to figure out the best way, since I'm still working on understanding the details of the catalog to be able to make it compatible. But at least you can always manually register everything you want to include.

@JeanLucPons
Contributor

I'm starting to look at your implementation.
If I understand your strategy correctly, your plan is to remove pydantic validation from the Factory and let the factory expand the dictionary to construct objects?
If I'm right, please think about how to handle the error line number in validation (see Factory.handle_validation_error).
Your validation mechanism should handle the __location__ and __fieldlocation__ private fields coming from the loader.
Thanks.

@JeanLucPons
Contributor

For me the purpose of Adapted_ControlSystem is a bit unclear. Would it also be possible to have one schema per magnet type to avoid those unwanted layers (item1->item1)?

image

@TeresiaOlsson
Contributor Author

> I'm starting to look at your implementation. If I understand your strategy correctly, your plan is to remove pydantic validation from the Factory and let the factory expand the dictionary to construct objects? If I'm right, please think about how to handle the error line number in validation (see Factory.handle_validation_error). Your validation mechanism should handle the __location__ and __fieldlocation__ private fields coming from the loader. Thanks.

Yes, my plan is to make the validation a completely separate step done before the factory so the factory is only responsible for building objects. I'm aware of the error line number and also the function to reformat the validation errors from pydantic so they become easier to understand. I want to move that functionality into the SchemaValidator as part of refactoring the factory.
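As a rough illustration of that split, and not the actual SchemaValidator, the sketch below validates first and only then hands a clean dictionary to a builder. It assumes that the loader stores a `(file, line, column)` tuple under `__location__`; the real format may differ. `validate`, `build`, `ConfigValidationError` and `Quadrupole` are hypothetical names.

```python
class ConfigValidationError(Exception):
    """Raised when the input data does not satisfy the schema."""


def validate(data: dict, required: dict[str, type]) -> dict:
    """Check required fields and their types, reporting the source location."""
    # Assumption: the loader attaches (file, line, column) under "__location__".
    location = data.get("__location__")
    where = "{}:{}:{}".format(*location) if location else "<unknown location>"

    errors = []
    for name, expected in required.items():
        if name not in data:
            errors.append(f"{where}: missing field '{name}'")
        elif not isinstance(data[name], expected):
            errors.append(
                f"{where}: field '{name}' expected {expected.__name__}, "
                f"got {type(data[name]).__name__}"
            )
    if errors:
        raise ConfigValidationError("\n".join(errors))

    # Only validated fields are returned, so loader metadata never reaches the builder.
    return {name: data[name] for name in required}


class Quadrupole:
    """Hypothetical domain class that declares its attributes explicitly."""

    def __init__(self, name: str, strength: float) -> None:
        self.name = name
        self.strength = strength


def build(cls: type, validated: dict):
    """The factory step: expand the validated dictionary into the constructor."""
    return cls(**validated)


raw = {"name": "QF1", "strength": 1.2, "__location__": ("ring.yaml", 12, 3)}
quad = build(Quadrupole, validate(raw, {"name": str, "strength": float}))
print(quad.name, quad.strength)
```

Because the location is consumed in the validation step, the builder stays unaware of where the data came from.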

@JeanLucPons
Contributor

> Yes, my plan is to make the validation a completely separate step done before the factory so the factory is only responsible for building objects. I'm aware of the error line number and also the function to reformat the validation errors from pydantic so they become easier to understand. I want to move that functionality into the SchemaValidator as part of refactoring the factory.

OK. The refurbishment of the factory should not be a big deal. You only need to use the new class field to construct the object and expand the dictionary. What I dislike here is the duplication of the constructor signature (as discussed a long time ago), which can be a source of errors that are difficult to debug. But there is no other way to allow construction by code without having to construct the schema first.
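One way to catch a mismatch between a schema and a constructor early is a small consistency check that can run in the test suite. The sketch below is only an illustration of that idea using `inspect.signature`; the function names and the `ChromaticityMonitor` constructor signature are hypothetical.

```python
import inspect


def constructor_parameters(cls: type) -> list[str]:
    """Names of the parameters of cls.__init__ that can be passed by keyword."""
    named = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    params = inspect.signature(cls.__init__).parameters.values()
    return [p.name for p in params if p.name != "self" and p.kind in named]


def schema_constructor_mismatches(schema_fields: list[str], cls: type) -> list[str]:
    """Return readable descriptions of every mismatch; empty when both agree."""
    ctor = constructor_parameters(cls)
    problems = []
    for name in schema_fields:
        if name not in ctor:
            problems.append(
                f"schema field '{name}' is not accepted by {cls.__name__}.__init__"
            )
    for name in ctor:
        if name not in schema_fields:
            problems.append(
                f"{cls.__name__}.__init__ parameter '{name}' is missing from the schema"
            )
    return problems


class ChromaticityMonitor:
    """Hypothetical domain class with an explicit constructor signature."""

    def __init__(self, name: str, n_step: int, n_avg_meas: int) -> None:
        self.name = name
        self.n_step = n_step
        self.n_avg_meas = n_avg_meas


# A schema with a misspelled field is reported instead of failing at build time.
for problem in schema_constructor_mismatches(
    ["name", "n_step", "n_avg"], ChromaticityMonitor
):
    print(problem)
```

If the factory passes the validated dictionary as keyword arguments, differences in parameter order are harmless and only the names have to agree.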

@TeresiaOlsson
Contributor Author

> For me the purpose of Adapted_ControlSystem is a bit unclear. Would it also be possible to have one schema per magnet type to avoid those unwanted layers (item1->item1)?

The Adapted_ControlSystem is part of the ugly fix to be able to include the ConfigModel from the binding packages even though they don't follow the new format and don't inherit from ConfigurationSchema, which is a requirement for being added to the SchemaRegistry. It will go away as soon as I refactor the schemas in those packages. They are just called Adapted_ControlSystem at the moment to indicate that they are part of the legacy fix.

For the magnets I also don't like the two layers of item1->item1 in the GUI, but I found no way around it. Previously there was a separate schema for each magnet type, but that just meant I had to choose OctupoleSchema in the first box and still choose Octupole as the class, since the GUI doesn't prefill the class even if there is only one option. In the end I concluded that these are issues with the metaconfigurator GUI which can be fixed in the future, either by writing our own GUI application or by making feature requests to the metaconfigurator. Instead I decided to follow the principle of only adding a new schema if there is a field which makes it different. My hope was that this would eventually allow us to have oneOf instead of anyOf in the json schema, which would also make the GUI work more nicely. But it turns out that oneOf is a very strict requirement: it isn't sufficient to have two schemas with different fields if the fields that make them different are all optional. They need to have mandatory fields which are different, and that might be too strict a requirement for us. But I'm still thinking about that.
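The oneOf problem can be shown with a deliberately simplified matcher. This is not a JSON Schema implementation: it only checks `required` and, as JSON Schema does by default, allows unknown properties. The schema titles and field names are made up for the example.

```python
def matches(instance: dict, schema: dict) -> bool:
    """Simplified check: the instance matches if all required keys are present.

    Unknown keys are accepted, which mirrors the JSON Schema default when
    additionalProperties is not set to false.
    """
    return all(key in instance for key in schema.get("required", []))


def any_of(instance: dict, schemas: list[dict]) -> bool:
    """anyOf: at least one schema matches."""
    return any(matches(instance, s) for s in schemas)


def one_of(instance: dict, schemas: list[dict]) -> bool:
    """oneOf: exactly one schema matches."""
    return sum(matches(instance, s) for s in schemas) == 1


# Two schemas that differ only in optional fields.
optional_only = [
    {"title": "SchemaA", "required": ["name"], "properties": {"name": {}, "model": {}}},
    {"title": "SchemaB", "required": ["name"], "properties": {"name": {}, "skew": {}}},
]

# Two schemas whose mandatory fields differ.
mandatory_differs = [
    {"title": "SchemaA", "required": ["name", "model"]},
    {"title": "SchemaB", "required": ["name", "skew"]},
]

instance = {"name": "OCT1", "skew": True}

print(any_of(instance, optional_only), one_of(instance, optional_only))
print(any_of(instance, mandatory_differs), one_of(instance, mandatory_differs))
```

In the first case the instance satisfies both schemas, so anyOf accepts it while oneOf rejects it. In the second case only SchemaB matches, so oneOf accepts it as well.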

@TeresiaOlsson
Contributor Author

> > Yes, my plan is to make the validation a completely separate step done before the factory so the factory is only responsible for building objects. I'm aware of the error line number and also the function to reformat the validation errors from pydantic so they become easier to understand. I want to move that functionality into the SchemaValidator as part of refactoring the factory.
>
> OK. The refurbishment of the factory should not be a big deal. You only need to use the new class field to construct the object and expand the dictionary. What I dislike here is the duplication of the constructor signature (as discussed a long time ago), which can be a source of errors that are difficult to debug. But there is no other way to allow construction by code without having to construct the schema first.

I also disliked the duplication of the constructor signature initially. It was annoying to have to write the same thing twice. I tried some ideas to avoid it but found none that I liked and which was easier to use than copy-paste. But after a while I actually started to like it because it made it explicit what the attributes are for building the domain objects.

I found several cases where there were attributes in the ConfigModel which were never used for anything because they were hidden inside the _cfg and never exposed as public attributes in the domain class. Having to write all the attributes in the constructor forced me to think about which attributes should be private or public, which need properties because they should be read only or need some extra step when setting, etc. For new users and developers I imagine that this is beneficial because it forces them to think about what the public interface of the class should be. Then, when they are done figuring that out, they can add the schema or register an already existing schema for their class.

I also liked how it forced me to think about the scope of the class. In a pydantic basemodel it is very easy to add 10 attributes but when you need to add all of them again to __init__ you start to realise that maybe they are too many. I think many attributes can be an indication of a class having a too large scope and that it should be separated into smaller classes to get a better separation of concerns.

@JeanLucPons
Contributor

My main problem is that this duplication can be a source of errors: if you forget one parameter or enter them in the wrong order, you may face issues that are tricky to solve. Duplication is generally not a good idea.
Today _cfg is supposed to be private, which means that it should only be accessed from the pyaml core and not by the user. I would prefer to add some public methods for dynamic configuration changes. Anyway, in your schema, additional methods for changing the config will also be required.
For example:

Today:

# Set up the chromaticity monitor (override config settings)
CM = SR.get_chromaticity_monitor("CHROMATICITY_MONITOR")
CM._cfg.n_step = 3      # 3 points for chroma fit
CM._cfg.n_avg_meas = 1  # No averaging

should be:

# Set up the chromaticity monitor (override config settings)
CM = SR.get_chromaticity_monitor("CHROMATICITY_MONITOR")
CM.set_n_step(3)      # 3 points for chroma fit
CM.set_n_avg_meas(1)  # No averaging

@TeresiaOlsson
Contributor Author

> My main problem is that this duplication can be a source of errors: if you forget one parameter or enter them in the wrong order, you may face issues that are tricky to solve. Duplication is generally not a good idea. Today _cfg is supposed to be private, which means that it should only be accessed from the pyaml core and not by the user. I would prefer to add some public methods for dynamic configuration changes. Anyway, in your schema, additional methods for changing the config will also be required. For example:
>
> Today:
>
> # Set up the chromaticity monitor (override config settings)
> CM = SR.get_chromaticity_monitor("CHROMATICITY_MONITOR")
> CM._cfg.n_step = 3      # 3 points for chroma fit
> CM._cfg.n_avg_meas = 1  # No averaging
>
> should be:
>
> # Set up the chromaticity monitor (override config settings)
> CM = SR.get_chromaticity_monitor("CHROMATICITY_MONITOR")
> CM.set_n_step(3)      # 3 points for chroma fit
> CM.set_n_avg_meas(1)  # No averaging

I'm not sure I understand. In the schema there should be no method to change the configuration? The purpose of the schema classes is purely to define which fields are required for validation. You would then use the SchemaValidator to validate the data, but the basemodel that is returned will not be used for anything other than extracting a dictionary with the validated data to be passed to other classes. It is not meant to be stored. The parts which cover how to store and dynamically change the configuration are outside the scope of this PR.

However, if people like the idea, it is possible to later make an implementation where dynamic changes of the configuration also call the SchemaValidator to validate the data again before it is changed. I have started to think about that, but the PR became too large so I decided to split it and first make one which is only focused on the validation step.

@JeanLucPons
Contributor

JeanLucPons commented May 29, 2026

My remark was rather about the consequences of your validation model (not the schema itself) and the fact that fields have to be duplicated at the object level, which means that you in fact have three duplications:

  • model schema
  • object constructor signature
  • object fields to access the configuration

In an ideal world, to follow the pyaml coding style, I would like to be able to write:

# Set up the chromaticity monitor (override config settings)
CM = SR.get_chromaticity_monitor("CHROMATICITY_MONITOR")
CM.n_step.set(3)      # 3 points for chroma fit
CM.n_avg_meas.set(1)  # No averaging

and to have a mechanism (using a decorator or dynamic code generation) that automatically maps schema fields to object getters and setters, with an optional callback which allows the object to be informed of field updates. It will require being able to select at the schema level whether a field is R or RW.
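A very small sketch of such a mechanism is given below, only to make the idea concrete. Nothing here exists in pyaml: `ConfigField`, `config_fields` and `on_field_change` are hypothetical names, the decorator takes the R/RW flag directly instead of reading it from a schema, and the constructor is replaced by a keyword-only one for brevity.

```python
from typing import Any, Callable, Optional

ChangeCallback = Callable[[str, Any, Any], None]


class ConfigField:
    """Holds one configuration value and exposes get() and set()."""

    def __init__(
        self,
        name: str,
        value: Any,
        writable: bool = True,
        on_change: Optional[ChangeCallback] = None,
    ) -> None:
        self.name = name
        self._value = value
        self._writable = writable
        self._on_change = on_change

    def get(self) -> Any:
        return self._value

    def set(self, value: Any) -> None:
        if not self._writable:
            raise AttributeError(f"field '{self.name}' is read-only")
        old, self._value = self._value, value
        # Inform the owning object, if it asked to be notified.
        if self._on_change is not None:
            self._on_change(self.name, old, value)


def config_fields(**writable_by_name: bool):
    """Class decorator: maps field name -> True (RW) or False (R)."""

    def decorator(cls):
        original_init = cls.__init__

        def __init__(self, **values: Any) -> None:
            original_init(self)
            callback = getattr(self, "on_field_change", None)
            for name, writable in writable_by_name.items():
                setattr(self, name, ConfigField(name, values[name], writable, callback))

        cls.__init__ = __init__
        return cls

    return decorator


@config_fields(name=False, n_step=True, n_avg_meas=True)
class ChromaticityMonitor:
    def __init__(self) -> None:
        self.changes: list[tuple[str, Any, Any]] = []

    def on_field_change(self, name: str, old: Any, new: Any) -> None:
        self.changes.append((name, old, new))


CM = ChromaticityMonitor(name="CHROMATICITY_MONITOR", n_step=5, n_avg_meas=2)
CM.n_step.set(3)      # 3 points for chroma fit
CM.n_avg_meas.set(1)  # No averaging
print(CM.n_step.get(), CM.n_avg_meas.get(), CM.changes)
```

The initial values 5 and 2 are arbitrary. A real implementation would also need to re-validate the new value before storing it.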

@TeresiaOlsson
Contributor Author

> My remark was rather about the consequences of your validation model (not the schema itself) and the fact that fields have to be duplicated at the object level, which means that you in fact have three duplications:
>
> * model schema
> * object constructor signature
> * object fields to access the configuration
>
> In an ideal world, to follow the pyaml coding style, I would like to be able to write:
>
> # Set up the chromaticity monitor (override config settings)
> CM = SR.get_chromaticity_monitor("CHROMATICITY_MONITOR")
> CM.n_step.set(3)      # 3 points for chroma fit
> CM.n_avg_meas.set(1)  # No averaging
>
> and to have a mechanism (using a decorator or dynamic code generation) that automatically maps schema fields to object getters and setters, with an optional callback which allows the object to be informed of field updates. It will require being able to select at the schema level whether a field is R or RW.

If you want that, I think it should be implemented in a way that makes the object constructor signature the source of truth. There is a way to do that in pydantic using TypeAdapter, but when I tried it I didn't really like it. In the end my conclusion was that the duplication was the more robust way to implement it, especially after I concluded that there are many classes which require the exact same input, so there is no need to have separate schemas.

@JeanLucPons
Contributor

OK. For me it is a bit heavy to have all these duplications and also lose the repr from pydantic.

For instance this will not work:

    arrays: list[ArraySchema] = Field(default_factory=list, repr=False)
    devices: list[ElementSchema] = Field(default_factory=list, repr=False)

Anyway, this refurbishment is heavy and we need to stabilize the implementation ASAP.

@TeresiaOlsson
Contributor Author

The only option I have found to avoid the duplication which I slightly believe in is discussed here: https://stackoverflow.com/questions/65888153/creating-a-pydantic-model-dynamically-from-a-python-dataclass

That would define dataclasses in the business logic which would be dynamically translated into Pydantic basemodels for the validation.
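To show the general shape of that idea without depending on pydantic, the sketch below derives the validation rules from a dataclass with the standard library only. In the approach from the link, the derived fields would instead be turned into a pydantic model (for example with `pydantic.create_model`); the plain `isinstance` check here is a stand-in for that. `ChromaticityMonitorConfig`, its default values and `validate_and_build` are hypothetical.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Any, get_type_hints


@dataclass
class ChromaticityMonitorConfig:
    """Hypothetical dataclass living in the business logic."""

    name: str
    n_step: int = 5
    n_avg_meas: int = 1


def validate_and_build(dc: type, data: dict[str, Any]) -> Any:
    """Validate `data` against the dataclass definition, then instantiate it."""
    hints = get_type_hints(dc)
    known = {f.name for f in fields(dc)}
    errors = [f"unknown field '{key}'" for key in sorted(set(data) - known)]

    for f in fields(dc):
        # A field is required when it has neither a default nor a default factory.
        required = f.default is MISSING and f.default_factory is MISSING
        if f.name not in data:
            if required:
                errors.append(f"missing field '{f.name}'")
            continue
        # Only simple, non-generic annotations are supported in this sketch.
        if not isinstance(data[f.name], hints[f.name]):
            errors.append(
                f"field '{f.name}' expected {hints[f.name].__name__}, "
                f"got {type(data[f.name]).__name__}"
            )

    if errors:
        raise ValueError("; ".join(errors))
    return dc(**data)


cfg = validate_and_build(ChromaticityMonitorConfig, {"name": "CM", "n_step": 3})
print(cfg)
```

The field names and types are written once, in the dataclass, which is the attraction of this approach. The drawback is that the validation rules can no longer differ from the internal model.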

But in the end my conclusion after considering different options is the same as someone else also has in the comments. It's better to just write it twice because over time the internal model and the validation schema might start to differ.

I think we have already started to see that. Validation and json schema generation don't work as well when arbitrary types are allowed, but it wouldn't make sense to restrict the domain classes to not take custom classes as input. For the majority of the schema classes I have introduced, there is duplication of the attribute names but not of their types, for exactly this reason. The validation schema and the internal model work better when they are different because they have different purposes and therefore different requirements.
