[midPoint] Blog: A Road To Axiom

Wed May 13 16:51:23 CEST 2020

Dear midPoint community,

MidPoint is a fully schema-aware system. MidPoint eats and breathes the 
schema from the very bottom to the very top. Therefore we need a 
language to express the schema. MidPoint was built on XML Schema 
Definition (XSD) and we have lived in that uneasy relationship for 
years. But now it is the right time to make a big step forward.

The concept of schema completely permeates midPoint. You cannot really 
do anything with midPoint without dealing with schema, directly or 
indirectly. Connectors represent attribute names and types using schema. 
That schema is used by midPoint mappings to correctly convert data 
types. The schema is used by the user interface to automatically create 
correct input fields for data. Schema is used to customize and extend 
the midPoint data model. Schema is everywhere. This is one of the 
fundamental principles of midPoint. It lowers deployment effort, it makes 
customization easier and it provides some guarantees about correctness 
of the configuration.

The midPoint project started in 2011, but some parts of the midPoint 
design go back even further. XML Schema Definition (XSD) was an obvious 
choice for a schema definition language at that time. We were not happy 
with XSD and the XML ecosystem even at that early time, but there was 
nothing better we could use. MidPoint has evolved during all these 
years. XML is no longer the only data language we support; there are 
also JSON and YAML. But XSD has remained our schema language to this 
day. We considered using JSON Schema instead, but it does not provide 
any significant advantage over XSD. In fact, we considered several 
schema languages 
<https://docs.evolveum.com/midpoint/midprivacy/phases/01-data-provenance-prototype/existing-languages-analysis/> 
at several points in the midPoint development process. But the result 
was always the same: there is no schema language that really suits our 
needs. Switching from XSD to any other existing language would mean a 
lot of work just to get to the same place where we already are.

The problem with XML schema is that it describes XML data structures. 
The problem with JSON schema is that it describes JSON data structures. 
These languages are designed to describe data represented in a very 
specific format. We need something else. We need a way to describe 
data structures that can be used in a wide variety of ways: data in a 
JSON file, data in relational database tables, data provided by a 
RESTful interface, data displayed in a user interface and so on. This 
may seem easy, but the devil is in the details. For example, XML has 
namespaces, JSON does not (unless it is JSON-LD, which kind of has 
namespaces). XML has attributes, JSON does not. JSON and XML assume 
ordering in multi-value data, but such an assumption is a problem when 
data are stored in a relational database or LDAP. XML has XPath, which 
is overkill, and JSONPath is pretty much the same. It is all one big 
mess. One can 
survive in this world by making a lot of compromises and violating a 
couple of standards. That is what we have done with XSD and it kind of 
worked. We have been (ab)using XSD for the purpose of data modelling for 
many years. But we got to know all the problems quite intimately. Nobody 
can say that we have not tried hard enough 
<https://docs.evolveum.com/midpoint/midprivacy/phases/01-data-provenance-prototype/xsd-keywords-use/>. 
What is even worse, JSON Schema, YANG or SCIM schema are built on the 
same principles as XSD and therefore they are not going to solve the 
fundamental issues either.

What we need to do is to go one level of abstraction up. We do not want 
to model XML or JSON data. We want to model /data/, regardless of their 
actual representation or storage mechanism. That was quite clear as 
early as 2012, when we designed Prism 
<https://wiki.evolveum.com/display/midPoint/Prism+Objects> as an 
abstraction layer in midPoint code. Prism was used to model the /data/, 
not just their XML representation. That decision allowed us to implement 
JSON and YAML support in midPoint in quite an elegant way. Prism has 
evolved during all these years, but it was always limited in its 
capabilities. And XSD played a significant part in these limitations. We 
knew for years that we had to do something about it. But solving 
this problem properly is not an easy task. And we always managed to push 
XSD a bit further, to make it play one more dirty trick. This worked for 
more than 6 years.
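
To make the idea of modeling /data/ rather than one representation a bit 
more concrete, here is a minimal sketch in Python. It is not the Prism 
API and not Axiom; the names QName, Item, to_json and to_xml are 
hypothetical and exist only for this illustration. It assumes that an 
item has a namespace-qualified name and an unordered set of string 
values, and that each serializer decides on its own how to cope with 
what its format lacks (JSON drops the namespace here, and both formats 
get a sorted order only to keep the output deterministic).

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass


# Hypothetical, format-neutral model (illustration only, not Prism).
@dataclass(frozen=True)
class QName:
    namespace: str
    local: str


@dataclass(frozen=True)
class Item:
    name: QName
    values: frozenset  # multi-value data is an unordered set in the model


def to_json(items):
    # JSON has no namespaces: this sketch simply uses the local name as key.
    # Sorting gives the unordered values a stable order in the output.
    return json.dumps({i.name.local: sorted(i.values) for i in items},
                      sort_keys=True)


def to_xml(root_name, items):
    # ElementTree uses the "{namespace}local" notation for qualified names.
    root = ET.Element(f"{{{root_name.namespace}}}{root_name.local}")
    for item in items:
        for value in sorted(item.values):
            child = ET.SubElement(
                root, f"{{{item.name.namespace}}}{item.name.local}")
            child.text = value
    return ET.tostring(root, encoding="unicode")
```

With this model, two items that hold the same values in a different 
order compare as equal, because ordering is a property of a particular 
representation and not of the data itself. A real implementation would 
also need typed values, containers, references and a strategy for 
keeping namespaces in JSON, none of which is shown here.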

Enter midPrivacy <https://docs.evolveum.com/midpoint/midprivacy/>. We 
have been working on data protection features 
<https://evolveum.com/introducing-midprivacy-initiative/> for quite some 
time. But it was 2019 when we got our chance to take it to the next 
level. NGI <https://www.ngi.eu/> has an NGI_TRUST 
<https://www.ngi.eu/ngi-projects/ngi-trust/> project that looked like a 
perfect opportunity for us. We were more than aware that data protection 
is as much about /meta-data/ as it is about data. You can make proper 
use of the data only if you know how reliable the data are, where they 
come from and whether you are entitled to use them at all. Meta-data 
capability is a basic building block for pretty much any data protection 
platform. It provides visibility and accountability. Obviously, we 
needed that in midPoint as well. Therefore we put together a proposal 
for the NGI_TRUST open call. And we were very lucky to get the funding.

However, everything gets quite complex when it comes to meta-data. We 
need to keep such meta-data for every value of every data item. And the 
meta-data are going to be slightly different for every midPoint 
deployment. This adds an entirely new /dimension/ of data modeling, a 
new dimension of complexity. This is very hard to do with conventional 
data modeling languages. We might try to make XSD play one more dirty 
trick, and after all these years of XSD hacking we might actually 
succeed. But we have decided that this is the point where we finally say 
good-bye to XSD and do it properly. We started by double-checking 
<https://docs.evolveum.com/midpoint/midprivacy/phases/01-data-provenance-prototype/existing-languages-analysis/> 
that we are not missing any obvious solution. But there was no solution 
that could satisfy our needs.

That is how Axiom was born 
<https://docs.evolveum.com/midpoint/midprivacy/phases/01-data-provenance-prototype/axiom-notes/>. 
Axiom is a new data modeling language we are working on right now. It is 
still a baby, still wildly evolving. But it is starting to take shape 
<https://docs.evolveum.com/midpoint/midprivacy/phases/01-data-provenance-prototype/axiom/>. 
The first ambition of Axiom is to replace XSD in midPoint. But that 
would not be enough to justify the existence of a new language. We need 
Axiom to do more than that. Our goal is to use Axiom to define a 
/meta-data schema/. We want to maintain complex meta-data structures for 
every data value. The data will be modeled by one Axiom schema, and the 
meta-data will be modeled by another, independent Axiom schema. These 
schemas will be /orthogonal/: independently developed, independently 
maintained, independently extended and customized for every deployment. 
We want to join the schemas inside midPoint at run-time. This is a way 
to create a two-dimensional schema from two simple schemas without 
ending up with code of insane complexity. This is the right way to 
implement data provenance capabilities.
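
The idea of joining two orthogonal schemas at run-time can be sketched 
roughly as follows. This is only an illustration in Python, not Axiom 
syntax and not midPoint code. The schemas are reduced to plain 
name-to-type dictionaries, and DATA_SCHEMA, METADATA_SCHEMA, Value, 
check and validate are hypothetical names invented for this example, as 
are the item names in them.

```python
from dataclasses import dataclass, field

# Hypothetical, drastically simplified schemas: item name -> expected type.
# The two dictionaries know nothing about each other.
DATA_SCHEMA = {"fullName": str, "employeeNumber": int}
METADATA_SCHEMA = {"source": str, "assurance": str}


@dataclass
class Value:
    real: object                                   # the data value itself
    metadata: dict = field(default_factory=dict)   # meta-data of this value


def check(schema, name, value):
    expected = schema.get(name)
    if expected is None:
        raise KeyError(f"item {name!r} is not defined in the schema")
    if not isinstance(value, expected):
        raise TypeError(f"item {name!r} expects {expected.__name__}")


def validate(obj, data_schema, metadata_schema):
    """Join both schemas at run time: each value is checked against the
    data schema, and its meta-data against the meta-data schema."""
    for name, values in obj.items():
        for value in values:
            check(data_schema, name, value.real)
            for meta_name, meta_value in value.metadata.items():
                check(metadata_schema, meta_name, meta_value)
    return True


user = {
    "fullName": [Value("Jack Sparrow", {"source": "hr", "assurance": "high"})],
    "employeeNumber": [Value(1234, {"source": "hr"})],
}
validate(user, DATA_SCHEMA, METADATA_SCHEMA)
```

In this sketch a deployment could extend the meta-data schema, for 
example with dict(METADATA_SCHEMA, legalBasis=str), without touching the 
data schema at all. Neither schema has to enumerate the combinations of 
data items and meta-data items, which is the point of keeping the two 
dimensions separate.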

We are now working on a prototype implementation of the processing code 
for Axiom and adjusting the Axiom language specification at the same time. 
We believe that something like Axiom cannot be designed on a drawing 
board or in a standards committee. This needs experimentation, 
prototyping and evolution. We are proceeding in iterations, using the 
midPoint code as a test bed. Therefore we expect that Axiom will be 
evolving for quite some time until it is completely ready. But we 
believe that this is a step in the right direction. This is more than 
likely to bring a lot of long-term benefits.

Finally, we are more than grateful for this opportunity and we would 
like to thank everyone in NGI for our chance to make another step 
towards a robust and professional data protection platform that can be 
used by everybody. We appreciate that the European Union is not just 
imposing data protection regulations, but that it is also contributing 
to open source technologies that can be used to implement practical data 
protection mechanisms. We are more than happy for this opportunity to 
push the technology one small step forward.

This project has received funding from the European Union’s Horizon 2020 
research and innovation programme under the NGI_TRUST grant agreement no 
825618.

(Reposted from Evolveum blog <https://evolveum.com/a-road-to-axiom/>)

-- 
Radovan Semancik
Software Architect
evolveum.com
