Tuesday, November 09, 2010

Are the ISO 21090 Data Types Too Complex?

An interesting series of postings on this topic has appeared on the openEHR lists. Once again, it would seem, HL7 has succeeded in bringing about a situation in which an international standard is needlessly difficult to use.

As Thomas Beale puts it, while ISO 21090 ("Health Informatics — Harmonized data types for information interchange") presents itself as a general datatype specification, it is essentially an HL7 specification: 
No-one could possibly have come up with 21090 as it is today without the starting point being the HL7v3 data types. 
ISO 21090 is thus optimized only for HL7v3 messaging, which is hardly being used. It includes multiple attributes that are of no use to non-messaging users, and it is not defined in a normal object-oriented way. It is, accordingly, “a huge missed opportunity”.

Some extracts from the discussion:

From: Thomas Beale
Date: Sat, 06 Nov 2010 21:35:36
... my list of problems [with ISO 21090], from a cursory examination:

1. The model is defined [in such a way] that all data types inherit from HXIT and then ANY, which contain 7 attributes specific to HL7v3 messages. This means that every other type, such as BL (Boolean), inherits these attributes. This is a basic modelling error, since the normal approach is to separate context-specific attributes (e.g. specific to the use of data values in messages, but not other uses) into ‘wrapper’ classes. The practical effects of this modelling are as follows:
  • There is not a close correspondence between the 21090 idea of ‘ANY’ and the typical Any/Object or other root class of most object-oriented type systems – this name clash would have to be resolved in some way;
  • an implementation of the 21090 data types is forced to have HL7v3-specific attributes in its base classes, which also complicates the use of more orthodox modelling for such purposes;
  • alternatively, to produce a version of 21090 for use outside of HL7v3, a ‘profile’ of some kind has to be developed by ISO and/or CEN.
2. It includes ‘types’ for name and address that are really compositional structures, and would normally be considered to be archetypable or otherwise configurable structures consisting of lists, trees etc of primitive types (String, Integer etc); (this problem has been around forever in HL7. I was in CEN meetings in 2002 or 2003 when people were complaining about this. It might make sense for HL7, but it doesn't in more generic modelling frameworks)

3. It uses a modelling notion called ‘flavours’ defined via ‘common constraint patterns on existing datatypes’, whereby e.g. the timestamp type TS can be constrained to TS.CA.BIRTH, i.e. a variant used in Canada for recording birth dates. The problems with this approach include:
  • it is not supported in any standard industry UML or related tools (e.g. Eclipse Modelling Framework); (It is sort of doable in OO languages, but it breaks the normal spirit of OO modelling, and is not conducive to maintainability)
  • class-names containing the ‘.’ character are not legal in most type systems;
  • it is not generally known or understood by IT practitioners;
  • it is not clear how such ‘constrained types’ should be implemented in normal object-oriented development technologies;
  • it mixes the concept of localised constraint that would normally be defined outside of the software, with ‘hard’ data types that would normally be implemented in the software (e.g. TS would normally be implemented in software, but implementing ‘Canadian birthdate’ is likely to make software brittle).

4. Due to the above problem, date/time types typically needed in clinical data, and archetypes, are defined using types: TS.DATE, TS.DATETIME, although there is no match for the logical type ‘Duration’ or ‘Time’.

5. The error of including context-specific attributes within base types occurs elsewhere in the specification. To give two examples:
  • The type TEL (telecommunications address) includes the attribute ‘useablePeriod’, intended to indicate when the address is useable. Normally such a context attribute would be found within a context specific information structure representing ‘Contact’ or some other typical demographic concept in which not only the date range, but also type / purpose (e.g. ‘business’, ‘home’) might be recorded. 21090 forces it to be in every instance, although it presumably can be empty (as is likely in most instances).
  • The type II (instance identifier) includes the coded attribute ‘reliability’ which indicates whether the identifier was ‘issued by the system’, ‘verified by system’ or ‘unverified by system’.
The modelling style seems to follow the strange HL7 obsession with non-object orientation, popularised in the RIM. In summary, I don't see 21090 as being at all appropriate for the title of the standard, which is "Health Informatics — Harmonized data types for information interchange". Instead, it should just have been called "Data types for HL7-based messaging". It doesn't make sense as an ISO standard; it is really an HL7 standard.
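
As an illustration of the ‘wrapper’ approach described in point 1 above, here is a minimal Python sketch. The class and attribute names (`Boolean`, `MessageContext`, `valid_time_low`, `control_act_id`) are hypothetical and are not taken from ISO 21090 or openEHR; the only point is that message-specific attributes can live in a wrapper, leaving the base value type free of them.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Boolean:
    """A plain data value with no messaging-specific attributes."""
    value: bool


@dataclass(frozen=True)
class MessageContext(Generic[T]):
    """Hypothetical wrapper carrying context that only matters in messages."""
    payload: T
    valid_time_low: Optional[str] = None   # illustrative context attribute
    control_act_id: Optional[str] = None   # illustrative context attribute


# Outside messaging, the bare value is used as-is.
stored = Boolean(True)

# Inside a message, the same value is wrapped together with its context.
in_message = MessageContext(payload=stored, valid_time_low="20101106")

print(stored)
print(in_message.payload is stored)
```

With this shape, code that never touches messages only ever sees `Boolean`, and the context attributes do not have to be profiled out.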

From: Eric Browne
Date: Mon, 8 Nov 2010 13:34:56 +1030

… I'd like to add my voice to Tom's concerns.

I certainly believe that the whole ISO process with respect to health informatics standards is deeply flawed. As Grahame [Grieve] implies with the datatypes standard, the process is politically driven and compromises in modelling, engineering, safety, implementability inevitably occur. The question is how significant are these compromises and what effect will they have on the evolution of e-health?

It is highly unlikely that we would have an ISO standard for "Health Informatics - Harmonized data types for information interchange" without the monumental effort of Grahame Grieve in producing and managing the draft. However, it is, first and foremost, an HL7 flavoured standard. The most recent draft I have seen is, according to its foreword, "a shared document between Health Level Seven (HL7) and ISO". ISO 21090 is undoubtedly complex. One has to question the value of an international standard, if it is so complex that it has to be 'profiled' by different organisations before it can be used. By whom, for what purposes, and by what processes, will such profiling be managed?

ISO 21090 suffers from some of the significant flaws that permeate many HL7 specifications. Tom has already cited the peculiar inheritance hierarchy, amongst others. Another engineering flaw is the pervasive use of cryptic, often ad hoc enumerations. Even the names of the types wouldn't pass muster in most quality engineering schools. Names like ENP, HXIT, CO, EN, EN.TN, CD.CV, URG are simply inexcusable. Levels of indirection never aid readability, and lead to difficulty in implementation and testing.

It is not necessarily sensible to compare openEHR datatypes with ISO 21090. They are designed for different purposes. openEHR datatypes underpin openEHR's reference implementation and archetype object models for building electronic health record software and so can be augmented by these additional artefacts, as described below. The ISO datatypes should be able to stand on their own in a diverse range of implementation environments. This is a much harder task, and bumps up against fundamental principles of information exchange, whereby the assumptions of participating systems need to be carefully considered. Constraints and constraint mechanisms are pivotal here.

A datatype embodies the "agreed" set of values and operations pertaining to that type. If an item of received data "211414" has been denoted to be of type integer, then the receiving system "knows" how to process it, and will process it differently than if it had been denoted as a date (AKA TS.DATE in HL7/ISO/DIS 21090 HI-HDTII). Healthcare includes a very rich vocabulary, and text-based value sets are common in information exchange. A datatype for coded text, say, needs to convey the agreed set of values of that type. Let's firstly consider values for "severity of adverse reaction to medication". Ideally, both the sending and the receiving system need to agree on the set of values, and they may behave sub-optimally if one system uses the set { "undetectable", "mild", "moderate" } and the other uses the set { "mild", "moderate", "severe", "extreme", "almost inevitably fatal" }, even if these values all came from the same terminology. In other words, the sending and receiving systems are not actually using the same datatypes in this case.

How do we deal with this in real systems? The United Kingdom's Connecting for Health program has addressed this in their HL7 V3-based models by carrying the constraint within the datatype - in the coding scheme's identifier. So rather than say the values come from some specific version of SNOMED CT, they constrain the values to a specific subset using a Refset Identifier. And this can be carried in instance data.
Now whilst ISO 21090 is capable of constraining text-based value sets, such constraints are often done by other means - particularly through conformance statements in non-computable documents, most notably HL7 CDA Implementation Guides. We are seeing plenty of this in the US, as a result of their Meaningful Use provisions. In these cases, the datatype does not necessarily carry the constraint. It almost invariably doesn't. This means that in such transactions, the receiving system has no way of knowing the true datatype - i.e. the set of values - for each such data item. The only way for such constraints to be known to the receiving system is through access to HL7 templates - thus violating THE principal tenet of HL7's RIM-based information exchange paradigm.
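
The value-set argument above can be made concrete with a small Python sketch. Everything in it is assumed for illustration: the `CodedText` class, the `value_set_id` field and the identifiers `severity-a` and `severity-b` are hypothetical, and are not the Connecting for Health or ISO 21090 representation. It only shows that a receiver can check a value against the intended set when the instance names that set, and cannot when it does not.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical registry of value sets known to the receiving system.
KNOWN_VALUE_SETS = {
    "severity-a": {"undetectable", "mild", "moderate"},
    "severity-b": {"mild", "moderate", "severe", "extreme",
                   "almost inevitably fatal"},
}


@dataclass(frozen=True)
class CodedText:
    value: str
    value_set_id: Optional[str] = None  # constraint carried in instance data


def check(item: CodedText) -> str:
    """Classify an incoming item as 'valid', 'invalid' or 'unknown'."""
    if item.value_set_id is None:
        # No constraint in the instance: the receiver cannot tell which
        # set of values the sender had in mind.
        return "unknown"
    allowed = KNOWN_VALUE_SETS.get(item.value_set_id)
    if allowed is None:
        return "unknown"
    return "valid" if item.value in allowed else "invalid"


print(check(CodedText("severe", "severity-b")))  # valid
print(check(CodedText("severe", "severity-a")))  # invalid
print(check(CodedText("severe")))                # unknown
```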

From: Thomas Beale
Date: Mon, 08 Nov 2010 21:18:40 +0000
 
On 08/11/2010 18:51, Grahame Grieve wrote:
It seems remarkable to me that people think it's a problem that ISO 21090 needs to be profiled. Who would've guessed that a full standard that meets many requirements is simpler to implement if you profile out the features that reflect requirements you don't have? I'm pretty sure that this is true of every other standard as well. It's certainly true of all my implementations of W3C, IETF, and OMG standards.

I know that in HL7 this profiling is normal. The only kind of 'profile' I know of elsewhere in other standards is of the kind 'we only implement x, y but not z'. In other words, choosing a subset of classes or features to implement. As soon as one has to actually chop up the classes in a model however, we are on different ground. The answers Grahame gave me last time I discussed how to profile 21090 for 13606 use are here, about half-way down. As you can see, it was not 100% clear on a cursory inspection what exactly the profile version would look like. .... This means that official users of 13606, e.g. Sweden, can't actually use the standard out of the box, and do not have any official version to use until that work is done.
I happen to know that Sweden, Singapore and the UK have created at least 3 different 'profiles' of 21090 over time, all to suit their own needs. There is no guarantee that data or software built on these home-grown profiles will talk to each other, nor that any of them would talk with software or data built on the pure 21090 specification. So in fact, we have N pseudo-standards, and no real standard. This can't be anybody's idea of an easy way to get started with a data types standard.

… Note that I am not particularly making criticisms as if it were me personally trying to address the problems; I am mainly reflecting common responses from others, e.g. in government departments, universities and so on. There is no escaping from the fact that having a type called 'Any' representing a concept that should be called something like 'AnyDataValue' (in openEHR it is DV_ANY) is annoying and has to be dealt with in some way.

[It is sometimes said that] In health informatics, standards are done differently.

I have not been tracking other vertical industry ICT standards. But I did offer examples of 'stacks' of standards which do not follow the strange world of HL7 modelling. Everyone else uses normal OO modelling, or else something accepted like XML schema (admittedly terrible for object models, but that's another story); but HL7 can't (it instead tries to get OMG to change UML).  I fail to see why standards in e-health have to be done in such a bizarre way. There is nothing special about e-health requiring that.

From: Thomas Beale
Date: Tue, 09 Nov 2010 11:38:53 +0000

… RIM-based models are famously incomprehensible to people from all walks of life. Again, there are some people (including some clinicians) who understand them, and can author them, but they are a) not very intuitive and b) highly complex, for realistic examples. Due to the lack of basic data structures, e.g. the example of History/Events structure used in openEHR, such structures are avoided, or have to be manually created from Act / ActRelationship networks. The huge number of attribute nodes and code values also causes complications; I once calculated the value space of a single Act node with its 22 attributes to be 810 billion points. You can guess that the possible value space of a realistic RMIM is astronomical. This makes building models difficult. The traffic on the HL7 MnM list indicates the massive ongoing confusion around these models for a decade. If you don't believe me, try searching your archive simply for posts relating to 'context conduction'. If this modelling method were easy, everyone would be using it.

6 comments:

XML4Pharma said...

As an XML-guy, my major problem with all these 'standards' (HL7, ISO21090 and OpenEHR alike) is that they all start from UML modeling and auto-generate the XML-schemas from that, ignoring the advanced features of XML (such as native XML datatypes).
I have no problem with UML modeling, but I have major problems with the naive belief that one can generate high quality XML-Schemas automatically from UML diagrams.
Essentially, these 'standards' are abusing the XML standard itself. They even manage to reinvent basic datatypes such as integer, date, time, duration which are already defined by XML and XML-schema itself. For example, HL7- and ISO21090 'date' is expressed as YYYYMMDD, where there is already a base XML-schema datatype 'date' expressed as YYYY-MM-DD. So when validating an HL7-v3-message against the schema (and even against the schematron), the date 20070231 (February 31, 2007) is accepted as a correct date. If they had respected the XML standard instead of abusing it, then the same data in correct XML (2007-02-31) would have immediately been rejected by XML-Schema as being an invalid date. The same applies to many other HL7- and ISO21090 datatypes (e.g. date, time, datetime, duration). Each time, HL7 as well as ISO21090 'reinvents' these datatypes, making life much more complicated than is necessary.
I am one of the developers of the CDISC ODM standard used in clinical research. We also use UML for modeling, but our schemas are not automatically generated from the models; they are created manually. We use native XML datatypes as much as possible, not trying to reinvent the wheel. We try to make our standard so that instance files are understandable even by non-specialists when looking at the XML itself (so without stylesheet). They are also 'human-readable'. The latter cannot be said of any HL7- or ISO21090 instance file.
Some time ago, our team was asked to enable ISO21090 in ODM. The request came from one of the largest nonprofit research organizations in the US. We had a teleconf with them and it soon became clear that they wanted us to replace our own (XML-Schema based) datatypes by the ISO21090 datatypes. We refused.
Instead, we will develop an ODM-extension that allows ISO21090 data points (in their own namespace) to be attached to ODM 'ItemData' elements. Doing so, the ODM standard will support ISO21090 AND remain ODM.
Somewhat more than a year ago, I developed a stylesheet to extract information from HL7-CCD (health records) to prepopulate clinical forms in ODM format. This stylesheet was soon regarded as a key-enabler for integration between health records and clinical research. However, I cannot guarantee that it will work for ANY health record in CCD format. The reason is that CCD is so complex that I fear that one can put the same information (for example a systolic blood pressure) in very many ways in a CCD. So how can I guarantee that the systolic blood pressure can be extracted in all cases?
The ISO21090 'standard' is clearly a political compromise, not a technical compromise. As such, from the technical point of view it is probably not an improvement.
In my personal opinion, the best that can be done for a data standard for healthcare is to restart nearly from scratch. Yes, UML modeling can and should be used, but based on solid and agreed principles, and taking into account that XML will later be the transport format. So, no 'reinvention' of datatypes, but using the XML native datatypes right from the start. No fully-automated generation of XML-Schemas, but development of the schemas by schemawriters (though part of the work can be automated).
And most of all, involve all eligible players (HL7, ISO, OpenEHR, etc.) right from the start.
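
The date example in the comment above can be reproduced in a few lines of Python. This is only a sketch: the regular expression is an assumed stand-in for a pattern-style YYYYMMDD check, not the actual HL7 or ISO 21090 schema pattern, and `date.fromisoformat` stands in for a date-aware type such as the XML Schema 'date'.

```python
import re
from datetime import date

# Pattern-only check for YYYYMMDD: digits in plausible ranges, no calendar logic.
# Assumed for illustration; not the pattern used by any real HL7 schema.
YYYYMMDD = re.compile(r"\d{4}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])")


def pattern_accepts(text: str) -> bool:
    """Return True if the text merely looks like a YYYYMMDD date."""
    return YYYYMMDD.fullmatch(text) is not None


def is_real_calendar_date(text: str) -> bool:
    """Parse YYYY-MM-DD with calendar semantics, as a date-aware type would."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


print(pattern_accepts("20070231"))          # True: February 31 slips through
print(is_real_calendar_date("2007-02-31"))  # False: rejected as an invalid date
print(is_real_calendar_date("2007-02-28"))  # True
```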

Spero melior said...

Not to mention the fact that the YYYY-MM-DD format itself is an ISO standard (ISO 8601). I thought ISO standards had to be compatible with what came before them. How can ISO 21090 get away with changing YYYY-MM-DD? If we cannot even standardize date formats...

thomasbeale said...

Re the above comment, it is surprising that anyone would want to try to do any modelling in XML schema, if there are any inheritance relationships (which there will be if there is any reuse going on). XML schema is a disaster for this, as I confirmed on a recent consulting job. James Clark, designer of Relax NG, sees it as a design flaw (from http://www.thaiopensource.com/relaxng/design.html#section:15 ); also see http://www.xml.com/pub/a/2002/11/20/schemas.html?page=4#restriction .

The only way to use XSDs is in fact to generate them from proper object models, which implement proper inheritance semantics, and do all the real modelling in the object domain. This is not UML diagrams, but proper computational languages, like openEHR archetypes, Eclipse EMF and so on.

XML4Pharma said...

Thomas has misunderstood my post: we do not use XML-Schema for modelling. We do use proper object models (and sometimes use UML) for modelling and write the XML-Schemas from these models. If you download the XML-Schemas for CDISC ODM (http://www.cdisc.org/odm) you will see that there is a lot of inheritance and a great deal of reuse. That is because we write the schemas by hand from the object model. We also started writing schematrons for those rules that cannot be enforced by XML-Schema.
What we do however is use schema types for our base types as much as possible. So if we need a date, time or datetime, we say "well, we may as well use xs:date, xs:time, xs:dateTime for that". If we need a time period between two events or activities, we say: "we may as well use xs:duration for that". This instead of inventing our own date, time, datetime or duration format, as HL7 does.

What I pleaded for in my post is essentially two things:
a) if you automatically create schemas from UML, do not trust these schemas to be of high quality - you will very probably need to improve them by hand.
b) for your base types, use schema types as much as possible. Do not try to reinvent the wheel.

I admit that the model for CDISC ODM (limited to clinical research) is not as complex as the one needed for healthcare in general. So writing schemas by hand (as we do) may be challenging for healthcare information models. In such a case, a hybrid approach (auto-generated schemas that are then improved by XML-Schema specialists) may be suitable.
This of course requires schema specialists, which the HL7 organisation does not seem to have.

thomasbeale said...

We may have been at cross-purposes, apologies for my possible misunderstanding. I should have said that I don't see UML tools as being useful for serious generation of any output, whether code or XSD, and I suspect XML4Pharma thinks the same. Certainly in our experience with very well known tools, the output ranged from naive to completely broken (particularly where generic types are concerned).

I should also have added that I agree with the point about using the built-in types. If you are going to use a formalism (and XSD is unavoidable at some point), then you should use its built-in capabilities. Re the general problem with HL7 and ISO 21090 data types, see the openehr-technical mailing list for a long and ongoing thread on this - http://www.openehr.org/mailarchives/openehr-technical/threads.html . The basic problem is that HL7 just doesn't know how to do normal object modelling.