Open Rules for CDISC Standards as XQuery Web services

The FDA and CDISC have been publishing rules for validating submission datasets (SDTM, SEND, ADaM).
Unfortunately, these rules have been published as pure text (usually in Excel Worksheets), meaning that:

Each of the implementors can (and probably) will have a different interpretation of the rules
Implementors may mininterprete some of the rules, leading to incorrect implementations (e.g. leading to many false positives)
Most of the implementors implement the rules in software, but only the compiled code is made available - one cannot check whether the rules have been implemented correctly
Even when the source code was made public, it is very hard to find out how a rule was implemented, as one needs to know and understand the relations between the different classes, as the rules are not in a central repository

A better way is to publish such rules in an open, vendor-neutral, human-readable but machine-executable format. Such a format is XQuery.

XQuery is an open W3C standard for quering XML documents and databases.

To start with, we implemented almost all of the FDA-SDTM rules, the PMDA-SDTM rules, and the CDISC-SDTM rules (SDTM-IG 3.2) in XQuery, and have started to develop the XQueries for the "CDISC ADaM v1.3 Validation Checks" and for the SDTM rules defined by the "SDTM Validation Subteam".

As XQuery is a query language for XML, you can use and test the rules on Dataset-XML submission files. If you are still working with the outdated SAS Transport 5 format, you can easily transform your SAS Transport 5 files to Dataset-XML files with one of the tools provided by CDISC.

We also make a RESTful web service available for these rules written in XQuery, meaning that:

You (or better:your application) always get the latest version of the rules (the last update date is indicated)
You can query the repository for updates since a specific data
You can create your own repository and let it query the system for the latest updates
You can request the latest version of a rule directly from within your application without any human interventions

A presentation at the German-Speaking User Group meeting about the power of XQuery for really open SDTM/ADaM/SEND validation rules (mostly non-technical)

A movie (for techies - no sound yet) about how easy it is to implement ADaM Validation Rules in XQuery

The developed XQuery rules are currently purely EXPERIMENTAL, and should not yet be used for production

How it works

The best is to have your Dataset-XML files, including your define.xml file (v.2.0) in a native XML database such as eXist or BaseX. The reason is that you can then index your database so that your queries will run considerably faster.
If you don't have a native XML database or do not know how to work with such, you can also just use a file system (i.e. directory). We will explain in a moment how to proceed in that case.

Getting the XQuery SDTM rules using the web service

We have implemented the following RESTfull web services (many more will follow)

http://www.xml4pharmaserver.com:8080/CDISCSDSRulesService/rest/ruleAsXQuery/{ruleID}
For example: to retrieve the XQuery for the FDA-SDTM rule FDAC050 (Exposure end date may not be after the latest Disposition date) use:
http://www.xml4pharmaserver.com:8080/CDISCSDSRulesService/rest/ruleAsXQuery/FDAC050
Depending on whether you or your application did an HTTP "text" or "XML" request, the server will return the XQuery as text or as XML.
Or to retrieve the XQuery for the PMDA-SDTM rule SD1036 use:
http://www.xml4pharmaserver.com:8080/CDISCSDSRulesService/rest/ruleAsXQuery/SD1063
Or to retrieve the XQuery for the CDISC-SDTM rule CG0468 use:
http://www.xml4pharmaserver.com:8080/CDISCSDSRulesService/rest/ruleAsXQuery/CG0468

http://www.xml4pharmaserver.com:8080/CDISCSDSRulesService/rest/rulesUpdatedSince/{originator}/{standard}/{date}
For example: to retrieve all FDA-SDTM rules updates since June 1st, 2016, use:
http://www.xml4pharmaserver.com:8080/CDISCSDSRulesService/rest/rulesUpdatedSince/FDA/SDTM/2016-06-01
This kind or request will always return an XML document containing all XQuery rules updated on the server since June 1st 2016.
Or to retrieve all the updates of the CDISC SDTM rules sind March 1, 2017, use:
http://www.xml4pharmaserver.com:8080/CDISCSDSRulesService/rest/rulesUpdatedSince/CDISC/SDTM/2017-03-01

http://www.xml4pharmaserver.com:8080/CDISCSDSRulesService/rest/allRulesAsXQuery/{originator}/{standard}
in order to retrieve (the latest version of) ALL the currently developed rules as XQuery - which can be a lot (your browser might protest, so best do it from within an application)

http://www.xml4pharmaserver.com:8080/CDISCSDSRulesService/rest/rulesFromDomainOrClass/{originator}/{standard}/{domainOrClass}
Retrieves all rules for a specific domain or class (like "TA","AE","LB","FINDINGS","EVENTS","INTERVENTIONS")
For example: to retrieve all FDA-SDTM rules for the EX (exposure) domain, use:
http://www.xml4pharmaserver.com:8080/CDISCSDSRulesService/rest/rulesFromDomainOrClass/FDA/SDTM/EX
Similarly, to retrieve all CDISC-SDTM rules for the EC (exposure as collected) domain, use:
http://www.xml4pharmaserver.com:8080/CDISCSDSRulesService/rest/rulesFromDomainOrClass/CDISC/SDTM/EC
TODO: When using a class like "FINDINGS", the system will not find the rules yet that are specific for a single domain (like "LB"), only those for that are applicable to all domains in the class.
Or when you want to retrieve all the CDISC ADaM v1.3 Validation Checks regarding the BDS classes, use:
http://www.xml4pharmaserver.com:8080/CDISCSDSRulesService/rest/rulesFromDomainOrClass/CDISC/ADaM/BDS

XQuery explained

We will take the rule FDAC050 (Exposure end date may not be after the latest Disposition date) as an example. The screenshots you see come from eXide, the graphical user interface of the eXist native XML database, but you can of course also execute the XQuery rules using any modern application designed for it such as written in Java, C#, C++, ..., or you can easily write your own custom application, using one of the published XQuery libraries such as XQJ (XQuery Processor for Java).

Let us first inspect the first few lines of our XQuery:

Comments in XQuery start with an "(:" and end with a ":)". So the two first lines are just comments - they are not processed.

Line 3 states with which version of XQuery we are working with - all the rules were written using v.3.0
Lines 4-8 declare all the XML namespaces that we will be working with and their prefixes: you will immediately identify the ODM and the Define-XML namespace, as well as the Dataset-XML namespace.

The following lines are:

Line 12 defines where the base (directory or database collection) of all our XML documents is. In our case, we stored the XML documents in the eXist native XML database in the collection "/db/fda_submissions/cdisc01". Also the FDA could easily set up a native XML database and store each submission in a different "collection" - that would be much less expensive than the Janus warehouse! In case you work with a normal file system, the statement will be something like: "file:///d:/fda_submissions/cdisc01". It can however also be a URL like: http://mycompany.com/fda_submissions/cdisc01.

Line 13 is very important, as it defines the name of the define.xml file which is assumed to be in the base directory or collection together with all submission files. This is important, as the define.xml is "leading", i.e. all the information about where the submission files are located, how they are named, what variables with which datatype and length are present, which codelists are present, etc..
This is exactly how it should be the case, a fact otherwise often ignored by other validators

If you have written your own application to execute the XQuery rules, you will probably want to pass the base location and the name of the define.xml as parameters: this is explained (but commented out here) in lines 9-11.
We will later explain how one can set these values from a Java programm.

The following lines are:

In lines 15, it is read from the define.xml document where the document with data for the DS domain can be found (define.xml is leading!), and as this is a relative path, the full path is generated by concatenating the base and the DS document name.

The following lines are:

This a bit more difficult, as a subquery is used here ...
Essentially what is done here is that the ODM-OIDs of the variables 'USUBJID' and 'DSSTDTC' in the DS dataset are identified as, as you know, datapoints in Dataset-XML are identified with their OIDs, which are defined in the define.xml file (define.xml is leading!).
So the outcome for $dsusubjoid and $dsstdtcoid will be something like "DS.USUBJID" and "DS.DSSTDTC" depending on how they are defined in the define.xml document. These two variables will enable to find the data points in the Dataset-XML document.

The following lines are:

First, the location of the EX dataset is located from the define.xml (lines 28-30), and the OIDs of the USUBJID variable in EX is defined (it is not necessarily the same as for USUBJID in DS - lines 31-35).
Then the OID of the EXENDTC variable in EX is determined (line 36). A subquery is not necessary here, as the variable EXENDTC is expected to only appear in the EX doamin.

The following lines are:

The "for" statement (line 38) iterates over all the records in the EX dataset, where an "ItemGroupData" corresponds to a single record. Within each record, the record number is identified by the Dataset-XML "ItemGroupDataSeq" attribute (line 39).
Lines 41 and 42 then retrieve the value for USUBJID and EXENDTC from the "Value" attribute of the appropriate "ItemData" element (each "ItemData" represents a single data point).

This is followed by:

All DSSTDTC values for the selected subject are looked up in the DS dataset, and the highest (maximal) value is selected and stored in the variable $latestdate.

The following lines:

The EXENDTC is compared (as a date) with the value of $latestdate (which is the latest disposition date for the specific subject) in the "where" statement , and if EXENDTC is beyond the latest disposition date, an XML "warning" element is returned, containing information about the record number (in the EX dataset), the observed value for EXENDTC, the USUBJID and the latest disposition date. Remark that also the "last update date" of the rule itself is returned, allowing implementors to see whether the rule (or at least its implementation) were changed.

In our native XML database, the rule executes in about 0.2 seconds for a small submission, in 40 seconds for a large submission.

For more examples with explanations, see our "CDISC Validation Rules in XQuery"

Advantages of this approach

The advantage of having the validation rules as XQuery and using the web service are obvious:

All the rules are transparent: you can easily inspect the validation code and correct/improve it if you feel it is wrong (please let us know)
You can yourself decide for which rules you are doing the validation, and for which you don't (no "all or nothing")
Using the webservice, you always get the latest version of the rule (no need to "wait until the next release")
With a little XQuery knowledge, you can easily add your own company-specific rules (try that with other validation software...)

Disadvantages of this approach

There are many advantages of this approach (vendor neutral, open standard, human- as well as machine-readable, ...) but there are also a few disadvantages:

The approach only works with Dataset-XML, not with SAS Transport 5. The latter can however barely be designated as an open, vendor-neutral format. As however already stated, CDISC provides tools to convert into Dataset-XML
Each rule is in a separate file, and XQuery can be slower than compiled code, as the XQuery processor must first compile the XQuery. It is however also possible to store compiled XQuery in a repository or cache it. The effect of the latter can easily be seen when using eXide in eXist: each time the same query is executed, it is faster.

Implementation: passing parameters to the XQuery

Passing parameters from your own applicattion to the XQuery is explained in our "XQuery Implementation" page.

Running the XQueries using software

You will usually want to run the XQueries using a validation softare (similar to OpenCDISC). Writing such validation software is pretty straightforward, and is explained on our "XQuery Implementation" page.

Further remarks

We did use a native XML database (eXist) and indexed it using the OID attributes of ItemGroupData and ItemData
We did however not do many attempts yet to optimize the queries for speed. This is "work in progress"
You will not yet find ALL the FDA rules in our repository, also here this is "work in progress", we are regularly extending the set (remark that some of the FDA rules cannot be implemented as they are nonsense, "just damned wrong", not a rule (but an expectation), or not understandable...
Rules that are about MedDRA coding have been partially implemented. These rules allow to pass the location (folder) of your MedDRA files. See the XQuery Implementation website for more details.
Please use the XQuery rules, test them, improve them, optimize them and ... provide feedback. We hope to soon start a "real open community" where developers can add rule updates in such a way that they are immediately visible and available for everyone

The future

With the first set of rules (CDISC-FDA-PMDA-SDTM/ADaM) implemented as fully transparent, machine-readable and -executable XQuery rules, we have started implementing them in the XML4Pharma SDTM-ETL(TM) mapping software. Also, the rules have been implemented in the open-source "Smart Dataset-XML Viewer". This will enable pharma companies, CROs and service providers to get rid of the currently used validators that have non-transparent rules implemented, with many false-positives, and to which no company-specific rules can be added, at least not in the free version.

Courtesy of XML4Pharma - last update: March 2017